# Stdlib imports first, then third-party (bs4 / html2text / tqdm must be installed).
import os
import re

from bs4 import BeautifulSoup
import html2text
from tqdm.notebook import tqdm

# Shared HTML -> Markdown converter instance (referenced only by the
# commented-out alternative path in the original script; kept for parity).
converter = html2text.HTML2Text()
def clean_text(text):
    """Normalize whitespace, unescape HTML entities, and strip noise from *text*.

    Returns the cleaned string; raises UnicodeEncodeError if the result
    cannot be encoded as UTF-8 (the final round-trip acts as a validity check).
    """
    # Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    text = ' '.join(text.split())
    # Unescape HTML entities (e.g. "&amp;" -> "&") via the html.parser backend.
    text = BeautifulSoup(text, 'html.parser').text
    # Strip each line and drop blank lines.
    # NOTE(review): after the whitespace collapse above, the input is a single
    # line, so this mostly strips the one remaining line — kept for parity.
    text = '\n'.join(line.strip() for line in text.splitlines() if line.strip())
    # Remove known noise text (placeholder pattern to be customized per corpus).
    text = re.sub(r'Some unwanted text or pattern', '', text)
    # UTF-8 round-trip: a no-op for valid text, raises on unencodable content.
    text = text.encode('utf-8').decode('utf-8')
    return text
def extract_text_from_html(file_path):
    """Read the HTML file at *file_path* and return cleaned text content.

    Extracts text from every <p> and <div> tag, cleans each chunk with
    clean_text(), and returns them joined, one chunk per line (each chunk
    is newline-terminated).
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, 'html.parser')
    # Paragraph-like tags carry the article body in these pages.
    # NOTE(review): <div> often nests <p>, so text may appear twice — confirm
    # this duplication is acceptable for the downstream dataset.
    paragraphs = soup.find_all(['p', 'div'])
    # Build with a list + join instead of quadratic string +=.
    chunks = [clean_text(para.get_text().strip()) for para in paragraphs]
    return ''.join(chunk + '\n' for chunk in chunks)
# Walk the HTML directory and extract text from every .html file into
# a list of {"text": ...} records (one record per file).
html_directory = '../data/html/'
all_texts = []

for filename in tqdm(os.listdir(html_directory)):
    if filename.endswith('.html'):
        file_path = os.path.join(html_directory, filename)
        text = extract_text_from_html(file_path)
        all_texts.append({"text": text})
# Save the data (保存数据)
import json

# Write the collected records to disk. indent=2 puts each object on its own
# indented lines; ensure_ascii=False keeps non-ASCII characters readable
# in the output file instead of \uXXXX escapes.
with open('nyu_data.json', 'w', encoding='utf-8') as file:
    json.dump(all_texts, file, ensure_ascii=False, indent=2)