Fine-tuning Qwen2 with LLaMA-Factory

Note: some of the code and techniques in this post may be out of date. Last updated: 5 months ago.

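llama-factory

llama-factory is a tool for training and fine-tuning large language models. It supports a wide range of models and training methods.

Project: https://github.com/hiyouga/LLaMA-Factory/blob/main/README_zh.md

The test data comes from the datasets bundled with llama-factory: https://github.com/hiyouga/LLaMA-Factory/tree/main/data

Training code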

# --model_name_or_path: path to the base model
# --dataset_dir: data directory; --dataset: alpaca_zh_demo is just an arbitrary dataset picked for testing
# --output_dir: where the adapter and training logs are written
# (comments are kept up here because a comment after a trailing backslash breaks the line continuation)
llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path ~/model/llm/qwen2/Qwen2-7B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --quantization_method bitsandbytes \
    --template qwen \
    --flash_attn auto \
    --dataset_dir ./data \
    --dataset alpaca_zh_demo \
    --cutoff_len 1024 \
    --learning_rate 0.0001 \
    --num_train_epochs 2.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 20 \
    --save_steps 500 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir ./train_test \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --include_num_input_tokens_seen True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
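
To make the batch-related flags above concrete, here is a small sketch of how they combine into an effective batch size and a rough step count. The single GPU and the dataset size of 1000 examples are assumptions for illustration, not values taken from the run, and the step count is only approximate because the exact rounding depends on the trainer version.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

def approx_optimizer_steps(num_samples, per_device_batch, grad_accum_steps,
                           num_epochs, num_gpus=1):
    """Rough number of optimizer steps for the whole run (rounded up per epoch)."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_gpus)
    return math.ceil(num_samples / batch) * int(num_epochs)

# From the command above: batch size 2, 4 accumulation steps, 2 epochs.
# num_samples=1000 and num_gpus=1 are assumed for illustration only.
batch = effective_batch_size(2, 4)
steps = approx_optimizer_steps(1000, 2, 4, 2.0)
print(batch, steps)
```

Under these assumed numbers the run would take fewer optimizer steps than `--save_steps 500`, so an intermediate checkpoint would not be expected before training finishes; lowering `--save_steps` is one way to get earlier checkpoints on small datasets.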

Merge the base model with the adapter weights (the training output directory)

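# --model_name_or_path: base model
# --adapter_name_or_path: fine-tuned adapter (the training output directory)
# --export_dir: directory where the merged model is saved
llamafactory-cli export \
    --model_name_or_path ~/model/llm/qwen2/Qwen2-7B-Instruct \
    --adapter_name_or_path ./train_test \
    --template qwen \
    --finetuning_type lora \
    --export_dir ./merge \
    --export_size 2 \
    --export_legacy_format False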
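
Conceptually, merging folds each low-rank adapter back into the weight matrix it was attached to: W' = W + (alpha / rank) * B A. The sketch below shows that arithmetic on a toy 2x2 matrix in plain Python. It is an illustration of the standard LoRA formula, not the code the export command actually runs, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, rank, alpha):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Toy layer with a rank-1 adapter. The training command uses lora_rank 8 and
# lora_alpha 16, which gives the same scale factor of 2 as rank=1, alpha=2 here.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]      # shape (out_features, rank)
lora_a = [[0.5, -0.5]]       # shape (rank, in_features)
merged = merge_lora(weight, lora_a, lora_b, rank=1, alpha=2)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

After merging, the adapter is no longer needed at inference time, which is why the merged directory can be loaded on its own.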

Test inference

llamafactory-cli chat inference_lora_sft.yaml

The contents of inference_lora_sft.yaml:

model_name_or_path: ./merge
#adapter_name_or_path: ./train_test ## uncomment to run inference without merging (model_name_or_path must then point to the base model)
template: qwen
finetuning_type: lora

Convert to GGUF with llama.cpp

python convert-hf-to-gguf.py ./merge --outfile test-fp16.gguf

Quantization

llama-quantize test-fp16.gguf q4_0.gguf q4_0
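
As a rough guide to what the q4_0 step buys, the sketch below estimates file sizes from bits per weight. The q4_0 format stores weights in blocks of 32, each block holding 16 bytes of 4-bit values plus a 2-byte scale, which works out to 4.5 bits per weight. The parameter count of 7.6 billion is an assumption for illustration, and real GGUF files can be somewhat larger because some tensors may be kept at higher precision.

```python
def model_size_gib(num_params, bits_per_weight):
    """Approximate size in GiB of a model stored at the given bits per weight."""
    return num_params * bits_per_weight / 8 / 1024**3

# q4_0 block: 32 weights -> 16 bytes of 4-bit quants + 2-byte fp16 scale.
Q4_0_BITS = (16 + 2) * 8 / 32
FP16_BITS = 16

NUM_PARAMS = 7.6e9  # assumed parameter count, for illustration only

fp16_size = model_size_gib(NUM_PARAMS, FP16_BITS)
q4_size = model_size_gib(NUM_PARAMS, Q4_0_BITS)
print(f"fp16: {fp16_size:.1f} GiB, q4_0: {q4_size:.1f} GiB, "
      f"ratio: {fp16_size / q4_size:.2f}x")
```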

Deploy with vLLM

# pip install vllm
python -m vllm.entrypoints.openai.api_server --model merge

Test

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "merge",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me something about large language models."}
  ],
  "temperature": 0.7,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "max_tokens": 512
}'
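
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the function names `build_chat_request` and `chat` are hypothetical helpers, and it assumes the server is reachable at `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request

def build_chat_request(prompt, model="merge", base_url="http://localhost:8000"):
    """Build a POST request with the same payload as the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.7,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
#   print(chat("Tell me something about large language models."))
```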

Continued pre-training

llamafactory-cli train qwen2_lora_pretrain.yaml

The contents of qwen2_lora_pretrain.yaml:

### model
model_name_or_path: /model/llm/qwen2/Qwen2-7B-Instruct/

### method
stage: pt
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: nyu_data
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: train_qwen2
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500
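
The `lr_scheduler_type: cosine` and `warmup_ratio: 0.1` settings describe a learning rate that ramps up linearly and then decays along half a cosine wave. The sketch below reproduces that shape for the configured peak of 1.0e-4. The total of 300 steps is an assumed value for illustration, and the scheduler inside the training library may differ in small rounding details.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=1.0e-4, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then half-cosine decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 300  # assumed for illustration; depends on dataset size and batch settings

for step in (0, 15, 30, 165, 300):
    print(step, f"{cosine_lr_with_warmup(step, TOTAL_STEPS):.2e}")
```

With these assumptions the rate rises from 0 to the peak over the first 30 steps, passes roughly half the peak at step 165, and reaches 0 at the final step.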

PS: Data preparation

from bs4 import BeautifulSoup
import os
import re
import html2text
from tqdm.notebook import tqdm  # outside a notebook, use `from tqdm import tqdm`

converter = html2text.HTML2Text()  # only used by the commented-out alternative below

def clean_text(text):
    # Collapse runs of whitespace into single spaces
    text = ' '.join(text.split())

    # Decode HTML entities
    text = BeautifulSoup(text, 'html.parser').text

    # Normalize line breaks and drop blank lines
    text = '\n'.join(line.strip() for line in text.splitlines() if line.strip())

    # Remove specific noise text, if any (placeholder pattern)
    text = re.sub(r'Some unwanted text or pattern', '', text)

    # Make sure the text round-trips through UTF-8
    text = text.encode('utf-8').decode('utf-8')

    return text

def extract_text_from_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all paragraphs or other suitable HTML tags.
    # Note: a <p> nested inside a <div> is visited twice, so its text can repeat.
    paragraphs = soup.find_all(['p', 'div'])

    content = ''
    for para in paragraphs:
        text = para.get_text().strip()
        text = clean_text(text)
        content += text + '\n'
    # Alternatives:
    # text = converter.handle(html_content)
    # soup = BeautifulSoup(html_content, 'html.parser')
    # text = ' '.join([p.get_text() for p in soup.find_all('p')])
    return content

html_directory = '../data/html/'
all_texts = []

for filename in tqdm(os.listdir(html_directory)):
    if filename.endswith('.html'):
        file_path = os.path.join(html_directory, filename)
        text = extract_text_from_html(file_path)
        # all_texts.append(text)
        all_texts.append({
            "text": text
        })
        # break

Save the data

import json

with open('nyu_data.json', 'w', encoding='utf-8') as file:
    # indent=2 pretty-prints the output with a 2-space indent, one field per line
    json.dump(all_texts, file, ensure_ascii=False, indent=2)
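
Before training on the saved file, it can help to check it for empty records, since an HTML page without any matching tags yields an empty `text` field. The sketch below is one possible sanity check. The helper names are hypothetical, and the in-memory sample stands in for the real file so the example runs on its own.

```python
import json

def load_records(path):
    """Load a list of {"text": ...} records from a JSON file."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)

def drop_empty_records(records):
    """Keep only records whose "text" field is a non-blank string."""
    return [r for r in records
            if isinstance(r.get("text"), str) and r["text"].strip()]

def summarize(records):
    """Basic statistics over the text lengths, in characters."""
    lengths = [len(r["text"]) for r in records]
    return {"count": len(lengths),
            "total_chars": sum(lengths),
            "longest": max(lengths, default=0)}

# Stand-in data; for the real file use: records = load_records("nyu_data.json")
sample = [{"text": "Hello world"}, {"text": "   "}, {"text": ""}, {"text": "第二篇文档"}]
kept = drop_empty_records(sample)
print(summarize(kept))  # {'count': 2, 'total_chars': 16, 'longest': 11}
```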

Place the data file under data/ and register it in dataset_info.json, for example:

{
  "c4_demo": {
    "file_name": "c4_demo.json",
    "columns": {
      "prompt": "text"
    }
  },
  "nyu_data": {
    "file_name": "nyu_data.json",
    "columns": {
      "prompt": "text"
    }
  }
}
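
Editing dataset_info.json by hand works fine, but the entry can also be added with a few lines of Python. The sketch below writes an entry in the same shape as the example above; `register_pretrain_dataset` is a hypothetical helper, and the demo writes to a temporary directory so it does not touch a real data/ folder.

```python
import json
import tempfile
from pathlib import Path

def register_pretrain_dataset(info_path, name, file_name, text_column="text"):
    """Add or overwrite a dataset entry that maps the prompt to a text column."""
    path = Path(info_path)
    info = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    info[name] = {"file_name": file_name, "columns": {"prompt": text_column}}
    path.write_text(json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8")
    return info

# Demo against a temporary file; point info_path at data/dataset_info.json for real use.
with tempfile.TemporaryDirectory() as tmp:
    info_file = Path(tmp) / "dataset_info.json"
    info = register_pretrain_dataset(info_file, "nyu_data", "nyu_data.json")
    saved = json.loads(info_file.read_text(encoding="utf-8"))
    print(json.dumps(saved, indent=2))
```

Existing entries in the file are preserved because the helper loads the current contents before adding the new key.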

References:
https://qwen.readthedocs.io/zh-cn/latest/training/SFT/llama_factory.html
https://qwen.readthedocs.io/zh-cn/latest/quantization/gguf.html
https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html

