Installation

For a bare-metal installation, you need to:

  • Install GCC
  • Install CUDA
  • Install the GPU driver

A container-based installation is recommended:

  • The GPU driver still has to be installed on the host first.
  • PyTorch Release 24.02, the PyTorch image provided by NVIDIA, about 22 GB.
  • It bundles the NVIDIA-related tooling.
  • The image release notes describe its contents in detail.

Start it with docker compose:

cd docker/docker-cuda/
docker compose up -d
docker compose exec llamafactory bash

Three types of GPU are supported:

  • NVIDIA
  • AMD
  • Huawei (Ascend)

When you run

docker compose up -d

this step also downloads a number of packages, so the download sources need to be pointed at a mirror.
In the Dockerfile, change the pip index to the following address:

ARG PIP_INDEX=https://pypi.tuna.tsinghua.edu.cn/simple

Changes to docker-compose.yml:

services:
  llamafactory:
    build:
      dockerfile: ./docker/docker-cuda/Dockerfile
      context: ../..
      args:
        INSTALL_BNB: false
        INSTALL_VLLM: false
        INSTALL_DEEPSPEED: false
        INSTALL_FLASHATTN: false
        INSTALL_LIGER_KERNEL: false
        INSTALL_HQQ: false
        INSTALL_EETQ: false
        PIP_INDEX: https://pypi.tuna.tsinghua.edu.cn/simple
    container_name: llamafactory
    volumes:
      - ../../hf_cache:/root/.cache/huggingface
      - ../../ms_cache:/root/.cache/modelscope
      - ../../om_cache:/root/.cache/openmind
      - ../../data:/app/data
      - ../../output:/app/output
      - /data2/models:/data
...

As shown above, PIP_INDEX was changed and an extra volume was added, mapping /data2/models on the host to /data in the container.

After entering the container, run the following command to start the web UI:

llamafactory-cli webui

Fine-tuning

Launch from the command line:

llamafactory-cli train \
    --stage sft \
    --do_train True \
    --model_name_or_path /data/Qwen2.5-0.5B-Instruct \
    --preprocessing_num_workers 16 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_zh_demo \
    --cutoff_len 2048 \
    --learning_rate 5e-05 \
    --num_train_epochs 3.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 0 \
    --packing False \
    --report_to none \
    --output_dir saves/Qwen2.5-0.5B-Instruct/lora/train_2024-12-16-05-58-02 \
    --bf16 True \
    --plot_loss True \
    --ddp_timeout 180000000 \
    --optim adamw_torch \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0 \
    --lora_target all
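
Note that the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × the number of GPUs. A quick sanity check in Python (single-GPU training is an assumption here):

# Effective batch size implied by the flags above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 1  # assumption: one GPU

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 16 samples per optimizer step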

With a custom dataset:

llamafactory-cli train \
     --stage sft \
     --do_train True \
     --model_name_or_path /data/Qwen2.5-0.5B-Instruct \
     --preprocessing_num_workers 16 \
     --finetuning_type lora \
     --template qwen \
     --flash_attn auto \
     --dataset_dir /app/saves/data/Qwen2.5-0.5B-Instruct/2024-12-23_10_10_51 \
     --dataset cyber_tuning \
     --cutoff_len 2048 \
     --learning_rate 5e-05 \
     --num_train_epochs 3.0 \
     --max_samples 100000 \
     --per_device_train_batch_size 2 \
     --gradient_accumulation_steps 8 \
     --lr_scheduler_type cosine \
     --max_grad_norm 1.0 \
     --logging_steps 5 \
     --save_steps 100 \
     --warmup_steps 0 \
     --packing False \
     --report_to none \
     --output_dir /app/saves/data/Qwen2.5-0.5B-Instruct/2024-12-23_10_10_51 \
     --bf16 True \
     --plot_loss True \
     --ddp_timeout 180000000 \
     --optim adamw_torch \
     --lora_rank 10 \
     --lora_alpha 80 \
     --lora_dropout 0 \
     --lora_target all \
     --use_rslora True
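
For a custom dataset, LLaMA-Factory resolves the name passed to --dataset through a dataset_info.json file inside --dataset_dir. A minimal sketch of registering cyber_tuning (the file name cyber_tuning.json and the Alpaca-style column mapping are assumptions about how the data was exported):

import json
import os

dataset_dir = "/app/saves/data/Qwen2.5-0.5B-Instruct/2024-12-23_10_10_51"

# Register the dataset under the name used with --dataset.
info = {
    "cyber_tuning": {
        "file_name": "cyber_tuning.json",  # assumed Alpaca-format file in dataset_dir
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    }
}
with open(os.path.join(dataset_dir, "dataset_info.json"), "w", encoding="utf-8") as f:
    json.dump(info, f, ensure_ascii=False, indent=2)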

Detailed analysis

Instruction formats

There are two instruction formats:

  • Alpaca
  • ShareGPT

Comparison

| Feature | Alpaca Format | ShareGPT Format |
| --- | --- | --- |
| Purpose | Instruction-following | Conversational AI |
| Structure | Instruction, Input (optional), Output | Multi-turn dialogue (user-assistant messages) |
| Flexibility | Handles single-turn, structured tasks | Handles multi-turn, natural conversations |
| Example Use Case | Chatbot following commands (e.g., write a story) | Multi-turn Q&A chatbot (e.g., customer support) |

Alpaca format example

[
  {
    "instruction": "识别并解释给定列表中的两个科学理论:细胞理论和日心说。",
    "input": "",
    "output": "细胞理论是生物科学的一个理论,它认为所有生命体都是由微小的基本单元——细胞所构成。这是生物学的一个基础理论,认为>细胞是所有生物的基本结构和功能单位,所有的生物都是由一个或多个细胞组成,细胞只能通过细胞分裂产生新的细胞。这一理论由薛定谔、施
瓦内和雪莱于1839年首次提出。\n\n日心说是指太阳是太阳系的中心,也就是说,行星围绕太阳旋转的理论。这个理论打破了传统的地心说观点
,认为地球并不是宇宙的中心。日心说的提出者是尼古拉·哥白尼,他在16世纪初发表了他的著作《天体运行论》,阐述了太阳系行星围绕太阳>运行的模型,为天文学的发展做出了巨大贡献。"
  },
  {
    "instruction": "输入三支篮球队的名称并生成一个适当的口号。",
    "input": "输入:俄克拉荷马城雷霆队,芝加哥公牛队,布鲁克林网队。",
    "output": "输出: \"雷霆,公牛和网队:各显神通,角逐群雄!\""
  },
  {
    "instruction": "比较以下两位著名人物,使用特征列表并以句子形式提供输出。",
    "input": "人物1:阿尔伯特·爱因斯坦\n人物2:斯蒂芬·霍金\n特征:智力、教育、对世界的影响",
    "output": "阿尔伯特·爱因斯坦和斯蒂芬·霍金都是拥有极其出色智力的人物。两人都取得过非常高的教育成就,他们推进了科学发展并在>世界范围内产生了深远的影响。爱因斯坦以其相对论和质能关系公式而闻名,而霍金以其关于黑洞和宇宙的发现而著称。两位科学家都以其深厚
的学识和非凡的贡献影响了世界。"
  },
  {
    "instruction": "请根据以下文章构建一个需要回答的问题。",
    "input": "",
    "output": "我不好意思,您没有提供任何信息或文章来供我根据它构建问题。 请提供相应的文章或信息,以便我能够帮助您构建问题。"
  },
  {
    "instruction": "写一个关于上周末你做了什么的简短叙述。",
    "input": "",
    "output": "由于我是一个人工智能助手,所以我并不能真正的“做”什么事情。我的“工作”时间是24/7无休,确保我随时可供用户提供帮助>和服务。所以在上周末,我一直都在“工作”,在响应用户的指令,帮助他们查询信息,执行操作,解决问题。"
  }
]
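
At training time the instruction and input fields are combined into the prompt and output becomes the target completion; the exact wrapping is controlled by --template. A simplified sketch of the pairing (this ignores the chat template applied on top):

def alpaca_to_pair(record: dict) -> tuple[str, str]:
    """Collapse an Alpaca-format record into a (prompt, response) pair."""
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n" + record["input"]
    return prompt, record["output"]

prompt, response = alpaca_to_pair({
    "instruction": "输入三支篮球队的名称并生成一个适当的口号。",
    "input": "输入:俄克拉荷马城雷霆队,芝加哥公牛队,布鲁克林网队。",
    "output": "输出: \"雷霆,公牛和网队:各显神通,角逐群雄!\"",
})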

ShareGPT format example

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "Hi, I have some ingredients and I want to cook something. Can you help me find a recipe?"
      },
      {
        "from": "gpt",
        "value": "Of course! I can help you with that. Please tell me what ingredients you have."
      },
      {
        "from": "human",
        "value": "I have chicken, bell peppers, and rice."
      },
      {
        "from": "function_call",
        "value": "{\"name\": \"search_recipes\", \"arguments\": {\"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"]}}"
      },
      {
        "from": "observation",
        "value": "{\"recipes\": [{\"name\": \"Chicken and Bell Pepper Stir Fry\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice.\"}, {\"name\": \"Chicken and Rice Casserole\", \"ingredients\": [\"chicken\", \"bell peppers\", \"rice\"], \"instructions\": \"Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown.\"}]}"
      },
      {
        "from": "gpt",
        "value": "I found two recipes for you. The first one is \"Chicken and Bell Pepper Stir Fry\". The instructions are: Cut the chicken into small pieces. Slice the bell peppers. Cook the rice. Stir fry the chicken and bell peppers. Serve over rice. The second one is \"Chicken and Rice Casserole\". The instructions are: Cook the chicken and rice separately. Mix them together with the bell peppers in a casserole dish. Bake until golden brown. Which one would you like to try?"
      },
      {
        "from": "human",
        "value": "They both sound delicious, but I think I'll try the stir fry. Can you order the ingredients for me?"
      },
      {
        "from": "gpt",
        "value": "I'm sorry, but as an AI, I don't have the capability to perform external tasks such as ordering ingredients. However, I can help you find more recipes or provide cooking tips if you need."
      }
    ],
    "tools": "[{\"name\": \"search_recipes\", \"description\": \"Search for recipes based on ingredients\", \"parameters\": {\"type\": \"object\", \"properties\": {\"ingredients\": {\"type\": \"array\", \"items\": {\"type\": \"string\"}, \"description\": \"The ingredients to search for\"}}, \"required\": [\"ingredients\"]}}]"
  }
]
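
The from field uses ShareGPT's role names. Mapping them onto the usual chat-message roles makes the structure easier to see; the function_call/observation mapping below follows the common tool-calling convention and is an assumption, not LLaMA-Factory's exact internal representation:

# Map ShareGPT roles onto generic chat-message roles.
ROLE_MAP = {
    "human": "user",
    "gpt": "assistant",
    "function_call": "assistant",  # the assistant emitting a tool call
    "observation": "tool",         # the tool result fed back to the model
}

def sharegpt_to_messages(sample: dict) -> list[dict]:
    """Flatten a ShareGPT sample into role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in sample["conversations"]
    ]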

Three fine-tuning modes

| Aspect | LoRA (Low-Rank Adaptation) | Freeze | Full Fine-Tuning |
| --- | --- | --- | --- |
| How It Works | Adds small trainable low-rank matrices to specific layers (e.g., attention heads) while keeping the base model frozen. | Freezes most of the model's layers and only fine-tunes specific layers (e.g., final layer or task-specific heads). | Updates all the parameters of the model, training the entire model for the new task. |
| Trainable Parameters | Introduces new parameters (low-rank matrices), significantly reducing the number of trainable parameters. | Trains only the selected layers, while the rest of the model remains unchanged. | All parameters are trainable, requiring large computational resources. |
| Memory/Compute Usage | Very efficient; requires far less memory and compute compared to full fine-tuning. | Moderate efficiency; requires less memory than full fine-tuning but more than LoRA. | High memory and compute usage, as the entire model is updated. |
| Base Model Dependency | The base model remains fully intact; LoRA layers can be added or removed modularly. | The base model is partially frozen, and updates are made to selected layers. | The base model is entirely modified, making it task-specific. |
| Modularity | Highly modular; you can share LoRA weights separately without altering the base model. | Less modular than LoRA; trained layers are not as easily reusable across tasks. | Not modular; the fine-tuned model is specific to the task. |
| Flexibility | Designed for specific parts of the model (e.g., attention layers), limiting flexibility. | Highly flexible; any layer or combination of layers can be chosen for fine-tuning. | Fully flexible; every parameter is trainable. |
| Fine-Tuning Speed | Very fast; only a small number of parameters are optimized. | Moderate speed; faster than full fine-tuning but slower than LoRA. | Slow; training the entire model requires significant time. |
| Storage Requirements | Stores only the additional LoRA parameters, which are very small in size. | Saves the modified layers; smaller storage requirements than full fine-tuning. | Requires storing the entire updated model, consuming significant storage. |
| Resource Efficiency | Extremely efficient for large models (e.g., LLaMA, GPT); can work with limited hardware. | Less efficient than LoRA but more efficient than full fine-tuning. | Requires substantial hardware and resources. |
| Pretrained Knowledge Preservation | Excellent; the base model is untouched, so pre-trained knowledge is preserved. | Good; frozen layers retain pre-trained knowledge, while tuned layers adapt to the new task. | Risk of overfitting or catastrophic forgetting of pre-trained knowledge. |
| Use Case Examples | Fine-tuning very large models on specific tasks with limited resources. | Adapting the model to new tasks by tuning only a few task-relevant layers. | Training the model for a completely new domain or task with ample resources. |
| Common Scenarios | Fine-tuning large LLMs like GPT, LLaMA, or BERT on specific downstream tasks. | Updating final layers for classification or regression tasks with pre-trained features. | Building a domain-specific model from scratch using pre-trained weights. |
| Key Advantages | Highly efficient; small storage overhead; easy to switch tasks by loading specific LoRA weights. | Flexible in selecting which layers to train; retains much of the pre-trained knowledge. | Full control over the model's behavior; best for completely transforming the model. |
| Key Disadvantages | Limited to parts of the model where LoRA is implemented; may not be sufficient for complex adaptations. | Requires careful selection of layers to tune; may not fully adapt the model to challenging tasks. | Expensive in terms of memory, compute, and storage; risk of overfitting. |
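
For reference, the LoRA column corresponds to what the peft library does when wrapping a base model; a minimal sketch mirroring the CLI flags used earlier (--lora_rank 8 --lora_alpha 16 --lora_dropout 0 --lora_target all):

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "/data/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16
)
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",  # peft's counterpart of --lora_target all
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the low-rank matrices are trainable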

Several settings

量化等级 (Quantization Level)
Quantization level refers to the precision of numerical representations used during training or inference. Lower precision reduces model size and computation requirements.

  • Options:

    • none: No quantization is applied; the model uses full precision (e.g., FP32 or FP16). This results in the highest accuracy but requires more memory and compute.
    • 4-bit/8-bit: Reduces numerical precision to 4-bit or 8-bit integers or floats, significantly lowering memory and compute usage while sacrificing a small amount of accuracy.
  • Purpose:

    • Reduce memory and computational requirements for large models.
    • Enable deployment on resource-constrained hardware like GPUs with limited memory.

量化方法 (Quantization Method)
Quantization methods define how the quantization process is applied to the model’s weights or activations.

  • Options:

    • bitsandbytes:
      • A widely used library for efficient quantization.
      • Supports 4-bit and 8-bit quantization for large models with negligible performance degradation.
      • Commonly used with LLMs like GPT and LLaMA (a loading sketch follows this section).
    • hqq:
      • HQQ (Half-Quadratic Quantization), a calibration-free quantization method.
      • Quantizes weights quickly at various bit widths without needing calibration data.
  • Purpose:

    • Reduce model size during training or inference.
    • Allow for high-speed computations on hardware with limited precision support (e.g., GPUs with Tensor Cores).
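
As a concrete example of the bitsandbytes path, transformers exposes 4-bit loading through BitsAndBytesConfig; a minimal sketch (the NF4 quant type and bf16 compute dtype are common choices, not mandated by LLaMA-Factory):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantization level: 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "/data/Qwen2.5-0.5B-Instruct", quantization_config=bnb_config
)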

RoPE插值方法 (RoPE Interpolation Method)
RoPE (Rotary Position Embedding) is a technique used in transformer models to encode positional information.

  • Options:

    • none:
      • No interpolation applied; uses standard rotary embeddings as defined during training.
      • Suitable for fixed-length input sequences.
    • linear:
      • Applies linear interpolation to extend the range of rotary embeddings.
      • Useful for models trained on shorter sequences but used for longer sequences during inference (a numerical sketch follows this list).
    • dynamic:
      • Dynamically adjusts rotary embeddings based on the sequence length or other factors.
      • Provides more flexibility but may introduce overhead.
  • Purpose:

    • Enhance the model’s ability to process sequences of varying lengths.
    • Adapt rotary embeddings to new contexts or longer sequences.
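
Numerically, linear interpolation just rescales positions so a longer sequence reuses the angle range seen during training. A small sketch (head dimension, base, and scale are illustrative values):

import math

dim, base = 64, 10000.0  # illustrative head dim and RoPE base
scale = 4.0              # e.g., stretch a 2k-context model to ~8k

def rope_angle(pos: int, i: int, interpolate: bool) -> float:
    """Rotary angle for position `pos` at frequency index `i`."""
    inv_freq = base ** (-2 * i / dim)
    p = pos / scale if interpolate else pos
    return p * inv_freq

# Position 8000 with interpolation maps to the same angle as position 2000 without it.
assert math.isclose(rope_angle(8000, 3, True), rope_angle(2000, 3, False))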

加速方法 (Acceleration Method)
Acceleration methods refer to optimizations used to speed up model inference or training by leveraging efficient algorithms or hardware-specific features.

  • Options:

    • auto:
      • Automatically selects the best acceleration method based on the hardware and framework configuration.
    • flashattn2:
      • Flash Attention 2, an optimized attention implementation that improves speed and memory usage for transformer models (a loading sketch follows this list).
      • Ideal for large-scale models and long sequences.
    • unsloth:
      • Unsloth, a library of hand-optimized kernels for faster, more memory-efficient LoRA fine-tuning.
    • liger_kernel:
      • Liger Kernel, a set of Triton kernels for LLM training that fuses common operations (e.g., RMSNorm, RoPE, cross-entropy loss).
      • Reduces memory use and increases training throughput.
  • Purpose:

    • Improve training and inference speed.
    • Reduce memory consumption without affecting accuracy.
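
When flashattn2 is selected, the model is ultimately loaded with the Flash Attention 2 implementation that transformers exposes; a minimal sketch (requires the flash-attn package and a supported GPU):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/data/Qwen2.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # fused attention kernels
)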

Summary of the settings

| Feature | Purpose | Key Options |
| --- | --- | --- |
| 量化等级 | Control model precision and resource usage. | none, 4-bit, 8-bit |
| 量化方法 | Define the quantization algorithm. | bitsandbytes, hqq |
| RoPE插值方法 | Adjust rotary embeddings for flexibility. | none, linear, dynamic |
| 加速方法 | Optimize training/inference for speed. | auto, flashattn2, unsloth, liger_kernel |

LoRA

LoRA 秩 (LoRA Rank)

  • Meaning: Determines the rank of the low-rank decomposition matrices used in LoRA.
  • Impact:
    • A higher rank increases the capacity of LoRA, allowing it to learn more complex patterns but uses more memory.
    • A lower rank makes the model more lightweight but might limit its adaptability.
  • Default: Often set between 4 and 16 depending on the model and hardware.

LoRA 缩放系数 (LoRA Scaling Factor)

  • Meaning: A scaling factor applied to the LoRA parameters to adjust their contribution to the model.
  • Impact:
    • Higher scaling factors emphasize LoRA’s contribution.
    • Lower scaling factors blend LoRA more subtly with the base model.
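
Rank and scaling factor work together: the adapted weight is W + (alpha/r)·B·A, and each adapted d_out × d_in matrix adds r·(d_out + d_in) trainable parameters. A quick calculation (the 896 hidden size for Qwen2.5-0.5B is an assumption):

d_out = d_in = 896  # assumed hidden size of Qwen2.5-0.5B
r, alpha = 8, 16    # values from the first training command above

lora_params = r * (d_out + d_in)  # parameters added by one adapted matrix
full_params = d_out * d_in        # parameters in the frozen base matrix
scaling = alpha / r               # multiplier applied to the low-rank update

print(lora_params, full_params, scaling)  # 14336 802816 2.0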

LoRA 随机丢弃 (LoRA Dropout)

  • Meaning: Adds dropout regularization to the LoRA weights during training.
  • Purpose:
    • Helps prevent overfitting.
    • Improves generalization of the fine-tuned model.
  • Range: Typically between 0 (no dropout) and 0.5.

LoRA+ 学习率比例 (LoRA+ Learning Rate Scale)

  • Meaning: Sets the ratio between the learning rate of the LoRA B matrices and that of the A matrices (the LoRA+ method).
  • Purpose:
    • Fine-tune LoRA parameters more efficiently by adjusting their learning rate.
  • Impact:
    • Higher values speed up learning but risk instability.
    • Lower values ensure stability at the cost of slower training.

新建适配器 (New Adapter)

  • Meaning: Creates a new, randomly initialized adapter configuration on top of the existing one.
  • Purpose:
    • Useful for training different configurations for multiple tasks without overwriting existing ones.

使用 rslora (Use rslora)

  • Meaning: Enables rank-stabilized LoRA (rsLoRA), which scales the low-rank update by alpha/√r instead of the standard alpha/r.
  • Purpose:
    • Keeps the update magnitude stable as the rank grows, improving training convergence at higher ranks (compare the numbers in the sketch below).
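
With the values from the custom-dataset command above (rank 10, alpha 80), the difference in scaling is easy to see:

import math

r, alpha = 10, 80                       # values from the custom-dataset command
standard = alpha / r                    # 8.0: classic LoRA scaling
rank_stabilized = alpha / math.sqrt(r)  # ~25.3: decays much more slowly with r
print(standard, rank_stabilized)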

使用 DoRA (Use DoRA)

  • Meaning: Stands for Weight-Decomposed Low-Rank Adaptation, which splits each pretrained weight into a magnitude and a direction component and applies the low-rank update to the direction.
  • Purpose:
    • Often closes part of the gap to full fine-tuning at a parameter budget similar to LoRA.

使用 PiSSA 方法 (Use PiSSA)

  • Meaning: PiSSA (Principal Singular values and Singular vectors Adaptation) initializes the adapter matrices from the principal singular components of the original weight instead of from random noise.
  • Purpose:
    • Speeds up convergence relative to the standard LoRA initialization.

LoRA 作用模块 (LoRA Target Modules)

  • Meaning: Specifies the target modules (e.g., attention layers, feed-forward layers) where LoRA is applied.
  • Purpose:
    • Fine-tune only specific parts of the model to save resources or improve performance on specific tasks.

附加模块 (Additional Modules)

  • Meaning: Specifies additional trainable modules outside of LoRA layers.
  • Purpose:
    • Enables training of non-LoRA components, offering broader customization for specific tasks.

Summary Table

| Parameter | Purpose |
| --- | --- |
| LoRA 秩 | Sets the rank of the LoRA matrices, controlling model complexity and resource usage. |
| LoRA 缩放系数 | Adjusts the scale of LoRA parameters to balance their influence in the model. |
| LoRA 随机丢弃 | Introduces dropout for regularization to prevent overfitting. |
| LoRA+ 学习率比例 | Sets the ratio between the learning rates of the LoRA B and A matrices. |
| 新建适配器 | Creates a new adapter configuration for multi-domain or task-specific tuning. |
| 使用 rslora | Stabilizes LoRA updates at higher ranks via alpha/√r scaling. |
| 使用 DoRA | Decomposes weights into magnitude and direction for more granular fine-tuning. |
| 使用 PiSSA | Initializes adapters from principal singular components for faster convergence. |
| LoRA 作用模块 | Defines the modules (e.g., attention layers) where LoRA will be applied. |
| 附加模块 | Specifies trainable modules beyond LoRA layers, enabling broader customization. |

Evaluation

Command:

llamafactory-cli eval qwen_lora_eval.yaml

Configuration file:

### model
model_name_or_path: /data/Qwen2-1.5B-Chat
adapter_name_or_path: /app/saves/data/Qwen2-1.5B-Chat/2024-12-25_02_02_19

### method
finetuning_type: lora

### dataset
task: mmlu_test  # choices: [mmlu_test, ceval_validation, cmmlu_test]
template: fewshot
lang: en
n_shot: 5

### output
save_dir: /app/saves/data/Qwen2-1.5B-Chat/eval/res

### eval
batch_size: 4
