🚨FAQs | 常见问题🚨 #4614

hiyouga opened this issue Jun 28, 2024 · 0 comments
Labels: good first issue (Good for newcomers)

hiyouga commented Jun 28, 2024

Note

Please avoid creating issues regarding the following questions, as they might be closed without a response.


Most problems

Dependency version conflicts

Supported models cannot be found

llamafactory-cli: command not found

Please update the repository and reinstall it as follows:

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation

Out-of-memory

The out-of-memory (OOM) error during training usually means that the current device does not have enough free VRAM to complete the computation. You can try the following methods to deal with this issue (see the example config after the list):

  1. Reduce the per-device training batch size: per_device_train_batch_size: 1
  2. Reduce the maximum sequence length: cutoff_len: 512
  3. Replace the compute kernels: enable_liger_kernel: true and use_unsloth_gc: true
  4. Use DeepSpeed ZeRO-3 or FSDP to partition the model weights across multiple devices, or use CPU offloading
  5. Set quantization_bit: 4 to quantize the model parameters (only compatible with LoRA tuning)
  6. Use the paged optimizer: optim: paged_adamw_8bit

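For reference, a memory-constrained LoRA run might combine several of these options in one training config. This is only an illustrative sketch: the model path is a placeholder, gradient_accumulation_steps is added here (it is not mentioned above) to preserve the effective batch size, and quantization_bit: 4 assumes LoRA tuning.

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder model
finetuning_type: lora
quantization_bit: 4              # quantize weights to 4-bit (LoRA only)
per_device_train_batch_size: 1   # smallest batch size
gradient_accumulation_steps: 8   # keep the effective batch size reasonable
cutoff_len: 512                  # shorter sequences need less activation memory
enable_liger_kernel: true        # fused compute kernels
use_unsloth_gc: true             # unsloth gradient checkpointing
optim: paged_adamw_8bit          # paged 8-bit optimizer states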

Unsatisfactory fine-tuning results

Unsatisfactory fine-tuning results are usually caused by too few training samples, which leads to underfitting. You can try the following methods to deal with this issue (see the example config after the list):

  1. Increase the size of the training dataset
  2. Increase the number of epochs num_train_epochs: 5.0 or steps max_steps: 1000
  3. Use a larger learning rate learning_rate: 2.0e-4
  4. Use a different fine-tuning method, e.g. finetuning_type: freeze or finetuning_type: full

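As a sketch, the relevant lines in the training config might look like the following (illustrative values only; whether they help depends on the size and quality of the dataset):

num_train_epochs: 5.0        # or set max_steps: 1000 instead
learning_rate: 2.0e-4        # larger learning rate
finetuning_type: full        # or: freeze, lora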

Corrupted or repeated model responses

If this issue occurs before training, it is usually caused by using an unaligned (base) model or a mismatched template. Please make sure an aligned (instruct/chat) model and the correct template are used.
If this issue occurs after training, please check that the same template is used for training and inference, and also check whether the model has overfitted. To deal with overfitting, try decreasing the number of epochs num_train_epochs and the learning rate learning_rate.

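For example (a hypothetical pairing; llama3 is just one of the built-in template names), the training and inference configs should point at an aligned model and agree on the template:

model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # aligned (instruct) model
template: llama3                                         # must match between training and inference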


Training hangs

If distributed training is not enabled, please use the following command to check whether the CUDA build of PyTorch is installed correctly:

python -c "import torch; print(torch.cuda.is_available())"

If distributed training is enabled, try setting the environment variable export NCCL_P2P_LEVEL=NVL.



LLaMA Board cannot display datasets

Please make sure that the working directory when launching LLaMA Board is the LLaMA-Factory root directory.



How to shard model weights across multiple devices

During the training phase, please refer to the examples for how to use DeepSpeed ZeRO-3 (recommended) or FSDP.
During the inference phase, please use vLLM to enable tensor parallelism: examples.

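As a minimal sketch for the training side (assuming the ZeRO-3 JSON config shipped under examples/deepspeed in the repository), sharding is enabled by pointing the training config at a DeepSpeed config file:

deepspeed: examples/deepspeed/ds_z3_config.json  # enables ZeRO-3 weight partitioning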


How to use ORPO or SimPO

Change pref_loss in the example script to orpo or simpo.

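A minimal sketch, assuming the DPO example config is used as the starting point (only the loss type changes; other hyperparameters may still need tuning):

stage: dpo
pref_loss: orpo    # or: simpo
pref_beta: 0.1     # example value for the preference loss coefficient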


How to debug with VSCode

See #5337


Why the number of examples in pre-training is smaller than expected

We automatically use packing during pre-training, which concatenates multiple samples into one sequence, so the number of displayed examples is smaller than the actual number of samples.



Will the training data be shuffled

LLaMA-Factory randomly shuffles the training data by default. You can set disable_shuffling to turn off shuffling, as in the snippet below.

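For example, a single line in the training config (the flag name is taken from the text above):

disable_shuffling: true   # keep the original sample order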


How to enable streaming

If you want to use streaming, we recommend shuffling the dataset manually before training and using a configuration like the following:

buffer_size: 128
preprocessing_batch_size: 128
streaming: true
accelerator_config:
  dispatch_batches: false

Tip

If the problem still exists with the latest code, please create an issue.
