
Help needed: single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message #8118


Open
1 task done
xinxinzi8 opened this issue May 20, 2025 · 6 comments
Labels
bug (Something isn't working) · pending (This problem is yet to be addressed)

Comments

@xinxinzi8

xinxinzi8 commented May 20, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-4.18.0-2.6.8.x86_64-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • PyTorch version: 2.6.0a0+df5bbc09d1.nv24.12 (GPU)
  • Transformers version: 4.51.3
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 2
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.5

Reproduction


[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-20 09:41:46] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-20 09:41:46] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.misc:143 >> Found linear modules: gate_proj,up_proj,q_proj,down_proj,k_proj,v_proj,o_proj
[INFO|2025-05-20 09:42:14] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 7,635,801,600 || trainable%: 0.2643
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:748] 2025-05-20 09:42:15,081 >> Using auto half precision backend
[INFO|trainer.py:2414] 2025-05-20 09:42:15,641 >> ***** Running training *****
[INFO|trainer.py:2415] 2025-05-20 09:42:15,641 >>   Num examples = 1,020
[INFO|trainer.py:2416] 2025-05-20 09:42:15,641 >>   Num Epochs = 5
[INFO|trainer.py:2417] 2025-05-20 09:42:15,641 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2420] 2025-05-20 09:42:15,641 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:2421] 2025-05-20 09:42:15,641 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:2422] 2025-05-20 09:42:15,641 >>   Total optimization steps = 1,275
[INFO|trainer.py:2423] 2025-05-20 09:42:15,645 >>   Number of trainable parameters = 20,185,088
  0%|                                                                                                           | 0/1275 [00:00<?, ?it/s]


Others

Previously I fine-tuned qwen2.5 with the image published on Docker Hub about three months ago and had no problems at all. Now I want to fine-tune qwen3, so I switched to the new image, and the following happened:
Single-GPU/multi-GPU training of qwen2.5/qwen3 both hang: no progress is shown and there is no error message beforehand. The GPU shows a process occupying it, but GPU utilization stays at 0.
I'm a beginner, any help would be greatly appreciated!

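Not part of the original report, but one possible way to find out where the hung process is stuck (assuming py-spy can be installed inside the container; <PID> is the training process ID shown by nvidia-smi, and the YAML path is only a placeholder):

    pip install py-spy
    # dump the Python stack of the apparently hung trainer (may need root/ptrace
    # permission inside the container); repeating it a few times shows whether it
    # is waiting in a distributed/NCCL call or in data loading
    py-spy dump --pid <PID>
    # for the multi-GPU case, more verbose distributed logs can also help
    NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL llamafactory-cli train <your_config>.yaml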

xinxinzi8 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 20, 2025
xinxinzi8 changed the title from "Help needed: Docker single-GPU training hangs with no error message" to "Help needed: Docker single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message" on May 20, 2025
xinxinzi8 changed the title to "Help needed: single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message" on May 20, 2025
@wwfnb

wwfnb commented May 21, 2025

mark

@lnrick

lnrick commented May 22, 2025

Here is something I just ran into; not sure whether it helps you. Changing the DeepSpeed version to 0.16.4 (the officially recommended version) made it work; with the latest version it hangs just like in your case.
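For reference, a minimal way to try this suggestion (the 0.16.4 pin is taken from the comment above; the issue's system info lists DeepSpeed 0.16.5):

    pip install deepspeed==0.16.4
    pip show deepspeed   # confirm the downgrade actually took effect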

@Roylin1003

Roylin1003 commented May 23, 2025

Look for running_log.txt in the output dir and scroll to the bottom; it will show what happened last. Possible situations include:

  • A CUDA-related error appears, meaning the GPU driver needs to be updated.
  • The PyTorch version is too old.
    pip install torch torchvision torchaudio --index-url
  • Checkpoint creation failed; reinstall huggingface_hub.
    pip uninstall huggingface_hub -y # remove the old egg-format install
    pip install huggingface_hub --upgrade # reinstall the correct version

@xinxinzi8
Author

xinxinzi8 commented May 25, 2025

Look for running_log.txt in the output dir and scroll to the bottom; it will show what happened last. Possible situations include:

  • A CUDA-related error appears, meaning the GPU driver needs to be updated.
  • The PyTorch version is too old.
    pip install torch torchvision torchaudio --index-url
  • Checkpoint creation failed; reinstall huggingface_hub.
    pip uninstall huggingface_hub -y # remove the old egg-format install
    pip install huggingface_hub --upgrade # reinstall the correct version

Thanks for your reply! My output_dir only contains a runs folder, and inside it there is just a file whose name starts with events.out.tfevents; I don't see any running_log.txt. @Roylin1003

@xinxinzi8
Author

xinxinzi8 commented May 26, 2025

Here is something I just ran into; not sure whether it helps you. Changing the DeepSpeed version to 0.16.4 (the officially recommended version) made it work; with the latest version it hangs just like in your case.

Thanks! Unfortunately it still doesn't work for me. Could you suggest how to go about troubleshooting this? I suspect a version problem with some other package. @lnrick
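One generic way to narrow down a suspected package-version problem (just a sketch, not something from this thread) is to diff the environments of the old image that worked and the new image that hangs:

    # inside the old (working) image
    pip freeze > /tmp/old_requirements.txt
    # inside the new image
    pip freeze > /tmp/new_requirements.txt
    # compare, then try pinning the packages that changed (deepspeed, transformers, accelerate, ...)
    diff /tmp/old_requirements.txt /tmp/new_requirements.txt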

@Eureka-Maggie

Eureka-Maggie commented May 28, 2025

I ran into a similar symptom. If you wait a while, you should notice that GPU utilization comes in bursts. I stumbled on this when I copied 2 samples from the official mllm_video_audio data to make 5 samples, with bs=2 and accumulate_step=4: training appears to hang, and only after waiting a long time does it print a single huge loss, without a single update step. I'm still experimenting and debugging a concrete fix... it feels like a problem in how the data is iterated.
