
Help needed: single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message #8118


Open
1 task done
xinxinzi8 opened this issue May 20, 2025 · 6 comments
Labels
bug (Something isn't working) · pending (This problem is yet to be addressed)

Comments

@xinxinzi8

xinxinzi8 commented May 20, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-4.18.0-2.6.8.x86_64-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • PyTorch version: 2.6.0a0+df5bbc09d1.nv24.12 (GPU)
  • Transformers version: 4.51.3
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 2
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.5

Reproduction


[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.checkpointing:143 >> Gradient checkpointing enabled.
[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.attention:143 >> Using torch SDPA for faster training and inference.
[INFO|2025-05-20 09:41:46] llamafactory.model.adapter:143 >> Upcasting trainable params to float32.
[INFO|2025-05-20 09:41:46] llamafactory.model.adapter:143 >> Fine-tuning method: LoRA
[INFO|2025-05-20 09:41:46] llamafactory.model.model_utils.misc:143 >> Found linear modules: gate_proj,up_proj,q_proj,down_proj,k_proj,v_proj,o_proj
[INFO|2025-05-20 09:42:14] llamafactory.model.loader:143 >> trainable params: 20,185,088 || all params: 7,635,801,600 || trainable%: 0.2643
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:748] 2025-05-20 09:42:15,081 >> Using auto half precision backend
[INFO|trainer.py:2414] 2025-05-20 09:42:15,641 >> ***** Running training *****
[INFO|trainer.py:2415] 2025-05-20 09:42:15,641 >>   Num examples = 1,020
[INFO|trainer.py:2416] 2025-05-20 09:42:15,641 >>   Num Epochs = 5
[INFO|trainer.py:2417] 2025-05-20 09:42:15,641 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:2420] 2025-05-20 09:42:15,641 >>   Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:2421] 2025-05-20 09:42:15,641 >>   Gradient Accumulation steps = 2
[INFO|trainer.py:2422] 2025-05-20 09:42:15,641 >>   Total optimization steps = 1,275
[INFO|trainer.py:2423] 2025-05-20 09:42:15,645 >>   Number of trainable parameters = 20,185,088
  0%|                                                                                                           | 0/1275 [00:00<?, ?it/s]


Others

Previously I fine-tuned qwen2.5 with the image published on Docker Hub about three months ago and had no problems at all. Now I want to fine-tune qwen3, so I switched to the new image, and the following happened:
Single-GPU/multi-GPU training of qwen2.5/qwen3 both hang: no progress is shown and there is no error message beforehand. The GPU shows a process occupying it, but GPU utilization stays at 0.
I'm a beginner, any help would be greatly appreciated!

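Not part of the original report, but one possible way to find out where the hung process is stuck (assuming py-spy can be installed inside the container; <PID> is the training process ID shown by nvidia-smi, and the YAML path is only a placeholder):

    pip install py-spy
    # dump the Python stack of the apparently hung trainer (may need root/ptrace
    # permission inside the container); repeating it a few times shows whether it
    # is waiting in a distributed/NCCL call or in data loading
    py-spy dump --pid <PID>
    # for the multi-GPU case, more verbose distributed logs can also help
    NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL llamafactory-cli train <your_config>.yaml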

xinxinzi8 added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 20, 2025
xinxinzi8 changed the title from "Help needed: Docker single-GPU training hangs with no error message" to "Help needed: Docker single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message" on May 20, 2025
xinxinzi8 changed the title to "Help needed: single-GPU/multi-GPU LoRA fine-tuning of Qwen hangs with no error message" on May 20, 2025
@wwfnb

wwfnb commented May 21, 2025

mark

@lnrick

lnrick commented May 22, 2025

Here is something I just ran into; not sure whether it helps you. Changing the DeepSpeed version to 0.16.4 (the officially recommended version) made it work; with the latest version it hangs just like in your case.
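For reference, a minimal way to try this suggestion (the 0.16.4 pin is taken from the comment above; the issue's system info lists DeepSpeed 0.16.5):

    pip install deepspeed==0.16.4
    pip show deepspeed   # confirm the downgrade actually took effect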

@Roylin1003

Roylin1003 commented May 23, 2025

Look for running_log.txt in the output dir and scroll to the bottom; it will show what happened last. Possible situations include:

  • A CUDA-related error appears, meaning the GPU driver needs to be updated.
  • The PyTorch version is too old.
    pip install torch torchvision torchaudio --index-url
  • Checkpoint creation failed; reinstall huggingface_hub.
    pip uninstall huggingface_hub -y # remove the old egg-format install
    pip install huggingface_hub --upgrade # reinstall the correct version

@xinxinzi8
Author

xinxinzi8 commented May 25, 2025

Look for running_log.txt in the output dir and scroll to the bottom; it will show what happened last. Possible situations include:

  • A CUDA-related error appears, meaning the GPU driver needs to be updated.
  • The PyTorch version is too old.
    pip install torch torchvision torchaudio --index-url
  • Checkpoint creation failed; reinstall huggingface_hub.
    pip uninstall huggingface_hub -y # remove the old egg-format install
    pip install huggingface_hub --upgrade # reinstall the correct version

Thanks for your reply! My output_dir only contains a runs folder, and inside it there is just a file whose name starts with events.out.tfevents; I don't see any running_log.txt. @Roylin1003

@xinxinzi8
Author

xinxinzi8 commented May 26, 2025

Here is something I just ran into; not sure whether it helps you. Changing the DeepSpeed version to 0.16.4 (the officially recommended version) made it work; with the latest version it hangs just like in your case.

Thanks! Unfortunately it still doesn't work for me. Could you suggest how to go about troubleshooting this? I suspect a version problem with some other package. @lnrick
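One generic way to narrow down a suspected package-version problem (just a sketch, not something from this thread) is to diff the environments of the old image that worked and the new image that hangs:

    # inside the old (working) image
    pip freeze > /tmp/old_requirements.txt
    # inside the new image
    pip freeze > /tmp/new_requirements.txt
    # compare, then try pinning the packages that changed (deepspeed, transformers, accelerate, ...)
    diff /tmp/old_requirements.txt /tmp/new_requirements.txt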

@Eureka-Maggie

Eureka-Maggie commented May 28, 2025

I ran into a similar symptom. If you wait a while, you should notice that GPU utilization comes in bursts. I stumbled on this when I copied 2 samples from the official mllm_video_audio data to make 5 samples, with bs=2 and accumulate_step=4: training appears to hang, and only after waiting a long time does it print a single huge loss, without a single update step. I'm still experimenting and debugging a concrete fix... it feels like a problem in how the data is iterated.
