
torch.distributed.elastic.multiprocessing.errors.ChildFailedError during Qwen3-14B training #7967

Closed
chichengzibu opened this issue May 7, 2025 · 10 comments
Labels
duplicate This issue or pull request already exists

Comments

@chichengzibu

chichengzibu commented May 7, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[rank0]:[W507 09:34:27.979354299 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Loading checkpoint shards:  62%|████████████████████████████████████▎                     | 5/8 [00:05<00:02,  1.05it/s]W0507 09:34:28.785000 139621627877184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4837 closing signal SIGTERM
E0507 09:34:29.002000 139621627877184 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 4836) of binary: /home/administrator/anaconda3/envs/llm/bin/python
Traceback (most recent call last):
  File "/home/administrator/anaconda3/envs/llm/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main    run(args)
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-07_09:34:28
  host      : BF-202504091019.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4836)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/administrator/anaconda3/envs/llm/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/cli.py", line 95, in main
    process = subprocess.run(
              ^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '57137', '/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/launcher.py', 'saves/Qwen3-14B-Instruct/lora/train_2025-05-07-08-27-35/training_args.yaml']' returned non-zero exit status 1.

Switching to Qwen3-8B works fine, but with 14B this error shows up no matter how I change the parameters.
My setup is a single machine with multiple GPUs, two RTX 4090s, so in theory VRAM should not be the problem.

Reproduction

"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 40960,
"max_window_layers": 40,
"model_type": "qwen3",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936

Others

No response

chichengzibu added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 7, 2025
@rixyyy

rixyyy commented May 7, 2025

May I ask which version of LLaMA-Factory you are using?

@chichengzibu
Author

> May I ask which version of LLaMA-Factory you are using?

The version I pulled last week; I didn't check the exact version number.

@Karrotina

I'm seeing the same problem as you. I have two 2080 Tis; single-GPU training works fine.
File "/home/karrot/anaconda3/envs/Fine-Tuning/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '49435', '/home/karrot/LLM/LLaMA-Factory/src/llamafactory/launcher.py', '/home/karrot/LLM/Output/train_2025-05-07-17-17-55/training_args.yaml']' returned non-zero exit status 1.

@mg610

mg610 commented May 12, 2025

> May I ask which version of LLaMA-Factory you are using?

> The version I pulled last week; I didn't check the exact version number.

I also updated last week. Why does this happen? Have you solved it?

@hiyouga
Owner

hiyouga commented May 12, 2025

Enable ZeRO-3.
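
Some context on why ZeRO-3 helps here (a back-of-envelope estimate, not from the thread): under the default multi-GPU launch, plain data parallelism keeps a full copy of the base model on every GPU, and the bf16 weights of Qwen3-14B alone are roughly 14B parameters × 2 bytes ≈ 28 GB, more than the 24 GB of a single 4090, which is presumably why 8B trains but 14B fails. ZeRO-3 shards the parameters across the GPUs instead of replicating them. Below is a minimal sketch of a ZeRO-3 DeepSpeed config, modeled on the ds_z3_config.json files shipped under LLaMA-Factory's examples/deepspeed directory (the path and values are assumptions; check them against your checkout):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

The "auto" values are filled in by the Hugging Face Trainer's DeepSpeed integration from the training arguments, so the same file can be reused across runs with different batch sizes.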

hiyouga closed this as completed May 12, 2025
hiyouga added the solved label and removed the bug and pending labels May 12, 2025
hiyouga added the duplicate label and removed the solved label May 12, 2025
@zkj12321

I'd like to ask the OP: can dual 4090s really fine-tune 14B? Mine run out of VRAM even on 8B. Are you using quantization, or is something wrong with my training setup? 😮

@chichengzibu
Author

@zkj12321 After switching to 8B, training on a 500k-sample Alpaca-format dataset runs out of VRAM about halfway through. My suggestion is to keep the dataset to around 100k samples, or reduce parameters such as the batch size and gradient accumulation accordingly (see the YAML sketch below).
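
For concreteness, a sketch of the memory-related fields in a LLaMA-Factory training YAML; the parameter names below should match current LLaMA-Factory releases, but treat them as assumptions and verify against your own training_args.yaml:

per_device_train_batch_size: 1   # smaller per-GPU batches cut activation memory
gradient_accumulation_steps: 8   # preserve the effective batch size via accumulation
cutoff_len: 1024                 # shorter sequences also reduce activation memory
max_samples: 100000              # cap the number of training samples per dataset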

@chichengzibu
Author

@hiyouga Is this something I change in a local file? I couldn't find the corresponding option in the LLaMA-Factory web UI.
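
If you train from a YAML config rather than the web UI, the usual way (judging from the example configs shipped with LLaMA-Factory; treat the exact path as an assumption) is to point the training YAML at a DeepSpeed config file:

deepspeed: examples/deepspeed/ds_z3_config.json   # or the ZeRO-3 sketch shown above

The web-UI route is described in the next comment.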

@zkj12321

@chichengzibu To enable ZeRO-3, there should be a "DeepSpeed stage" option ("DeepSpeed stage for multi-GPU training") near the bottom of the web UI; set it to 3. After that, the generated command contains --deepspeed cache/ds_z3_offload_config.json (a sketch of the offload section follows the log below). One more question: how do you launch single-machine multi-GPU training? Mine seems to run only on cuda:0, which is why it runs out of memory:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 2.02 GiB is free.
Process 2574005 has 21.06 GiB memory in use.
Process 2600386 has 574.00 MiB memory in use.
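
For reference, the _offload variant that the web UI generates (cache/ds_z3_offload_config.json) differs from plain ZeRO-3 mainly by moving optimizer state and parameters to CPU memory. A sketch of that section, using standard DeepSpeed keys (the values are illustrative, not copied from the generated file):

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }

Offloading trades GPU memory for host RAM and PCIe traffic, so it is slower but lets larger models fit on 24 GB cards.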

@chichengzibu
Author

@zkj12321 In my case, turning on DeepSpeed makes the run error out immediately. For dual-GPU: check the device count shown at the bottom of the web UI; if it shows 2, both cards are being used. If not, there is a llamafactory command that lets you pin the run to GPUs 0 and 1; you can look it up (a sketch is included below). After launching, a glance at GPU utilization tells you whether both cards are actually in use.
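
The command being referred to is presumably the environment-variable route; a minimal sketch (CUDA_VISIBLE_DEVICES is standard CUDA, FORCE_TORCHRUN is a LLaMA-Factory convention as far as I know, and the YAML path is illustrative; check both against your version):

# pin the run to GPUs 0 and 1 and force a multi-process torchrun launch
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train path/to/training_args.yaml

# in another terminal, confirm both cards show a python process and growing memory use
watch -n 1 nvidia-smi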
