
torch.distributed.elastic.multiprocessing.errors.ChildFailedError during Qwen3-14B training #7967

Closed
chichengzibu opened this issue May 7, 2025 · 10 comments
Labels
duplicate This issue or pull request already exists

Comments

@chichengzibu

chichengzibu commented May 7, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[rank0]:[W507 09:34:27.979354299 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
Loading checkpoint shards:  62%|████████████████████████████████████▎                     | 5/8 [00:05<00:02,  1.05it/s]W0507 09:34:28.785000 139621627877184 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 4837 closing signal SIGTERM
E0507 09:34:29.002000 139621627877184 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 4836) of binary: /home/administrator/anaconda3/envs/llm/bin/python
Traceback (most recent call last):
  File "/home/administrator/anaconda3/envs/llm/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main    run(args)
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/launcher.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-07_09:34:28
  host      : BF-202504091019.
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 4836)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Traceback (most recent call last):
  File "/home/administrator/anaconda3/envs/llm/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/cli.py", line 95, in main
    process = subprocess.run(
              ^^^^^^^^^^^^^^^
  File "/home/administrator/anaconda3/envs/llm/lib/python3.11/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '57137', '/home/administrator/tools/LLM/LLaMA-Factory/src/llamafactory/launcher.py', 'saves/Qwen3-14B-Instruct/lora/train_2025-05-07-08-27-35/training_args.yaml']' returned non-zero exit status 1.

Switching to Qwen3-8B works fine, but with 14B this error shows up no matter how I change the parameters.
My setup is a single machine with multiple GPUs, two RTX 4090s, so in theory VRAM should not be the problem.

Reproduction

"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 40960,
"max_window_layers": 40,
"model_type": "qwen3",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936

Others

No response

chichengzibu added the bug (Something isn't working) and pending (This problem is yet to be addressed) labels on May 7, 2025
@rixyyy

rixyyy commented May 7, 2025

May I ask which version of LLaMA-Factory you are using?

@chichengzibu
Author

> May I ask which version of LLaMA-Factory you are using?

The version I pulled last week; I didn't check the exact version number.

@Karrotina

I'm seeing the same problem as you. I have two 2080 Tis; single-GPU training works fine.
File "/home/karrot/anaconda3/envs/Fine-Tuning/lib/python3.11/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '2', '--master_addr', '127.0.0.1', '--master_port', '49435', '/home/karrot/LLM/LLaMA-Factory/src/llamafactory/launcher.py', '/home/karrot/LLM/Output/train_2025-05-07-17-17-55/training_args.yaml']' returned non-zero exit status 1.

@mg610

mg610 commented May 12, 2025

> May I ask which version of LLaMA-Factory you are using?

> The version I pulled last week; I didn't check the exact version number.

I also updated last week. Why does this happen? Have you solved it?

@hiyouga
Owner

hiyouga commented May 12, 2025

Enable ZeRO-3.
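
Some context on why ZeRO-3 helps here (a back-of-envelope estimate, not from the thread): under the default multi-GPU launch, plain data parallelism keeps a full copy of the base model on every GPU, and the bf16 weights of Qwen3-14B alone are roughly 14B parameters × 2 bytes ≈ 28 GB, more than the 24 GB of a single 4090, which is presumably why 8B trains but 14B fails. ZeRO-3 shards the parameters across the GPUs instead of replicating them. Below is a minimal sketch of a ZeRO-3 DeepSpeed config, modeled on the ds_z3_config.json files shipped under LLaMA-Factory's examples/deepspeed directory (the path and values are assumptions; check them against your checkout):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

The "auto" values are filled in by the Hugging Face Trainer's DeepSpeed integration from the training arguments, so the same file can be reused across runs with different batch sizes.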

hiyouga closed this as completed May 12, 2025
hiyouga added the solved label and removed the bug and pending labels May 12, 2025
hiyouga added the duplicate label and removed the solved label May 12, 2025
@zkj12321

I'd like to ask the OP: can dual 4090s really fine-tune 14B? Mine run out of VRAM even on 8B. Are you using quantization, or is something wrong with my training setup? 😮

@chichengzibu
Author

@zkj12321 After switching to 8B, training on a 500k-sample Alpaca-format dataset runs out of VRAM about halfway through. My suggestion is to keep the dataset to around 100k samples, or reduce parameters such as the batch size and gradient accumulation accordingly (see the YAML sketch below).
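
For concreteness, a sketch of the memory-related fields in a LLaMA-Factory training YAML; the parameter names below should match current LLaMA-Factory releases, but treat them as assumptions and verify against your own training_args.yaml:

per_device_train_batch_size: 1   # smaller per-GPU batches cut activation memory
gradient_accumulation_steps: 8   # preserve the effective batch size via accumulation
cutoff_len: 1024                 # shorter sequences also reduce activation memory
max_samples: 100000              # cap the number of training samples per dataset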

@chichengzibu
Author

@hiyouga Is this something I change in a local file? I couldn't find the corresponding option in the LLaMA-Factory web UI.
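
If you train from a YAML config rather than the web UI, the usual way (judging from the example configs shipped with LLaMA-Factory; treat the exact path as an assumption) is to point the training YAML at a DeepSpeed config file:

deepspeed: examples/deepspeed/ds_z3_config.json   # or the ZeRO-3 sketch shown above

The web-UI route is described in the next comment.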

@zkj12321

@chichengzibu To enable ZeRO-3, there should be a "DeepSpeed stage" option ("DeepSpeed stage for multi-GPU training") near the bottom of the web UI; set it to 3. After that, the generated command contains --deepspeed cache/ds_z3_offload_config.json (a sketch of the offload section follows the log below). One more question: how do you launch single-machine multi-GPU training? Mine seems to run only on cuda:0, which is why it runs out of memory:
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.32 GiB.
GPU 0 has a total capacity of 23.65 GiB of which 2.02 GiB is free.
Process 2574005 has 21.06 GiB memory in use.
Process 2600386 has 574.00 MiB memory in use.
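
For reference, the _offload variant that the web UI generates (cache/ds_z3_offload_config.json) differs from plain ZeRO-3 mainly by moving optimizer state and parameters to CPU memory. A sketch of that section, using standard DeepSpeed keys (the values are illustrative, not copied from the generated file):

  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  }

Offloading trades GPU memory for host RAM and PCIe traffic, so it is slower but lets larger models fit on 24 GB cards.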

@chichengzibu
Author

@zkj12321 In my case, turning on DeepSpeed makes the run error out immediately. For dual-GPU: check the device count shown at the bottom of the web UI; if it shows 2, both cards are being used. If not, there is a llamafactory command that lets you pin the run to GPUs 0 and 1; you can look it up (a sketch is included below). After launching, a glance at GPU utilization tells you whether both cards are actually in use.
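
The command being referred to is presumably the environment-variable route; a minimal sketch (CUDA_VISIBLE_DEVICES is standard CUDA, FORCE_TORCHRUN is a LLaMA-Factory convention as far as I know, and the YAML path is illustrative; check both against your version):

# pin the run to GPUs 0 and 1 and force a multi-process torchrun launch
CUDA_VISIBLE_DEVICES=0,1 FORCE_TORCHRUN=1 llamafactory-cli train path/to/training_args.yaml

# in another terminal, confirm both cards show a python process and growing memory use
watch -n 1 nvidia-smi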
