torch.distributed.elastic.multiprocessing.errors.ChildFailedError during Qwen3-14B training #7967
Comments
May I ask which version of LLaMA-Factory you are using?
The new version updated last week; I didn't check which version exactly.
I'm hitting the same problem as you, on dual 2080 Ti; single-GPU training works fine.
I also updated last week. Why is this happening? Have you solved it?
Enable ZeRO-3.
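For reference, ZeRO-3 is configured through a DeepSpeed JSON file; recent LLaMA-Factory releases ship ready-made ones (e.g. examples/deepspeed/ds_z3_config.json). A minimal ZeRO-3 sketch looks roughly like the one below; the "auto" values are placeholders resolved by the Transformers/DeepSpeed integration at runtime, so prefer the file bundled with your installed version over copying this verbatim:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}

On 2×24 GB cards, a 14B model may additionally need the ZeRO-3 offload variant (optimizer/parameter offload to CPU).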
OP, can dual 4090s really fine-tune 14B? My dual 4090s run out of memory even on 8B. Are you using quantization, or is something wrong with my training setup? 😮
@zkj12321 After switching to 8B, training on a 500k-sample Alpaca-format dataset still runs out of memory halfway through. My suggestion is to keep the dataset to around 100k samples, or reduce settings such as the batch size and gradient accumulation steps accordingly.
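To make the "reduce batch size / cap the dataset" advice concrete: in LLaMA-Factory these map to ordinary training arguments, whether set in the WebUI or in the YAML/JSON config passed to llamafactory-cli train. A hedged sketch of the memory-relevant fields only (values are illustrative, not recommendations; merge them into your full training config and check the field names against your installed version):

{
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 4,
  "cutoff_len": 1024,
  "max_samples": 100000,
  "bf16": true,
  "deepspeed": "examples/deepspeed/ds_z3_config.json"
}

per_device_train_batch_size and gradient_accumulation_steps trade step size for memory, cutoff_len bounds the sequence length, and max_samples caps how many examples of the dataset are used.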
@hiyouga Is this changed in a local file? I can't find the relevant option in the LLaMA-Factory WebUI.
@chichengzibu Enabling ZeRO-3 should be near the bottom of the WebUI: there is a "DeepSpeed stage" selector ("DeepSpeed stage for multi-GPU training").
@zkj12321 In my case it errors out as soon as DeepSpeed is enabled. For dual-GPU use: check how many devices the WebUI reports at the bottom; if it shows 2, both cards are being used. If not, there is a llamafactory command option for pinning GPUs 0 and 1 that you can look up; after running, check GPU utilization to see whether both cards were actually used.
System Info
Switching to Qwen3-8B works fine; with 14B, this error appears no matter how I change the parameters.
My setup is a single machine with multiple GPUs (dual RTX 4090), which in theory should not run into memory problems.
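A rough back-of-the-envelope check (assuming bf16 base weights, no quantization, and the commonly cited parameter counts of roughly 14.8B for Qwen3-14B and 8.2B for Qwen3-8B) suggests memory is in fact the likely cause: the 14B weights alone already exceed a single 4090's 24 GB, while 8B fits, which matches the observed behavior and is why ZeRO-3 (partitioning the weights across both cards) is the suggested fix.

$$
14.8\times10^{9}\ \text{params}\times 2\ \text{bytes (bf16)} \approx 29.6\ \text{GB} > 24\ \text{GB},\qquad
8.2\times10^{9}\times 2\ \text{bytes} \approx 16.4\ \text{GB} < 24\ \text{GB}.
$$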
Reproduction
config.json of the Qwen3-14B checkpoint being fine-tuned:
{
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 17408,
"max_position_embeddings": 40960,
"max_window_layers": 40,
"model_type": "qwen3",
"num_attention_heads": 40,
"num_hidden_layers": 40,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.51.3",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
Others
No response