MoE finetuning extremely slow #736
Comments
Hi, this is expected: the MoE modeling code in transformers is not optimized, and finetune.py uses transformers. Currently, the optimized use case is inference in vLLM with the original model, whose base implementation was contributed by the community.
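For reference, a minimal inference sketch with vLLM for the original checkpoint; the prompt, sampling settings, and `tensor_parallel_size` here are assumptions and should be adjusted to your hardware:

```python
# Minimal vLLM inference sketch (assumed settings, not a recommended config).
from vllm import LLM, SamplingParams

# tensor_parallel_size is an assumption; match it to your available GPUs.
llm = LLM(model="Qwen/Qwen2-57B-A14B-Instruct", tensor_parallel_size=4)
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Give me a short introduction to large language models."], params)
print(outputs[0].outputs[0].text)
```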
Thanks for the quick update. Any plans from your side to add an optimized implementation to HF transformers?
All MoE models in transformers are implemented this way. To be frank, I don't think it could be done in transformers itself.
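For illustration, here is a simplified sketch of the per-expert routing loop that such unoptimized MoE blocks typically use. This is not the exact transformers code, only the pattern: each expert is visited sequentially, launching many small kernels, which is what hurts training throughput.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NaiveMoeBlock(nn.Module):
    """Simplified sparse-MoE block that routes each token to its top-k
    experts and loops over the experts one by one (the slow pattern)."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_size)
        logits = self.gate(x)  # (num_tokens, num_experts)
        weights, selected = torch.topk(F.softmax(logits, dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Sequential loop over experts: each iteration processes only the
        # tokens routed to that expert, so GPU utilization stays low.
        for expert_idx, expert in enumerate(self.experts):
            token_idx, k_idx = torch.where(selected == expert_idx)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, k_idx, None] * expert(x[token_idx])
        return out
```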
This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.
@jklj077 Hi, when I use transformers to SFT Qwen2 57B MoE on 32xA100 80G with input length 2048, it runs out of memory (OOM). Is there something wrong with my usage?
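As a rough sanity check of ZeRO-3 memory needs for the weights and optimizer states (activations, LoRA adapters, and communication buffers are not counted), DeepSpeed ships an estimator. The sketch below assumes a 4-node x 8-GPU topology and loads the full model on CPU, so it needs ample host RAM:

```python
# Rough ZeRO-3 memory estimate for model states only (assumed topology: 4x8 GPUs).
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-57B-A14B-Instruct", torch_dtype="auto"
)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=4)
```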
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Finetuning Qwen2-57B-A14B-Instruct is extremely slow compared to finetuning Qwen2-72B-Instruct.
Here are the runtimes:
Qwen/Qwen2-7B-Instruct:
{'train_runtime': 100.8509, 'train_samples_per_second': 5.652, 'train_steps_per_second': 0.099, 'train_loss': 0.751581035554409, 'epoch': 10.0}
Qwen/Qwen2-72B-Instruct:
{'train_runtime': 483.8572, 'train_samples_per_second': 1.178, 'train_steps_per_second': 0.021, 'train_loss': 0.6512975960969924, 'epoch': 10.0}
Qwen/Qwen2-57B-A14B-Instruct:
{'train_runtime': 2713.6648, 'train_samples_per_second': 0.21, 'train_steps_per_second': 0.004, 'train_loss': 10.314393818378448, 'epoch': 10.0}
I'm using finetune.sh / finetune.py from this repository with --use_lora True and the provided DeepSpeed ZeRO-3 config.
I've set per_device_train_batch_size to 1.
Hardware setup is 8xH100 80GB.
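For context, a minimal, hypothetical sketch of the kind of LoRA setup that --use_lora True enables; the rank, alpha, and target modules below are illustrative assumptions, not the exact values finetune.py uses:

```python
# Hypothetical LoRA setup sketch; hyperparameters are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-57B-A14B-Instruct", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-57B-A14B-Instruct")

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```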
Environment:
cutlass is v3.5.0