lmc8133 changed the title from "No such file or directory zero_pp_rank_4_mp_rank_00_optim_states.pt" to "Continue training error: No such file or directory zero_pp_rank_4_mp_rank_00_optim_states.pt" on May 19, 2025.
Reminder
System Info
```
[2025-05-19 14:09:56,423] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-19 14:10:15 init.py:190] Automatically detected platform cuda.
```

llamafactory version: 0.9.3.dev0

Reproduction
Training script:

```shell
torchrun --nproc_per_node=$NUM_PROCESSES --nnodes=$WORLD_SIZE --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT src/train.py \
    --deepspeed examples/deepspeed/ds_z1_config.json \
    --stage sft \
    --do_train \
    --enable_liger_kernel \
    --flash_attn fa2 \
    --model_name_or_path qwen25_math_7b \
    --dataset [some datasets] \
    --dataset_dir data \
    --template qwen \
    --finetuning_type full \
    --output_dir qwen25_math_7b_0516 \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 16384 \
    --warmup_ratio 0.1 \
    --save_steps 500 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --ddp_timeout 180000000 \
    --learning_rate 5e-5 \
    --lr_scheduler_type cosine \
    --resume_from_checkpoint qwen25_math_7b_0516/checkpoint-1500/ \
    --logging_steps 1 \
    --plot_loss \
    --num_train_epochs 6 \
    --bf16 \
    --seed 17 \
    --report_to "none"
```
Others
I want to continue training from the saved checkpoint, but resuming fails with the error above (the file zero_pp_rank_4_mp_rank_00_optim_states.pt cannot be found).
ds_z1_config.json:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
```
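For reference, the `"auto"` fields in this config are filled in from the CLI arguments at launch, and DeepSpeed then enforces that the resolved global batch size equals micro batch × gradient accumulation × data-parallel world size. A small sketch of that relation (the function name is illustrative, not a DeepSpeed API):

```python
def resolved_train_batch_size(micro_batch, grad_accum, dp_world_size):
    # DeepSpeed requires: train_batch_size ==
    #   train_micro_batch_size_per_gpu * gradient_accumulation_steps * dp world size
    return micro_batch * grad_accum * dp_world_size

# With this issue's flags (--per_device_train_batch_size 1,
# --gradient_accumulation_steps 16) on an assumed 8 GPUs:
# resolved_train_batch_size(1, 16, 8) -> 128
```

Because the global batch size depends on the world size, changing the GPU count between the original run and the resume run changes the effective training schedule as well as the checkpoint layout.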