
Continue training error: No such file or directory zero_pp_rank_4_mp_rank_00_optim_states.pt #8098


Open
lmc8133 opened this issue May 19, 2025 · 2 comments
Labels
bug Something isn't working pending This problem is yet to be addressed

Comments

@lmc8133

lmc8133 commented May 19, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[2025-05-19 14:09:56,423] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-19 14:10:15 __init__.py:190] Automatically detected platform cuda.

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.10.134-13.al8.x86_64-x86_64-with-glibc2.32
  • Python version: 3.10.13
  • PyTorch version: 2.5.1+cu121 (GPU)
  • Transformers version: 4.51.3
  • Datasets version: 2.21.0
  • Accelerate version: 1.4.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • GPU memory: 79.35GB
  • DeepSpeed version: 0.16.4
  • Bitsandbytes version: 0.45.0
  • vLLM version: 0.7.2

Reproduction

error log

[rank4]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank4]:     return inner_training_loop(
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2398, in _inner_training_loop
[rank4]:     deepspeed_load_checkpoint(
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 489, in deepspeed_load_checkpoint
[rank4]:     load_path, _ = deepspeed_engine.load_checkpoint(
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2862, in load_checkpoint
[rank4]:     success = self._load_zero_checkpoint(load_dir, tag, load_optimizer_states=load_optimizer_states)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3047, in _load_zero_checkpoint
[rank4]:     zero_sd_list = self._get_all_zero_checkpoints(load_dir, tag)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3124, in _get_all_zero_checkpoints
[rank4]:     return self._get_all_zero_checkpoint_state_dicts(zero_ckpt_names)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3103, in _get_all_zero_checkpoint_state_dicts
[rank4]:     _state = self.checkpoint_engine.load(
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/runtime/checkpoint_engine/torch_checkpoint_engine.py", line 28, in load
[rank4]:     partition = torch.load(path, map_location=map_location, weights_only=False)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 1319, in load
[rank4]:     with _open_file_like(f, "rb") as opened_file:
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 659, in _open_file_like
[rank4]:     return _open_file(name_or_buffer, mode)
[rank4]:   File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 640, in __init__
[rank4]:     super().__init__(open(name, mode))
[rank4]: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/qwen25_math_7b_0516/checkpoint-1500/global_step1499/zero_pp_rank_4_mp_rank_00_optim_states.pt'


Training script

torchrun --nproc_per_node=$NUM_PROCESSES --nnodes=$WORLD_SIZE --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT src/train.py \
  --deepspeed examples/deepspeed/ds_z1_config.json \
  --stage sft \
  --do_train \
  --enable_liger_kernel \
  --flash_attn fa2 \
  --model_name_or_path qwen25_math_7b \
  --dataset [some datasets] \
  --dataset_dir data \
  --template qwen \
  --finetuning_type full \
  --output_dir qwen25_math_7b_0516 \
  --overwrite_cache \
  --overwrite_output_dir \
  --cutoff_len 16384 \
  --warmup_ratio 0.1 \
  --save_steps 500 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 4 \
  --gradient_accumulation_steps 16 \
  --ddp_timeout 180000000 \
  --learning_rate 5e-5 \
  --lr_scheduler_type cosine \
  --resume_from_checkpoint qwen25_math_7b_0516/checkpoint-1500/ \
  --logging_steps 1 \
  --plot_loss \
  --num_train_epochs 6 \
  --bf16 \
  --seed 17 \
  --report_to "none"

Others

I want to resume training from checkpoint-1500, but the run fails with the error above.

ds_z1_config.json

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}

@lmc8133 lmc8133 added bug Something isn't working pending This problem is yet to be addressed labels May 19, 2025
@lmc8133 lmc8133 changed the title No such file or directory zero_pp_rank_4_mp_rank_00_optim_states.pt Continue training error: No such file or directory zero_pp_rank_4_mp_rank_00_optim_states.pt May 19, 2025
@lmc8133
Author

lmc8133 commented May 19, 2025

There is indeed no zero_pp_rank_4_mp_rank_00_optim_states.pt in .../checkpoint-1500/global_step1499/. What can I do?

$ ls checkpoint-1500/global_step1499/
bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt   bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt   bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_20_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_30_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_21_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_31_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_22_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_23_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_24_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_25_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_16_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_26_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_17_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_27_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_18_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_28_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
bf16_zero_pp_rank_19_mp_rank_00_optim_states.pt  bf16_zero_pp_rank_29_mp_rank_00_optim_states.pt  mp_rank_00_model_states.pt
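
Counting the optimizer shards confirms that all 32 data-parallel ranks (0–31) are present, just under a bf16_ prefix:

$ ls checkpoint-1500/global_step1499/ | grep -c optim_states.pt
32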

@hiyouga
Owner

hiyouga commented May 19, 2025

Try removing the bf16_ prefix from the shard filenames.
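
A minimal sketch of that fix, assuming the checkpoint layout shown above (it copies rather than renames, so the original shards are kept in case you need to roll back):

cd qwen25_math_7b_0516/checkpoint-1500/global_step1499
for f in bf16_zero_pp_rank_*_mp_rank_00_optim_states.pt; do
  # strip the leading "bf16_", e.g.
  # bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt -> zero_pp_rank_4_mp_rank_00_optim_states.pt
  cp "$f" "${f#bf16_}"
done

The mismatch most likely arises because DeepSpeed picks the shard filename prefix from its bf16 mode at load time: the shards were written by the bf16 optimizer (hence the bf16_ prefix), but on resume the engine looked up the unprefixed names. Keeping the precision settings identical between the original and the resumed run should avoid the mismatch in the first place.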
