
How to get the loss of each individual example #6165


Closed
1 task done
Word2VecT opened this issue Nov 27, 2024 · 5 comments · Fixed by #6242
Labels
solved This problem has been already solved

Comments

@Word2VecT

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.11.0
  • PyTorch version: 2.4.1+cu121 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-80GB
  • DeepSpeed version: 0.15.3

Reproduction

torchrun --nnodes=1 --nproc-per-node=8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --use_fast_tokenizer \
    --flash_attn fa2 \
    --model_name_or_path /mnt/petrelfs/tangzinan/LLaMA-Factory/models/LLama3.1-8B \
    --dataset gsm8k_train \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLama3.1-8B/full/train_2024-11-14-22-43-17 \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_ratio 0.03 \
    --weight_decay 0. \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --ddp_timeout 9000 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --cutoff_len 4096 \
    --save_steps 400 \
    --logging_steps 1 \
    --plot_loss \
    --num_train_epochs 1 \
    --bf16 \
    --report_to wandb

Expected behavior

After SFT fine-tuning has finished, is there a way to run inference over the dataset once and obtain the loss for each individual example?

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 27, 2024
@Word2VecT
Author

@hiyouga Could you advise? Thanks!

@hiyouga
Owner

hiyouga commented Dec 4, 2024

https://github.com/hiyouga/LLaMA-Factory/blob/main/scripts/stat_utils/cal_ppl.py
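
For readers who just want the idea: the linked script loads the checkpoint, iterates over the dataset, and records a per-example perplexity/loss. A minimal, hedged sketch of the same computation with plain transformers is below (the model path and texts are placeholders; unlike cal_ppl.py, this version does not mask prompt tokens, so the loss is averaged over every token of each example):

```python
# Minimal sketch (not the repository's implementation): average token-level
# loss for each example, from a plain forward pass with labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/sft/checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

texts = ["example 1 ...", "example 2 ..."]  # placeholder data
per_example_loss = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # over the shifted tokens of this single example.
        out = model(**inputs, labels=inputs["input_ids"])
    per_example_loss.append(out.loss.item())

print(per_example_loss)
```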

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 4, 2024
@hiyouga hiyouga closed this as completed Dec 4, 2024
@Word2VecT
Author

So you mean I need to run this Python script myself?
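
(The script is indeed run directly from the command line; it uses fire, so the keyword arguments of calculate_ppl become CLI flags. A hedged example invocation, reusing the paths from the reproduction above; the flag names and the save_name argument are assumptions taken from the script's signature and should be checked against the current file, and older checkouts place the script at scripts/cal_ppl.py:)

```bash
python scripts/stat_utils/cal_ppl.py \
    --model_name_or_path saves/LLama3.1-8B/full/train_2024-11-14-22-43-17 \
    --dataset gsm8k_train \
    --template llama3 \
    --save_name ppl.json
```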

@Word2VecT
Author

main/scripts/stat_utils/cal_ppl.py

I tried running it, but it fails with the following error:
12/05/2024 00:18:03 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
12/05/2024 00:18:03 - INFO - llamafactory.model.loader - all params: 7,615,616,512
0%| | 0/330 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
tensor = as_tensor(value)
^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 737, in as_tensor
return torch.tensor(value)
^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not infer dtype of NoneType

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/petrelfs/tangzinan/LLaMA-Factory/scripts/cal_ppl.py", line 137, in
fire.Fire(calculate_ppl)
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/LLaMA-Factory/scripts/cal_ppl.py", line 114, in calculate_ppl
for batch in tqdm(dataloader):
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/tqdm/std.py", line 1181, in iter
for obj in iterable:
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/data/data_collator.py", line 598, in call
batch = pad_without_fast_tokenizer_warning(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
padded = tokenizer.pad(*pad_args, **pad_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3536, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 240, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 791, in convert_to_tensors
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (images in this case) have excessive nesting (inputs type list where type int is expected).

@hiyouga hiyouga mentioned this issue Dec 5, 2024
@hiyouga
Owner

hiyouga commented Dec 5, 2024

@Word2VecT fixed
