
How to get the loss of each individual example #6165


Closed
1 task done
Word2VecT opened this issue Nov 27, 2024 · 5 comments · Fixed by #6242
Labels
solved This problem has been already solved

Comments

@Word2VecT

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.11.0
  • PyTorch version: 2.4.1+cu121 (GPU)
  • Transformers version: 4.45.2
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A100-SXM4-80GB
  • DeepSpeed version: 0.15.3

Reproduction

torchrun --nnodes=1 --nproc-per-node=8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --use_fast_tokenizer \
    --flash_attn fa2 \
    --model_name_or_path /mnt/petrelfs/tangzinan/LLaMA-Factory/models/LLama3.1-8B \
    --dataset gsm8k_train \
    --template llama3 \
    --finetuning_type full \
    --output_dir saves/LLama3.1-8B/full/train_2024-11-14-22-43-17 \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_ratio 0.03 \
    --weight_decay 0. \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --ddp_timeout 9000 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --cutoff_len 4096 \
    --save_steps 400 \
    --logging_steps 1 \
    --plot_loss \
    --num_train_epochs 1 \
    --bf16 \
    --report_to wandb

Expected behavior

After SFT fine-tuning has finished, is there a way to run inference over the dataset once and obtain the loss for each individual example?

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Nov 27, 2024
@Word2VecT
Author

@hiyouga Could you advise? Thanks!

@hiyouga
Owner

hiyouga commented Dec 4, 2024

https://github.com/hiyouga/LLaMA-Factory/blob/main/scripts/stat_utils/cal_ppl.py
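
For readers who just want the idea: the linked script loads the checkpoint, iterates over the dataset, and records a per-example perplexity/loss. A minimal, hedged sketch of the same computation with plain transformers is below (the model path and texts are placeholders; unlike cal_ppl.py, this version does not mask prompt tokens, so the loss is averaged over every token of each example):

```python
# Minimal sketch (not the repository's implementation): average token-level
# loss for each example, from a plain forward pass with labels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/sft/checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

texts = ["example 1 ...", "example 2 ..."]  # placeholder data
per_example_loss = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # over the shifted tokens of this single example.
        out = model(**inputs, labels=inputs["input_ids"])
    per_example_loss.append(out.loss.item())

print(per_example_loss)
```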

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Dec 4, 2024
@hiyouga hiyouga closed this as completed Dec 4, 2024
@Word2VecT
Author

So you mean I need to run this Python script myself?
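
(The script is indeed run directly from the command line; it uses fire, so the keyword arguments of calculate_ppl become CLI flags. A hedged example invocation, reusing the paths from the reproduction above; the flag names and the save_name argument are assumptions taken from the script's signature and should be checked against the current file, and older checkouts place the script at scripts/cal_ppl.py:)

```bash
python scripts/stat_utils/cal_ppl.py \
    --model_name_or_path saves/LLama3.1-8B/full/train_2024-11-14-22-43-17 \
    --dataset gsm8k_train \
    --template llama3 \
    --save_name ppl.json
```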

@Word2VecT
Author

main/scripts/stat_utils/cal_ppl.py

I tried running it, but it fails with the following error:
12/05/2024 00:18:03 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
12/05/2024 00:18:03 - INFO - llamafactory.model.loader - all params: 7,615,616,512
0%| | 0/330 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
tensor = as_tensor(value)
^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 737, in as_tensor
return torch.tensor(value)
^^^^^^^^^^^^^^^^^^^
RuntimeError: Could not infer dtype of NoneType

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/mnt/petrelfs/tangzinan/LLaMA-Factory/scripts/cal_ppl.py", line 137, in
fire.Fire(calculate_ppl)
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/LLaMA-Factory/scripts/cal_ppl.py", line 114, in calculate_ppl
for batch in tqdm(dataloader):
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/tqdm/std.py", line 1181, in iter
for obj in iterable:
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 673, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/data/data_collator.py", line 598, in call
batch = pad_without_fast_tokenizer_warning(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/data/data_collator.py", line 66, in pad_without_fast_tokenizer_warning
padded = tokenizer.pad(*pad_args, **pad_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3536, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 240, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/mnt/petrelfs/tangzinan/anaconda3/envs/factory/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 791, in convert_to_tensors
raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (images in this case) have excessive nesting (inputs type list where type int is expected).

@hiyouga hiyouga mentioned this issue Dec 5, 2024
@hiyouga
Owner

hiyouga commented Dec 5, 2024

@Word2VecT fixed
