Is there a simple way to not shuffle the training dataset? #1204


Closed
XuanRen4470 opened this issue Oct 17, 2023 · 4 comments · Fixed by #6388
Labels
solved This problem has been already solved

Comments


XuanRen4470 commented Oct 17, 2023

I know the trainer shuffles by default, but I don't want the training dataset to be shuffled.

@hiyouga hiyouga added the wontfix This will not be worked on label Oct 19, 2023
@hiyouga hiyouga closed this as not planned Oct 19, 2023
@histmeisah

Is there now a parameter to disable shuffling of the dataset? I would like to try curriculum learning.

@JerryDaHeLian

Same question here!

hiyouga (Owner) commented Mar 7, 2024

`--streaming --buffer_size 1` will not shuffle the dataset
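To see why a buffer size of 1 disables shuffling: streaming datasets are typically shuffled with a fixed-size reservoir buffer, and when the buffer holds only one element there is never a choice to make, so items come out in their original order. Below is a minimal, self-contained sketch of that buffer-shuffle idea (not LLaMA-Factory's actual code; the function name `buffer_shuffle` is illustrative):

```python
import random

def buffer_shuffle(stream, buffer_size, seed=0):
    """Approximate shuffling for a streamed dataset: keep a buffer of
    `buffer_size` items and yield a randomly chosen one, refilling from
    the stream. With buffer_size=1 the buffer always contains exactly
    the next item, so the original order is preserved."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # pick a random element from the buffer and emit it
            yield buffer.pop(rng.randrange(len(buffer)))
    # drain whatever remains at the end of the stream
    rng.shuffle(buffer)
    yield from buffer

print(list(buffer_shuffle(range(5), buffer_size=1)))  # → [0, 1, 2, 3, 4]
```

With `buffer_size=1`, `rng.randrange(1)` always returns 0, so every item is yielded as soon as it arrives: effectively no shuffling.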

@hiyouga hiyouga added solved This problem has been already solved and removed wontfix This will not be worked on labels Mar 7, 2024
@hiyouga hiyouga closed this as completed Mar 7, 2024
@iaoxuesheng

> `--streaming --buffer_size 1` will not shuffle the dataset

Hello, when I set `--streaming True --buffer_size 1`, I get the following error:

```
Traceback (most recent call last):
  File "src/train_bash.py", line 14, in <module>
    main()
  File "src/train_bash.py", line 5, in main
    run_exp()
  File "/cephfs/renjinshan/work/LLaMA-Factory-0.7.0/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/cephfs/renjinshan/work/LLaMA-Factory-0.7.0/src/llmtuner/train/sft/workflow.py", line 33, in run_sft
    dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "/cephfs/renjinshan/work/LLaMA-Factory-0.7.0/src/llmtuner/data/loader.py", line 176, in get_dataset
    print_function(next(iter(dataset)))
  File "/cephfs/renjinshan/work/miniconda3/envs/llama_factory/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 1384, in __iter__
    for key, example in ex_iterable:
  File "/cephfs/renjinshan/work/miniconda3/envs/llama_factory/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 679, in __iter__
    yield from self._iter()
  File "/cephfs/renjinshan/work/miniconda3/envs/llama_factory/lib/python3.8/site-packages/datasets/iterable_dataset.py", line 718, in _iter
    transformed_batch.update(self.function(*function_args, **self.fn_kwargs))
  File "/cephfs/renjinshan/work/LLaMA-Factory-0.7.0/src/llmtuner/data/preprocess.py", line 79, in preprocess_supervised_dataset
    if len(examples["prompt"][i]) % 2 != 1 or len(examples["response"][i]) != 1:
TypeError: object of type 'NoneType' has no len()
```
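The `TypeError` above means `len()` was called on a field that is `None`, i.e. some example reached `preprocess_supervised_dataset` with a missing `prompt` or `response`. A common workaround for this class of error is to pre-filter such rows before preprocessing. The helper below is a hypothetical sketch of that guard, not code from LLaMA-Factory:

```python
def is_valid_pair(example):
    """Hypothetical pre-filter: skip examples whose prompt/response
    fields are missing (None), which would otherwise crash length
    checks like the one in preprocess_supervised_dataset."""
    prompt = example.get("prompt")
    response = example.get("response")
    return prompt is not None and response is not None

examples = [
    {"prompt": [{"role": "user", "content": "hi"}],
     "response": [{"role": "assistant", "content": "hello"}]},
    {"prompt": None, "response": None},  # malformed row that triggers the TypeError
]
print([is_valid_pair(e) for e in examples])  # → [True, False]
```

With the `datasets` library, such a predicate could be applied via `dataset.filter(is_valid_pair)` before the map step, so only well-formed rows reach the length check.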

@hiyouga hiyouga marked this as a duplicate and then as not a duplicate of #7276 Mar 12, 2025