While training Qwen2.5-VL-7B with SFT on video data, I hit the following error during dataset preprocessing:
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3475, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/opt/conda/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3398, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/src/llamafactory/data/processor/supervised.py", line 99, in preprocess_dataset
[rank0]: input_ids, labels = self._encode_data_example(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/src/llamafactory/data/processor/supervised.py", line 43, in _encode_data_example
[rank0]: messages = self.template.mm_plugin.process_messages(prompt + response, images, videos, audios, self.processor)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/src/llamafactory/data/mm_plugin.py", line 1454, in process_messages
[rank0]: mm_inputs = self._get_mm_inputs(images, videos, audios, processor)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/src/llamafactory/data/mm_plugin.py", line 1423, in _get_mm_inputs
[rank0]: video_data = self._regularize_videos(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/app/src/llamafactory/data/mm_plugin.py", line 1384, in _regularize_videos
[rank0]: video_stream = next(stream for stream in container.streams if stream.type == "video")
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: StopIteration
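For context, the bare `StopIteration` is what `next()` raises when the generator passed to it yields nothing, which here means the container had no stream with `type == "video"`. A minimal sketch of the same lookup with a default value, so a missing stream can be detected instead of raising; `first_video_stream` and the `SimpleNamespace` objects are hypothetical stand-ins for illustration, not LLaMA-Factory or PyAV code:

```python
from types import SimpleNamespace


def first_video_stream(streams):
    """Return the first stream whose type is "video", or None if there is none."""
    # Passing a default to next() avoids StopIteration on an empty generator.
    return next((s for s in streams if s.type == "video"), None)


# Hypothetical stand-ins for `container.streams` (assumption: each stream has a `.type`).
audio_only = [SimpleNamespace(type="audio")]
with_video = [SimpleNamespace(type="audio"), SimpleNamespace(type="video")]

print(first_video_stream(audio_only))        # no video stream found
print(first_video_stream(with_video).type)   # a video stream was found
```

With this pattern the caller can check for `None` and decide whether to skip the sample or raise a more descriptive error.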
Although I ran some ffmpeg checks on my videos beforehand, they did not catch the missing-stream problem, so some broken videos ended up in the training JSONL.
Feature request: could preprocessing skip videos/media that fail to process, instead of raising an error that halts the whole training run? This would help with large datasets, since it saves the time otherwise spent tracking down and removing files that fail for unknown reasons.
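Until something like this exists, one possible workaround is to pre-filter the dataset before training. The sketch below is only an illustration under assumptions: `has_video_stream` and `split_loadable` are hypothetical helper names, the `"videos"` key is assumed to hold a list of file paths per example, and the check relies on PyAV (`av.open` and `container.streams.video`), imported lazily so the rest of the code works without it.

```python
def has_video_stream(path):
    """Return True if PyAV can open `path` and finds at least one video stream."""
    import av  # lazy import: only needed when this default check is used

    try:
        with av.open(path) as container:
            return len(container.streams.video) > 0
    except Exception:
        # Broad on purpose for a pre-filter: any failure to open counts as unusable.
        return False


def split_loadable(examples, check=has_video_stream, key="videos"):
    """Split examples into (kept, skipped) based on whether all their media pass `check`."""
    kept, skipped = [], []
    for example in examples:
        if all(check(path) for path in example.get(key, [])):
            kept.append(example)
        else:
            skipped.append(example)
    return kept, skipped
```

For example, `kept, skipped = split_loadable(records)` on the parsed JSONL records would leave only examples whose videos all open with a video stream, and `skipped` can be logged for later inspection.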
Others
No response