Skip to content

Qwen-Omni在混合模态数据上dpo训练时,训练卡住 #8151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task done
wwfnb opened this issue May 25, 2025 · 4 comments
Open
1 task done

Qwen-Omni在混合模态数据上dpo训练时,训练卡住 #8151

wwfnb opened this issue May 25, 2025 · 4 comments
Assignees
Labels
bug Something isn't working pending This problem is yet to be addressed

Comments

@wwfnb
Copy link

wwfnb commented May 25, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[2025-05-25 15:24:24,879] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-25 15:24:26 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-25 15:24:26 [init.py:239] Automatically detected platform cuda.

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
  • Python version: 3.11.0
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.50.0.dev0
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • GPU number: 8
  • GPU memory: 79.32GB
  • DeepSpeed version: 0.16.7
  • vLLM version: 0.8.5
  • Git commit: 75d7c35

Reproduction

**数据集定义如下**: "OURDATA_ALL_MIX_DPO": {
        "file_name": "OURDATA_ALL_MIX_DPO.json",
        "formatting": "sharegpt",
        "ranking": true,
        "columns": {
            "messages": "conversations",
            "chosen": "chosen",
            "rejected": "rejected",
            "images": "images",
            "audios": "audios",
            "videos": "videos"
        }
    },
**训练配置如下:**
### model
model_name_or_path: "/data/LLaMA-Factory/saves/ANSWER_POSITION/ourdata_as_cot_others_no_cot_1/checkpoint-1200/7168final" # ourdata sft基础上训练
image_max_pixels: 262144
video_max_pixels: 16384
video_fps: 1
video_maxlen: 48 # 视频输入最大总帧数 # 原版是针对单视频的,后修改源码改成了多视频时按视频时长分配maxlen
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: OURDATA_ALL_MIX_DPO
template: qwen2_omni
cutoff_len: 4096
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 100
dataloader_num_workers: 100

### output
output_dir: saves/ourdata_dpo/OURDATA_ALL_MIX_DPO
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1 # 1
gradient_accumulation_steps: 16 # 16
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
**问题**
当使用如上的配置以及数据集在Qwen-omni-7B上进行dpo训练时,训练会一直卡在第一个step。 数据集是多种模态混合的,具体来说, 会包含 Text->Text, Image+Text->Text, Video+Text -> Text, Audio+Text -> Text, 这四种数据组合。 在这个训练之前,我们已经尝试了对Qwen-Omni-7B这四种数据上分别使用dpo训练, 均可以成功训练,但是当我们将这四中数据混合起来一起对Qwen-Omni-7B进行dpo训练时, 便会一直卡住第一个step。 我们尝试了: 1. 将video_fps 降低到1,以及设置 video_maxlen: 48。 但是这个不起作用。 2. 在训练之前设置环境变量, export NCCL_P2P_LEVEL=NVL。 这个也不起作用。 
我们不清楚,是当前不支持这种混合数据的dpo训练方式吗, 在sft中使用这种混合数据训练是没有问题的。

Others

以下是数据集中的一些训练样本:
Audio+Text -> Text:
{"conversations": [{"from": "human", "value": "You are a fair, professional, and neutral AI evaluator. \nYou are tasked with evaluating two different responses generated for the same user query, and determining which one is better.\n\n###Final Output Format\nPlease think step by step. First, give the analysis, and then make the final judgment.\noutput the final verdict in the following format:\nIf you believe Assistant A's response is better, output: The final verdict is [[A]].\nIf you believe Assistant B's response is better, output: The final verdict is [[B]].\n###Input Format\n[User Question]\nWhat are the acoustic features of the Kung Fu style punch noises in the audio clip?\n[The Start of Assistant A's Answer]\nThe punch sounds in the Kung Fu style have the following characteristics:\n\n- Sound source: human voice\n- Action: punching\n- Style: Kung Fu\n[The End of Assistant A's Answer]\n[The Start of Assistant B's Answer]\nThe punch sounds have a low quality, noisy recording, are mono and sound like they were recorded with a phone.\n[The End of Assistant B's Answer]"}], "chosen": {"from": "gpt", "value": "1. Factual Accuracy: This criterion evaluates the correctness and reliability of the information provided. It matters because accurate information is crucial for understanding the acoustic features of the sounds. Assistant A provides a more factual description of the sound source and action, while Assistant B focuses on the recording quality and source.Based on the provided criterion of factual accuracy, Assistant B's response is more accurate. Assistant B correctly identifies the recording quality as low, noisy, and mono, and notes that the recording seems to have been made with a phone. This provides a clear and factual description of the acoustic features of the punch sounds. In contrast, Assistant A's response includes a factual error by stating that the sound source is a human voice, which is not accurate given the nature of the punch sounds.\n\n#\n2. Detail Level: This criterion assesses the depth of information provided about the acoustic features. It is important as it helps users understand the nuances of the sounds. Assistant A gives a broad overview of the sound source and action, while Assistant B delves into the specifics of the recording quality and source.Based on the criterion provided, which focuses on the detail level of the information provided about the acoustic features of the Kung Fu style punch noises, Assistant B's response is more detailed. Assistant B specifies the recording quality as low and noisy, mentions that the recording is mono, and notes that it seems to have been recorded with a phone. This level of detail provides a clearer picture of the acoustic features of the sounds.\n\nTherefore,\n3. Relevance to Query: This criterion evaluates how well the responses address the specific question asked. It is important because it ensures that the information provided is directly useful to the user. Assistant A focuses on the style and action, while Assistant B focuses on the recording quality and source.Based on the provided criterion, which focuses on the relevance to the query, Assistant B's response is more aligned with the user's question. The user asked about the acoustic features of the Kung Fu style punch noises, and Assistant B's response directly addresses this by mentioning the low quality, noisy recording, mono sound, and the fact that it seems to have been recorded with a phone. These details are directly relevant to the acoustic features of the sounds.\n\nAssistant A's response, while providing some information, does not delve into the acoustic characteristics of the sounds as requested by the user. It focuses more on the style and action rather than the acoustic features.\n\n#\n. The final verdict is [[B]]."}, "rejected": {"from": "gpt", "value": "1. Factual Accuracy: : This criterion evaluates the correctness and precision of the information provided. It is important because it directly affects the reliability of the response. Assistant A provides specific details about the sound source and action, which are factual, while Assistant B focuses on the recording quality and source, which are also factual but presented differently.# Analysis\n\n### Factual Accuracy\n- Assistant A: Provides specific details about the sound source (human voice), action (punching), and style (Kung Fu). These details are factual and directly answer the user's query about the acoustic features.\n- Assistant B: Focuses on the recording quality (low quality, noisy), the mono nature of the recording, and the source (phone). While these points are also factual, they do not directly address the acoustic features of the Kung Fu style punch sounds.\n\n### Verdict\nBased on the criterion of factual accuracy, Assistant A's response is more aligned with the user's query, as it provides specific acoustic features of the Kung Fu style punch sounds. Therefore,\n2. Detail Level: : This criterion assesses the depth and specificity of the information provided. It is important because it can help users understand the nuances of the acoustic features. Assistant A offers a more detailed breakdown of the sound characteristics, whereas Assistant B provides a more general overview of the recording quality and source.Based on the criterion of detail level, Assistant A provides a more detailed breakdown of the sound characteristics, including the sound source, action, and style. This level of detail helps to understand the nuances of the acoustic features of the Kung Fu style punch sounds. In contrast, Assistant B offers a more general overview focusing on the recording quality and source, which, while relevant, does not delve into the specific acoustic features as comprehensively.\n\n#\n3. Relevance to Query: : This criterion evaluates how directly the response addresses the user's question about the acoustic features of the Kung Fu style punch sounds. It matters because it ensures the information provided is pertinent to the user's needs. Assistant A's response is more relevant as it specifically mentions the style and action, which are key acoustic features, while Assistant B's response is more about the recording quality.Based on the provided criterion of relevance to the query, Assistant A's response is more directly addressing the user's question about the acoustic features of the Kung Fu style punch sounds. Assistant A mentions the sound source, action, and style, which are key aspects of the acoustic features. In contrast, Assistant B's response focuses more on the recording quality and source, which, while relevant, do not directly answer the specific question about the acoustic features of the punch sounds.\n\n#\n. The final verdict is [[A]]."}, "audios": ["/data/LLaMA-Factory/prepare/dataset/align_anything/audios/b0b1a82b-07dd-47ed-932f-a34a6484a204.wav"], "images": [], "videos": []}
Text->Text:
{"conversations": [{"from": "human", "value": "You are a fair, professional, and neutral multimodal AI evaluator. \nYou are tasked with evaluating two different multimodal responses (which may include Text, Image, Video, and Audio) generated for the same user query, and determining which one is better. \nBased on the overall analysis, clearly determine which response is superior. \n\n\n###Important Notes:\nStay completely neutral: Do not be influenced by the order, length, writing style, or the assistant\u2019s name. \nDo not favor responses simply because they are longer or use more elaborate language. \n\n\n###Final Output Format\nAfter completing the full analysis, you must output the final verdict in the following format:\nIf you believe Assistant A\u2019s response is better, output: [[A]]\nIf you believe Assistant B\u2019s response is better, output: [[B]]. \n\n\n###Input Format\n[User Question]\n[Question]Wanda has 62 crayons. Dina has 28 and Jacob has two fewer crayons than Dina. How many crayons do they have in total?\n[Answer]Jacob has 28 - 2 = 26 crayons. You can find the total number of crayons by adding the number of crayons each person has: 26 crayons + 62 crayons + 28 crayons = 116 crayons. The answer is 116.\n\n[Question]Jill can run up a hill at 9 feet/second and down a hill at 12 feet/second. How long does it take her to run up and down a 900 foot hill?\n[Answer]First find the time it takes Jill to run up the hill: 900 feet / 9 feet / second = 100 seconds. Then find the time it takes Jill to run down the hill: 900 feet / 12 feet / second = 75 seconds. Then add the two times to find Jill's total time: 100 seconds + 75 seconds = 175 seconds. The answer is 175.\n\n[Question]Jace drives 60 miles per hour. If Jace drives for 4 hours straight, take a 30-minute break, and then drives for another 9 hours straight, how many miles will he travel?\n[Answer]Jace drives a total of 4 + 9 = 13 hours. Jace will travel 60 * 13 = 780 miles. The answer is 780.\n\n[Question]John jogs at a speed of 4 miles per hour when he runs alone, but runs at 6 miles per hour when he is being dragged by his 100-pound German Shepherd dog. If John and his dog go on a run together for 30 minutes, and then John runs for an additional 30 minutes by himself, how far will John have traveled?\n[Answer]\n\n[The Start of Assistant A's Answer]\nFirst, find the distance John runs with his dog: 6 miles/hour * 0.5 hours = 3 miles.\nNext, find the distance John runs alone: 4 miles/hour * 0.5 hours = 2 miles.\nThen add the two distances to find the total distance John travels: 3 miles + 2 miles = 5 miles. The answer is 5.\n[The End of Assistant A's Answer]\n\n[The Start of Assistant B's Answer]\nJohn covers 6 miles in 30 minutes when jogging with his dog, and he manages to cover another 4 miles in the subsequent 30 minutes of jogging alone. Therefore, John will have traveled 10 miles in total during his one-hour workout.\n[The End of Assistant B's Answer]"}], "chosen": {"from": "gpt", "value": "1. Clarity of Explanation: : This criterion assesses how clearly and coherently the steps and logic behind the calculations are explained. A clear explanation helps the user understand the process, which is important for educational purposes and for users who might not be familiar with the concepts.# Analysis\n\nAssistant A's response provides a step-by-step calculation for the distance John travels with his dog and then separately for the distance he travels alone. The explanation is clear and logical, breaking down the problem into two distinct parts and then summing them up to get the total distance. This approach ensures that the user can follow the process easily.\n\nAssistant B's response, however, does not provide a clear explanation of the steps taken to arrive at the answer. It simply states the distances covered in each segment of the run without showing the calculations or explaining the logic behind the numbers. This makes it less clear for the user to understand how the total distance was derived.\n2. Use of Units: : This criterion examines whether the correct units are used and maintained throughout the calculation process. Proper use of units is essential for the accuracy and relevance of the final answer, especially in contexts involving speed and distance.### Analysis\n\nAssistant A's response:\n- Provides a step-by-step calculation of the distance John travels with his dog and then separately for the distance he travels alone.\n- Uses the correct units (miles per hour and hours) and maintains them throughout the calculation.\n- The final answer is 5 miles, which is derived from the sum of the distances covered in each segment of the run.\n\nAssistant B's response:\n- States that John covers 6 miles in 30 minutes when jogging with his dog and 4 miles in 30 minutes when jogging alone.\n- Adds these distances to get a total of 10 miles.\n- However, the calculation of the distances is incorrect. John runs 6 miles per hour with his dog, so in 30 minutes (0.5 hours), he should cover 3 miles, not 6 miles. Similarly, when running alone at 4 miles per hour, he should cover 2 miles in 30 minutes, not 4 miles.\n\n### Verdict\nAssistant A's response correctly calculates the distances and maintains proper use of units, leading to the accurate final answer of 5 miles. Assistant B's response contains errors in the calculation of distances and does not properly use units.\n3. Step-by-Step Breakdown: : This criterion looks at whether the response provides a detailed, step-by-step breakdown of the calculation process. A detailed breakdown can help users follow the logic and verify each step, which is particularly useful for complex problems.# Analysis\n\nAssistant A's response provides a detailed, step-by-step breakdown of the calculation process. It clearly outlines the distance John runs with his dog and the distance he runs alone, and then adds these distances to find the total distance traveled. This approach adheres well to the criterion of providing a detailed, step-by-step breakdown.\n\nAssistant B's response, on the other hand, does not provide a detailed breakdown of the calculation process. It states the distances covered but does not show the calculation steps, which makes it less clear and less helpful for users who might need to verify each step.\n\n\nThe final verdict is [[A]]."}, "rejected": {"from": "gpt", "value": "1. Calculation Accuracy: This criterion evaluates the correctness of the mathematical calculations presented in the responses. It is crucial because even a small error in calculation can lead to an incorrect final answer, which would mislead the user.### Analysis\n\n#### Calculation Accuracy\n- Assistant A's Response: The response calculates the distance John travels with his dog as 6 miles/hour * 0.5 hours = 3 miles, and the distance he travels alone as 4 miles/hour * 0.5 hours = 2 miles. The total distance is then 3 miles + 2 miles = 5 miles. However, this calculation is incorrect because the time for jogging with the dog and jogging alone is not correctly accounted for. The time for jogging with the dog is 30 minutes (0.5 hours), and the time for jogging alone is also 30 minutes (0.5 hours). The total distance should be 6 miles + 4 miles = 10 miles.\n- Assistant B's Response: The response correctly calculates the distance John travels with his dog as 6 miles in 30 minutes and the distance he travels alone as 4 miles in 30 minutes. The total distance is 6 miles + 4 miles = 10 miles. This calculation is accurate.\n\n##\n2. Final Answer Presentation: This criterion evaluates how the final answer is presented. It is important because a clear and concise presentation of the final answer helps the user quickly identify the solution to the problem without having to sift through the entire explanation.# Analysis\nAssistant A's response calculates the distance John travels with his dog and the distance he travels alone, then adds these distances to find the total distance traveled. However, the calculation for the distance traveled alone is incorrect. John jogs at 4 miles per hour for 0.5 hours, which equals 2 miles, not 4 miles as stated. Therefore, the total distance should be 3 miles (with the dog) + 2 miles (alone) = 5 miles, not 10 miles.\n\nAssistant B's response correctly calculates the distance John travels with his dog (6 miles) and the distance he travels alone (4 miles), and adds these distances to find the total distance traveled, which is 10 miles. The final answer is clearly presented and correct.\n3. Consistency in Units: This criterion checks whether the units used in the calculations are consistent and correctly applied throughout the response. Consistency in units is important to avoid confusion and ensure the accuracy of the final answer.### Analysis\n\nAssistant A's response:\n- Calculates the distance John runs with his dog: 6 miles/hour * 0.5 hours = 3 miles.\n- Calculates the distance John runs alone: 4 miles/hour * 0.5 hours = 2 miles.\n- Adds the two distances to find the total distance John travels: 3 miles + 2 miles = 5 miles.\n- The response is consistent in units, using miles and hours correctly.\n\nAssistant B's response:\n- States that John covers 6 miles in 30 minutes when jogging with his dog.\n- States that John manages to cover another 4 miles in the subsequent 30 minutes of jogging alone.\n- Adds the two distances to find the total distance John travels: 6 miles + 4 miles = 10 miles.\n- The response is also consistent in units, using miles and hours correctly.\n\n### Verdict\nBoth responses are consistent in units and correctly apply the units throughout the response. However, Assistant B's response provides a more accurate total distance by correctly summing the distances covered in each segment of the run. Assistant A's response incorrectly calculates the distance John runs alone, leading to an incorrect total distance.\n\nTherefore,\n\n\nThe final verdict is [[B]]."}, "images": [], "audios": [], "videos": []}

@wwfnb wwfnb added bug Something isn't working pending This problem is yet to be addressed labels May 25, 2025
@Kuangdd01
Copy link
Collaborator

Kuangdd01 commented May 25, 2025

hi能试一下zero2看下报错情况吗 zero3 hang住的问题可能一下看不太出来, dpo data_collocator里面没有fake_inputs行为,所以z3的时候会因为设备上不同的梯度信息导致会hang住 @wwfnb

@Kuangdd01 Kuangdd01 self-assigned this May 25, 2025
@wwfnb
Copy link
Author

wwfnb commented May 25, 2025

hi能试一下zero2看下报错情况吗 zero3 hang住的问题可能一下看不太出来, dpo data_collocator里面没有fake_inputs行为,所以z3的时候会因为设备上不同的梯度信息导致会hang住 @wwfnb

@Kuangdd01 好的, 但是可能需要明天才能看到结果,现在没有可以用的GPU。

@phanxuanphucnd
Copy link

How to train qwen omni with other languages?

Regarding issues such as speech tokenzie, ... is it necessary to extend vocab? and how to supplement discrete units for other languages?

Can anyone help me with a solution?

@Kuangdd01
Copy link
Collaborator

How to train qwen omni with other languages?

Regarding issues such as speech tokenzie, ... is it necessary to extend vocab? and how to supplement discrete units for other languages?

Can anyone help me with a solution?

If I don't misunderstand, your question is similar to this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working pending This problem is yet to be addressed
Projects
None yet
Development

No branches or pull requests

3 participants