
The performance decreases seriously after finetuning on qwen2.5-Omni model with lora #8146


Open · 1 task done
humble-gambler opened this issue May 23, 2025 · 13 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

@humble-gambler

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I tried to use the Omni model for emotion recognition. The fine-tuning dataset is relatively simple: the label is used directly as the assistant's response for autoregressive training. During fine-tuning, the training loss quickly dropped to 0, but when I evaluated the model on the training set afterwards, the classification accuracy was very low, so this is not overfitting. On the test set, the prediction score dropped from 0.5+ to 0.2+, and after fine-tuning many labels that were originally predicted correctly are now predicted incorrectly.

Reproduction

A sample from the training set after tokenization is shown below:

input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 27076, 304, 7802, 533, 24231, 13, 16246, 279, 1946, 2766, 11, 697, 3383, 374, 311, 18649, 437, 279, 21261, 13302, 553, 279, 4541, 2341, 12856, 304, 279, 2766, 11, 23643, 279, 7966, 4815, 432, 323, 5889, 279, 4541, 2341, 12856, 29381, 2652, 13, 151645, 198, 151644, 872, 198, 151652, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151653, 576, 1697, 304, 279, 2766, 2727, 25, 3555, 525, 498, 7598, 4607, 20205, 389, 279, 79049, 57597, 1946, 11, 8253, 279, 14269, 1584, 6839, 304, 279, 2766, 13, 4615, 2550, 1969, 387, 1172, 825, 19772, 2383, 25470, 11882, 504, 279, 2701, 1140, 25, 6247, 11, 12421, 11, 20628, 11, 18514, 11, 12761, 11, 67062, 11, 8679, 13, 151645, 198, 151644, 77091, 198, 4243, 70, 590, 151645, 198]
inputs:
<|im_start|>system
You are a helpful assistant specialized in affective computing. Given the input video, your task is to undertand the emotions expressed by the active spearker in the video, analyze the reasons behind it and respond the active spearker compassionately.<|im_end|>
<|im_start|>user
<|vision_bos|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|vision_eos|> The person in the video says: What are you guys?. Based on the multimodal input, determine the emotional state shown in the video. Your output must be only one emotion label strictly chosen from the following list: happy, sad, neutral, angry, surprise, disgust, fear.<|im_end|>
<|im_start|>assistant
disgust<|im_end|>

label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 4243, 70, 590, 151645, 198]
labels:
disgust<|im_end|>
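
For reference, the masking above follows the usual SFT recipe: every prompt token is set to -100 so the loss ignores it, and only the assistant answer tokens are supervised. A minimal sketch of that pattern (a hypothetical helper, not the actual LLaMA-Factory code):

```python
# Sketch of the label masking shown above: prompt tokens become IGNORE_INDEX (-100),
# so the cross-entropy loss is computed only on the assistant answer tokens.
IGNORE_INDEX = -100

def build_labels(input_ids: list[int], num_answer_tokens: int) -> list[int]:
    # e.g. num_answer_tokens = 5 for "disgust<|im_end|>\n" -> [4243, 70, 590, 151645, 198]
    prompt_len = len(input_ids) - num_answer_tokens
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
```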

And the training loss curve:

Image

Others

Thanks for helping.

@humble-gambler humble-gambler added bug Something isn't working pending This problem is yet to be addressed labels May 23, 2025
@Kuangdd01
Collaborator

maybe overfit?

@hiyouga hiyouga closed this as completed May 26, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels May 26, 2025
@hiyouga hiyouga reopened this May 26, 2025
@hiyouga hiyouga added bug Something isn't working pending This problem is yet to be addressed and removed solved This problem has been already solved labels May 26, 2025
@hiyouga
Owner

hiyouga commented May 26, 2025

@Kuangdd01 not sure, the accuracy on the training set decreased too

@Kuangdd01
Collaborator

Kuangdd01 commented May 27, 2025

Can you provide the training script? @humble-gambler
And what do the predictions on the training set look like?

@humble-gambler
Author

Can you provide the training script? @humble-gambler

Sure, the training script is below. I also tried a very small learning rate such as 1.0e-7. With it, the loss curve is relatively volatile and does not drop as fast. However, the model still performs worse than the original one, just not as severely. It seems that the fine-tuning doesn't work at all, so it is a little weird.

Image
### model
model_name_or_path: ../Qwen/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
#lora_target: q_proj, v_proj
lora_target: all

### dataset
dataset: sft_DFEW_pretrain_data, sft_MER2025_pretrain_data, sft_MELD_pretrain_data
template: qwen2_omni
cutoff_len: 3072
#max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 4


### output
output_dir: saves/qwen2_omni-7b/lora/sft_pretrain_full_data
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
freeze_vision_tower: true
freeze_multi_modal_projector: true
learning_rate: 5.0e-5
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
flash_attn: auto 


### eval
val_size: 0.1 
per_device_eval_batch_size: 1
eval_strategy: epoch            
eval_steps: 1            

# new append
use_audio_in_video: true

@Kuangdd01
Collaborator

Emmm, the lr is fine.
Q1: Can you show the predictions on a part of the training set? It is an easy classification task.
As we can see, the loss is extremely low. Do predictions appear to contain some abnormal tokens?
Q2: Does this dataset contain video-audio data?

@humble-gambler
Author

Emmm, the lr is fine. Q1: Can you show the predictions on a part of the training set? It is an easy classification task. As we can see, the loss is extremely low. Do predictions appear to contain some abnormal tokens? Q2: Does this dataset contain video-audio data?

Q1: The prediction format is simple. I added the prompt "You should output the emotion label by using the following format: [emotion label]", so the predictions don't contain abnormal tokens. The model simply outputs a label such as [sadness] or [anger]. So it produces the right format, but sometimes the wrong label. (BTW, can I output the prediction tokens during the fine-tuning stage when using LLaMA-Factory? What should I set?)

Q2: Yes, it contains video data.

@Kuangdd01
Collaborator

You can save several Lora adapters, then do prediction after training.
If we do not add an extra prompt like "You should output the emotion label by using the following format: [emotion label]", will performance get better?
I can't figure out why fine-tuning leads to worse performance even on the training set when the loss even drops to zero. Maybe inputs differed between training and inference?
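
A quick sanity check that needs no inference: PEFT initializes lora_B to zeros, so if no gradient ever reaches the adapter, every lora_B matrix in the saved checkpoint stays exactly zero. A small sketch, assuming the adapter file sits under the output_dir from the config above:

```python
# Sketch: inspect a saved LoRA adapter. lora_B starts at zero, so all-zero lora_B matrices
# after training would mean the adapter never received gradients.
import torch
from safetensors.torch import load_file

adapter_file = "saves/qwen2_omni-7b/lora/sft_pretrain_full_data/adapter_model.safetensors"  # assumed path
state = load_file(adapter_file)

b_keys = [k for k in state if "lora_B" in k]
zero_b = sum(1 for k in b_keys if torch.count_nonzero(state[k]) == 0)
print(f"{zero_b}/{len(b_keys)} lora_B matrices are still all-zero")
```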

@Kuangdd01 Kuangdd01 self-assigned this May 27, 2025
@humble-gambler
Author

You can save several Lora adapters, then do prediction after training. If we do not add an extra prompt like "You should output the emotion label by using the following format: [emotion label]", will performance get better? I can't figure out why fine-tuning leads to worse performance even on the training set when the loss even drops to zero. Maybe inputs differed between training and inference?

I think the extra prompt is not the key reason, and the inputs are not different either.
To figure out the reason, I tried fine-tuning with PEFT directly: I added a PEFT LoRA config on the original Omni model plus a classifier head to perform the classification task.
I found that the LoRA weights have requires_grad=True, yet they receive no gradient at all (see the toy reproduction sketch at the end of this comment).

classifier.backbone.base_model.model.model.layers[0].self_attn.q_proj.lora_A["default"].weight:

Parameter containing:
tensor([[-0.0049,  0.0132, -0.0155,  ..., -0.0128, -0.0017,  0.0145],
        [ 0.0096, -0.0107,  0.0045,  ...,  0.0023,  0.0104,  0.0049],
        [ 0.0143,  0.0016, -0.0082,  ..., -0.0006, -0.0112, -0.0137],
        ...,
        [-0.0134, -0.0078, -0.0045,  ..., -0.0114,  0.0130,  0.0076],
        [ 0.0020, -0.0055, -0.0035,  ...,  0.0119, -0.0068,  0.0117],
        [ 0.0148,  0.0027,  0.0038,  ...,  0.0151, -0.0150, -0.0032]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
classifier.backbone.base_model.model.model.layers[0].self_attn.q_proj.lora_A["default"].weight.grad:

None

whereas my own classifier head receives gradients normally.

classifier.classifier[1].weight: 
Parameter containing:
tensor([[-0.0030, -0.0018, -0.0181,  ..., -0.0151, -0.0042, -0.0045],
        [ 0.0388, -0.0291,  0.0386,  ...,  0.0017, -0.0027,  0.0119],
        [ 0.0166,  0.0007,  0.0219,  ..., -0.0141,  0.0043, -0.0089],
        ...,
        [-0.0082,  0.0044,  0.0145,  ..., -0.0015,  0.0049,  0.0056],
        [ 0.0061,  0.0009, -0.0034,  ...,  0.0153,  0.0006, -0.0114],
        [ 0.0009, -0.0078, -0.0099,  ...,  0.0044,  0.0015,  0.0032]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
classifier.classifier[1].weight.grad:

tensor([[ 2.4289e-06, -2.0117e-06,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  4.2282e-07],
        [-4.6730e-04,  3.8528e-04, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -8.1062e-05],
        [-4.5395e-04,  3.7384e-04, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -7.8678e-05],
        ...,
        [ 1.3447e-04, -1.1063e-04,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  2.3246e-05],
        [-9.1076e-05,  7.5340e-05, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -1.5855e-05],
        [ 1.4901e-05, -1.2279e-05,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  2.5928e-06]], device='cuda:0', dtype=torch.bfloat16)

It's weird too 😂 I am not sure whether some view operations in the Omni model code break a node in the computation graph and block the gradient from flowing back, which would make the LLaMA-Factory fine-tuning unable to work properly. But if no gradient flows back, the performance should at least stay the same as the original model.
(I am not sure if I used PEFT incorrectly or if it is a problem with the model 😂)
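
To rule out a PEFT-level problem, a self-contained toy reproduction of the same check might help (a hypothetical tiny module, not Qwen2.5-Omni): wrap a module that exposes a q_proj layer with a LoRA adapter, stack a classifier head on top, run one backward pass, and see whether the LoRA matrices receive gradients.

```python
# Toy sketch (not the Omni model): verify that gradients reach the LoRA matrices after backward().
import torch
import torch.nn as nn
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model

class TinyBackbone(nn.Module):
    """Stand-in backbone; only the module name "q_proj" matters for LoRA targeting."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.q_proj(x)

backbone = get_peft_model(TinyBackbone(), LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj"]))
classifier = nn.Linear(32, 7)  # 7 emotion classes, as in the task above

x = torch.randn(4, 32)
loss = F.cross_entropy(classifier(backbone(x)), torch.randint(0, 7, (4,)))
loss.backward()

lora_A = backbone.base_model.model.q_proj.lora_A["default"].weight
print("lora_A.grad is None:", lora_A.grad is None)              # expected: False if the graph is intact
print("classifier grad is None:", classifier.weight.grad is None)
```

If the toy setup receives gradients but the Omni backbone does not, the break is somewhere in the Omni forward pass rather than in PEFT itself.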

@Kuangdd01
Collaborator

Thanks for reporting this. I think something went wrong.
@Luffy-ZY-Wang Hi, have you encountered this issue in your case?

@Luffy-ZY-Wang

Luffy-ZY-Wang commented May 27, 2025

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:

Image

I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.

My training config can be found here: #7767 (comment)
with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

@Kuangdd01 Kuangdd01 changed the title The performance decreases seriously after finetuning on qwen2.5-Omni model. The performance decreases seriously after finetuning on qwen2.5-Omni model with lora May 27, 2025
@Kuangdd01
Collaborator

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:

Image

I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.

My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

@Luffy-ZY-Wang

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:
Image
I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.
My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

Yes it performs normally as expected after training. It has improvements on different metrics such as BERTScore, BLEU and ROUGE.

@humble-gambler
Author

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:
Image
I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.
My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

Yes it performs normally as expected after training. It has improvements on different metrics such as BERTScore, BLEU and ROUGE.

@Luffy-ZY-Wang @Kuangdd01 Thanks for your help. Maybe I got something wrong; I will keep trying.
