
The performance decreases seriously after finetuning on qwen2.5-Omni model with lora #8146


Open · 1 task done
humble-gambler opened this issue May 23, 2025 · 13 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

@humble-gambler

Reminder

  • I have read the above rules and searched the existing issues.

System Info

I tried to use the Omni model for emotion recognition. The fine-tuning dataset is relatively simple: the label is used directly as the assistant's response for autoregressive training. During fine-tuning, the training loss quickly dropped to 0, but when I evaluated the model on the training set afterwards, the classification accuracy was very low, so this is not overfitting. On the test set, the prediction score dropped from 0.5+ to 0.2+, and after fine-tuning many labels that were originally predicted correctly are now predicted incorrectly.

Reproduction

A sample from the training set after tokenization is shown below:

input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 27076, 304, 7802, 533, 24231, 13, 16246, 279, 1946, 2766, 11, 697, 3383, 374, 311, 18649, 437, 279, 21261, 13302, 553, 279, 4541, 2341, 12856, 304, 279, 2766, 11, 23643, 279, 7966, 4815, 432, 323, 5889, 279, 4541, 2341, 12856, 29381, 2652, 13, 151645, 198, 151644, 872, 198, 151652, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151656, 151653, 576, 1697, 304, 279, 2766, 2727, 25, 3555, 525, 498, 7598, 4607, 20205, 389, 279, 79049, 57597, 1946, 11, 8253, 279, 14269, 1584, 6839, 304, 279, 2766, 13, 4615, 2550, 1969, 387, 1172, 825, 19772, 2383, 25470, 11882, 504, 279, 2701, 1140, 25, 6247, 11, 12421, 11, 20628, 11, 18514, 11, 12761, 11, 67062, 11, 8679, 13, 151645, 198, 151644, 77091, 198, 4243, 70, 590, 151645, 198]
inputs:
<|im_start|>system
You are a helpful assistant specialized in affective computing. Given the input video, your task is to undertand the emotions expressed by the active spearker in the video, analyze the reasons behind it and respond the active spearker compassionately.<|im_end|>
<|im_start|>user
<|vision_bos|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|vision_eos|> The person in the video says: What are you guys?. Based on the multimodal input, determine the emotional state shown in the video. Your output must be only one emotion label strictly chosen from the following list: happy, sad, neutral, angry, surprise, disgust, fear.<|im_end|>
<|im_start|>assistant
disgust<|im_end|>

label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 4243, 70, 590, 151645, 198]
labels:
disgust<|im_end|>
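
For reference, the masking above follows the usual SFT recipe: every prompt token is set to -100 so the loss ignores it, and only the assistant answer tokens are supervised. A minimal sketch of that pattern (a hypothetical helper, not the actual LLaMA-Factory code):

```python
# Sketch of the label masking shown above: prompt tokens become IGNORE_INDEX (-100),
# so the cross-entropy loss is computed only on the assistant answer tokens.
IGNORE_INDEX = -100

def build_labels(input_ids: list[int], num_answer_tokens: int) -> list[int]:
    # e.g. num_answer_tokens = 5 for "disgust<|im_end|>\n" -> [4243, 70, 590, 151645, 198]
    prompt_len = len(input_ids) - num_answer_tokens
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]
```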

And the training loss curve:

Image

Others

Thanks for helping.

@humble-gambler humble-gambler added bug Something isn't working pending This problem is yet to be addressed labels May 23, 2025
@Kuangdd01
Collaborator

maybe overfit?

@hiyouga hiyouga closed this as completed May 26, 2025
@hiyouga hiyouga added solved This problem has been already solved and removed bug Something isn't working pending This problem is yet to be addressed labels May 26, 2025
@hiyouga hiyouga reopened this May 26, 2025
@hiyouga hiyouga added bug Something isn't working pending This problem is yet to be addressed and removed solved This problem has been already solved labels May 26, 2025
@hiyouga
Owner

hiyouga commented May 26, 2025

@Kuangdd01 not sure, the accuracy on the training set decreased too

@Kuangdd01
Collaborator

Kuangdd01 commented May 27, 2025

Can you provide the training script? @humble-gambler
And what do the predictions on the training set look like?

@humble-gambler
Author

Can you provide the training script? @humble-gambler

Sure, the training script is below. I also tried a very small learning rate such as 1.0e-7. With it, the loss curve is relatively volatile and does not drop as fast. However, the model still performs worse than the original one, just not as severely. It seems that the fine-tuning doesn't work at all, so it is a little weird.

Image
### model
model_name_or_path: ../Qwen/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_alpha: 16
#lora_target: q_proj, v_proj
lora_target: all

### dataset
dataset: sft_DFEW_pretrain_data, sft_MER2025_pretrain_data, sft_MELD_pretrain_data
template: qwen2_omni
cutoff_len: 3072
#max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 4
dataloader_num_workers: 4


### output
output_dir: saves/qwen2_omni-7b/lora/sft_pretrain_full_data
logging_steps: 10
save_steps: 2000
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 2
freeze_vision_tower: true
freeze_multi_modal_projector: true
learning_rate: 5.0e-5
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
flash_attn: auto 


### eval
val_size: 0.1 
per_device_eval_batch_size: 1
eval_strategy: epoch            
eval_steps: 1            

# new append
use_audio_in_video: true

@Kuangdd01
Collaborator

Emmm, the lr is fine.
Q1: Can you show the predictions on a part of the training set? It is an easy classification task.
As we can see, the loss is extremely low. Do predictions appear to contain some abnormal tokens?
Q2: Does this dataset contain video-audio data?

@humble-gambler
Author

Emmm, the lr is fine. Q1: Can you show the predictions on a part of the training set? It is an easy classification task. As we can see, the loss is extremely low. Do predictions appear to contain some abnormal tokens? Q2: Does this dataset contain video-audio data?

Q1: The prediction format is simple. I added the prompt "You should output the emotion label by using the following format: [emotion label]", so the predictions don't contain abnormal tokens. The model simply outputs a label such as [sadness] or [anger]. So it produces the right format, but sometimes the wrong label. (BTW, can I output the prediction tokens during the fine-tuning stage when using LLaMA-Factory? What should I set?)

Q2: Yes, it contains video data.

@Kuangdd01
Collaborator

You can save several Lora adapters, then do prediction after training.
If we do not add an extra prompt like "You should output the emotion label by using the following format: [emotion label]", will performance get better?
I can't figure out why fine-tuning leads to worse performance even on the training set when the loss even drops to zero. Maybe inputs differed between training and inference?
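
A quick sanity check that needs no inference: PEFT initializes lora_B to zeros, so if no gradient ever reaches the adapter, every lora_B matrix in the saved checkpoint stays exactly zero. A small sketch, assuming the adapter file sits under the output_dir from the config above:

```python
# Sketch: inspect a saved LoRA adapter. lora_B starts at zero, so all-zero lora_B matrices
# after training would mean the adapter never received gradients.
import torch
from safetensors.torch import load_file

adapter_file = "saves/qwen2_omni-7b/lora/sft_pretrain_full_data/adapter_model.safetensors"  # assumed path
state = load_file(adapter_file)

b_keys = [k for k in state if "lora_B" in k]
zero_b = sum(1 for k in b_keys if torch.count_nonzero(state[k]) == 0)
print(f"{zero_b}/{len(b_keys)} lora_B matrices are still all-zero")
```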

@Kuangdd01 Kuangdd01 self-assigned this May 27, 2025
@humble-gambler
Author

You can save several Lora adapters, then do prediction after training. If we do not add an extra prompt like "You should output the emotion label by using the following format: [emotion label]", will performance get better? I can't figure out why fine-tuning leads to worse performance even on the training set when the loss even drops to zero. Maybe inputs differed between training and inference?

I think the extra prompt is not the key reason, and the inputs are not different either.
To figure out the reason, I tried fine-tuning with PEFT directly: I added a PEFT LoRA config on the original Omni model plus a classifier head to perform the classification task.
I found that the LoRA weights have requires_grad=True, yet they receive no gradient at all (see the toy reproduction sketch at the end of this comment).

classifier.backbone.base_model.model.model.layers[0].self_attn.q_proj.lora_A["default"].weight:

Parameter containing:
tensor([[-0.0049,  0.0132, -0.0155,  ..., -0.0128, -0.0017,  0.0145],
        [ 0.0096, -0.0107,  0.0045,  ...,  0.0023,  0.0104,  0.0049],
        [ 0.0143,  0.0016, -0.0082,  ..., -0.0006, -0.0112, -0.0137],
        ...,
        [-0.0134, -0.0078, -0.0045,  ..., -0.0114,  0.0130,  0.0076],
        [ 0.0020, -0.0055, -0.0035,  ...,  0.0119, -0.0068,  0.0117],
        [ 0.0148,  0.0027,  0.0038,  ...,  0.0151, -0.0150, -0.0032]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
classifier.backbone.base_model.model.model.layers[0].self_attn.q_proj.lora_A["default"].weight.grad:

None

whereas my own classifier head receives gradients normally.

classifier.classifier[1].weight: 
Parameter containing:
tensor([[-0.0030, -0.0018, -0.0181,  ..., -0.0151, -0.0042, -0.0045],
        [ 0.0388, -0.0291,  0.0386,  ...,  0.0017, -0.0027,  0.0119],
        [ 0.0166,  0.0007,  0.0219,  ..., -0.0141,  0.0043, -0.0089],
        ...,
        [-0.0082,  0.0044,  0.0145,  ..., -0.0015,  0.0049,  0.0056],
        [ 0.0061,  0.0009, -0.0034,  ...,  0.0153,  0.0006, -0.0114],
        [ 0.0009, -0.0078, -0.0099,  ...,  0.0044,  0.0015,  0.0032]],
       device='cuda:0', dtype=torch.bfloat16, requires_grad=True)
classifier.classifier[1].weight.grad:

tensor([[ 2.4289e-06, -2.0117e-06,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  4.2282e-07],
        [-4.6730e-04,  3.8528e-04, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -8.1062e-05],
        [-4.5395e-04,  3.7384e-04, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -7.8678e-05],
        ...,
        [ 1.3447e-04, -1.1063e-04,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  2.3246e-05],
        [-9.1076e-05,  7.5340e-05, -0.0000e+00,  ...,  0.0000e+00,
         -0.0000e+00, -1.5855e-05],
        [ 1.4901e-05, -1.2279e-05,  0.0000e+00,  ..., -0.0000e+00,
          0.0000e+00,  2.5928e-06]], device='cuda:0', dtype=torch.bfloat16)

It's weird too 😂 I am not sure whether some view operations in the Omni model code break a node in the computation graph and block the gradient from flowing back, which would make the LLaMA-Factory fine-tuning unable to work properly. But if no gradient flows back, the performance should at least stay the same as the original model.
(I am not sure if I used PEFT incorrectly or if it is a problem with the model 😂)
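
To rule out a PEFT-level problem, a self-contained toy reproduction of the same check might help (a hypothetical tiny module, not Qwen2.5-Omni): wrap a module that exposes a q_proj layer with a LoRA adapter, stack a classifier head on top, run one backward pass, and see whether the LoRA matrices receive gradients.

```python
# Toy sketch (not the Omni model): verify that gradients reach the LoRA matrices after backward().
import torch
import torch.nn as nn
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model

class TinyBackbone(nn.Module):
    """Stand-in backbone; only the module name "q_proj" matters for LoRA targeting."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)

    def forward(self, x):
        return self.q_proj(x)

backbone = get_peft_model(TinyBackbone(), LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj"]))
classifier = nn.Linear(32, 7)  # 7 emotion classes, as in the task above

x = torch.randn(4, 32)
loss = F.cross_entropy(classifier(backbone(x)), torch.randint(0, 7, (4,)))
loss.backward()

lora_A = backbone.base_model.model.q_proj.lora_A["default"].weight
print("lora_A.grad is None:", lora_A.grad is None)              # expected: False if the graph is intact
print("classifier grad is None:", classifier.weight.grad is None)
```

If the toy setup receives gradients but the Omni backbone does not, the break is somewhere in the Omni forward pass rather than in PEFT itself.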

@Kuangdd01
Collaborator

Thanks for reporting this. I think something went wrong.
@Luffy-ZY-Wang Hi, have you encountered this issue in your case?

@Luffy-ZY-Wang

Luffy-ZY-Wang commented May 27, 2025

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:

Image

I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.

My training config can be found here: #7767 (comment)
with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

@Kuangdd01 Kuangdd01 changed the title The performance decreases seriously after finetuning on qwen2.5-Omni model. The performance decreases seriously after finetuning on qwen2.5-Omni model with lora May 27, 2025
@Kuangdd01
Collaborator

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:

Image

I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.

My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

@Luffy-ZY-Wang

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:
Image
I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.
My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

Yes it performs normally as expected after training. It has improvements on different metrics such as BERTScore, BLEU and ROUGE.

@humble-gambler
Author

Thanks for reporting this. I think something went wrong. @Luffy-ZY-Wang Hi, have you encountered this issue in your case?

TBH, I didn't encounter this issue in my case:
Image
I could also get normal grad curve during training. But I didn't check the grad matrix mentioned above.
My training config can be found here: #7767 (comment) with deepspeed disabled (for some unknown reason lora+ds3+omni_trainset could not work on Qwen2.5Omni)

Does your model perform normally after training? Because the loss curve looks similar to the above.

Yes it performs normally as expected after training. It has improvements on different metrics such as BERTScore, BLEU and ROUGE.

@Luffy-ZY-Wang @Kuangdd01 Thanks for your help. Maybe I got something wrong; I will keep trying.
