After PPO training, the model's answers are exactly the same as before training? #8105
Unanswered
yaya159456 asked this question in Q&A
Replies: 0 comments
Reminder
System Info
For the reward model type I chose the API option, where scores are returned by an external service, and for the fine-tuning type I chose LoRA. After PPO training the loss was decreasing, so I merged the adapter into the base model and ran inference with the merged model. Why are the answers it returns identical to the base model's?
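A quick way to narrow this down is to inspect the trained adapter before merging. Below is a minimal diagnostic sketch, assuming the adapter is PEFT-style LoRA (as LLaMA-Factory produces); the model and adapter paths are hypothetical placeholders.

```python
# Minimal diagnostic sketch: check whether the PPO-trained LoRA adapter
# actually contains non-zero updates. Paths are hypothetical placeholders.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
model = PeftModel.from_pretrained(base, "path/to/ppo_lora_adapter")

# In LoRA the weight delta is lora_B @ lora_A, and lora_B starts at zero.
# If every lora_B matrix is still (near) zero after training, the adapter
# learned nothing, and merging it leaves the base model unchanged, which
# would produce exactly the symptom described above.
for name, param in model.named_parameters():
    if "lora_B" in name:
        print(name, param.abs().max().item())
```

If the printed maxima are all zero (or vanishingly small), the PPO step never updated the adapter; if they are clearly non-zero, the problem is more likely in the merge step or in how the merged model is loaded for inference.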
Reproduction
The config file is as follows:
(The configuration content did not load in the original post.)
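For context, here is a minimal sketch of the merge step described in the question, assuming it was done with PEFT's merge_and_unload, which is roughly what LLaMA-Factory's export command does for LoRA adapters; all paths are hypothetical placeholders since the actual config is unavailable.

```python
# Hedged sketch of the LoRA merge step, assuming PEFT's merge_and_unload.
# All paths are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/base_model")
merged = PeftModel.from_pretrained(base, "path/to/ppo_lora_adapter").merge_and_unload()
merged.save_pretrained("path/to/merged_model")

# Save the tokenizer alongside so the merged directory is self-contained.
tokenizer = AutoTokenizer.from_pretrained("path/to/base_model")
tokenizer.save_pretrained("path/to/merged_model")
```

If inference on the merged directory still matches the base model token for token, combining this with the lora_B check above should tell whether the adapter or the merge is at fault.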