[DPO/KTO] Mixtral Load Balancing Loss #1544
Question towards @lewtun @kashif:
We noticed that the load balancing loss (aux_loss) implemented in MoEs (modeling_mixtral.py#L1244) is not added to the loss in the DPO/KTO trainers.
Isn't this needed so that the load balancing between experts is still guaranteed after fine-tuning, or does the KL penalization sufficiently prevent the router weights from changing too much?
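For context, a minimal sketch of what adding the auxiliary loss on top of a preference loss *could* look like (this is hypothetical, not what the current trainers do; `preference_loss` stands in for whatever the DPO/KTO trainer computes):

```python
# Hypothetical sketch: combine a preference loss with Mixtral's load
# balancing loss. `model` is a MixtralForCausalLM; `preference_loss`
# is a placeholder for the DPO/KTO loss computed elsewhere.
outputs = model(
    input_ids,
    attention_mask=attention_mask,
    output_router_logits=True,  # makes the model return `aux_loss`
)
loss = preference_loss + model.config.router_aux_loss_coef * outputs.aux_loss
loss.backward()
```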
Great question @claralp. My initial intuition was that we are changing the weights of the model while keeping the routing of the experts intact. If one were to also add the aux loss, I am currently not sure how that would fit into the preference tuning objective...
@kashif So your suggestion is to use a LoRA/QLoRA adapter that does not target the router layers?
@kashif, so I guess it is important to set …
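A sketch of what such an adapter config might look like, assuming the PEFT library and Mixtral's module names (attention projections and expert MLPs are targeted; the router linear, named `gate`, is left out):

```python
from peft import LoraConfig, get_peft_model

# Sketch: adapt attention and expert weights but leave the router
# ("gate" in Mixtral) untouched, so it receives no LoRA updates.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
```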
One clarification... in the preference tuning losses we are not explicitly backpropagating through the categorical cross-entropy loss of the model, but rather via the log-probs obtained from the logits... so these methods are not doing anything to the router's weights in MoE models.
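Roughly what that computation looks like (a simplified sketch of how DPO-style losses derive per-sequence log-probs from the logits; the real trainer additionally shifts logits/labels and handles padding):

```python
import torch
import torch.nn.functional as F

def batch_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probs of `labels` under `logits` (simplified).

    logits: (batch, seq_len, vocab); labels/mask: (batch, seq_len).
    Note that no cross-entropy loss and no aux_loss enters here; the
    preference loss is built from these log-probs alone.
    """
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    return (per_token_logps * mask).sum(-1)
```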
@kashif we could observe a gradient in the gate weights even when backpropagating via the log-probs obtained from the logits (as in DPO/KTO).
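One way to check this (a debugging sketch; Mixtral's router modules are named `block_sparse_moe.gate`):

```python
# Debugging sketch: after calling loss.backward() on the preference loss,
# inspect whether the router ("gate") weights received any gradient.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name and param.grad is not None:
        print(f"{name}: max |grad| = {param.grad.abs().max().item():.3e}")
```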