
[DPO/KTO] Mixtral Load Balancing Loss #1544

Closed
claralp opened this issue Apr 17, 2024 · 6 comments · Fixed by #1765

Comments

claralp (Contributor) commented Apr 17, 2024

Question for @lewtun @kashif:

We noticed that the load balancing loss (aux_loss) implemented for MoE models in modeling_mixtral.py#L1244 is not added to the loss computed by the DPO/KTO trainers.

Isn't this needed so that the load balancing between experts is still guaranteed after fine-tuning, or does the KL penalty sufficiently prevent the router weights from changing too much?
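
For reference, a purely hypothetical sketch of what including that term could look like (the DPO/KTO trainers do not do this); it assumes a Hugging Face MixtralForCausalLM forward pass run with output_router_logits=True, so that outputs.aux_loss is populated:

```python
def add_load_balancing_term(dpo_loss, outputs, aux_loss_coef):
    """Hypothetical: add the scaled Mixtral load balancing loss to a preference loss.

    `outputs` must come from a forward pass with output_router_logits=True so that
    `outputs.aux_loss` is set; `aux_loss_coef` would typically be
    model.config.router_aux_loss_coef.
    """
    return dpo_loss + aux_loss_coef * outputs.aux_loss
```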

kashif (Collaborator) commented Apr 17, 2024

Great question @claralp! My initial intuition was that we are changing the weights of the model while keeping the routing of the experts intact. If one were to also add the aux loss, I am not sure how that would fit into the preference-tuning objective...

PhilipMay (Contributor) commented Apr 17, 2024

> My initial intuition was that we are changing the weights of the model while keeping the routing of the experts intact.

@kashif So your suggestion is to use a LoRA/QLoRA adapter that does not target the router layers?
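
For illustration, a LoRA config along those lines might look like the sketch below. The module names are assumptions based on the Hugging Face Mixtral architecture (attention projections plus the expert w1/w2/w3 layers); the router's gate modules are simply left out of target_modules:

```python
from peft import LoraConfig

# Sketch only: adapt the attention and expert MLP projections, but leave the
# "gate" (router) modules of the Mixtral MoE blocks untouched.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
```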

claralp (Contributor, Author) commented Apr 18, 2024

@kashif, so I guess it is important to set output_router_logits=False for preference tuning.
I had a look at the code, and it looks like setting this does not stop the router weights from being trained.
So where are the expert routing weights disabled in preference tuning?
If that is a required manual step, maybe a note should be added to the trl MoE examples.
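
If it does have to be done manually, a minimal sketch would be to freeze the router parameters before training; the name filter assumes the block_sparse_moe.gate naming of the Hugging Face Mixtral implementation:

```python
# Sketch: freeze the Mixtral router ("gate") parameters so that DPO/KTO
# cannot update them during full fine-tuning.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name:
        param.requires_grad = False
```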

kashif (Collaborator) commented Apr 18, 2024

one clarification... in the preference tuning losses we are not explicitly back-propagating through the model's categorical cross-entropy loss, but rather through the log-probs obtained from the logits... so in this case these methods are not doing anything to the router's weights in MoE models
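
Roughly, the path described here looks like the sketch below (a simplification of what the TRL trainers do): the log-probs for the label tokens are gathered from the logits, so the model's built-in cross-entropy loss (the place where Mixtral adds the scaled aux_loss) is never used.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, loss_mask):
    """Sum of per-token log-probs of the label tokens (simplified sketch).

    logits: (batch, seq_len, vocab); labels, loss_mask: (batch, seq_len).
    Label shifting is omitted for brevity.
    """
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    return (per_token_logps * loss_mask).sum(-1)
```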

claralp (Contributor, Author) commented Apr 22, 2024

@kashif we could observe a gradient on the gate weights even when backpropagating via the log-probs obtained from the logits (as in DPO/KTO).
Unless the gradient calculation is explicitly switched off for the 'gate' weights, or they are excluded from the target_modules in the PEFT config (which they are by default), these gating weights do get trained by DPO/KTO.
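
One way to reproduce this observation (a sketch; parameter names assumed from the Hugging Face Mixtral implementation) is to inspect the gate gradients after a backward pass on the preference loss:

```python
# Sketch: after loss.backward() on the DPO/KTO loss, check whether the router
# ("gate") parameters received a non-zero gradient.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name and param.grad is not None:
        print(name, param.grad.abs().max().item())
```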


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
