
[DPO/KTO] Mixtral Load Balancing Loss #1544

Closed
claralp opened this issue Apr 17, 2024 · 6 comments · Fixed by #1765

Comments

claralp (Contributor) commented Apr 17, 2024

Question for @lewtun @kashif:

We noticed that the load balancing loss (aux_loss) implemented for MoE models in modeling_mixtral.py#L1244 is not added to the loss computed by the DPO/KTO trainers.

Isn't this needed so that the load balancing between experts is still guaranteed after fine-tuning, or does the KL penalty sufficiently prevent the router weights from changing too much?
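
For reference, a purely hypothetical sketch of what including that term could look like (the DPO/KTO trainers do not do this); it assumes a Hugging Face MixtralForCausalLM forward pass run with output_router_logits=True, so that outputs.aux_loss is populated:

```python
def add_load_balancing_term(dpo_loss, outputs, aux_loss_coef):
    """Hypothetical: add the scaled Mixtral load balancing loss to a preference loss.

    `outputs` must come from a forward pass with output_router_logits=True so that
    `outputs.aux_loss` is set; `aux_loss_coef` would typically be
    model.config.router_aux_loss_coef.
    """
    return dpo_loss + aux_loss_coef * outputs.aux_loss
```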

kashif (Collaborator) commented Apr 17, 2024

Great question @claralp! My initial intuition was that we are changing the weights of the model while keeping the routing of the experts intact. If one were to also add the aux loss, I am not sure how that would fit into the preference-tuning objective...

PhilipMay (Contributor) commented Apr 17, 2024

> My initial intuition was that we are changing the weights of the model while keeping the routing of the experts intact.

@kashif So your suggestion is to use a LoRA/QLoRA adapter that does not target the router layers?
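
For illustration, a LoRA config along those lines might look like the sketch below. The module names are assumptions based on the Hugging Face Mixtral architecture (attention projections plus the expert w1/w2/w3 layers); the router's gate modules are simply left out of target_modules:

```python
from peft import LoraConfig

# Sketch only: adapt the attention and expert MLP projections, but leave the
# "gate" (router) modules of the Mixtral MoE blocks untouched.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "w1", "w2", "w3"],
    task_type="CAUSAL_LM",
)
```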

claralp (Contributor, Author) commented Apr 18, 2024

@kashif, so I guess it is important to set output_router_logits=False for preference tuning.
I had a look at the code, and it looks like setting this does not stop the router weights from being trained.
So where are the expert routing weights disabled in preference tuning?
If that is a required manual step, maybe a note should be added to the trl MoE examples.
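
If it does have to be done manually, a minimal sketch would be to freeze the router parameters before training; the name filter assumes the block_sparse_moe.gate naming of the Hugging Face Mixtral implementation:

```python
# Sketch: freeze the Mixtral router ("gate") parameters so that DPO/KTO
# cannot update them during full fine-tuning.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name:
        param.requires_grad = False
```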

kashif (Collaborator) commented Apr 18, 2024

one clarification... in the preference tuning losses we are not explicitly back-propagating through the model's categorical cross-entropy loss, but rather through the log-probs obtained from the logits... so in this case these methods are not doing anything to the router's weights in MoE models
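
Roughly, the path described here looks like the sketch below (a simplification of what the TRL trainers do): the log-probs for the label tokens are gathered from the logits, so the model's built-in cross-entropy loss (the place where Mixtral adds the scaled aux_loss) is never used.

```python
import torch
import torch.nn.functional as F

def sequence_logps(logits, labels, loss_mask):
    """Sum of per-token log-probs of the label tokens (simplified sketch).

    logits: (batch, seq_len, vocab); labels, loss_mask: (batch, seq_len).
    Label shifting is omitted for brevity.
    """
    per_token_logps = torch.gather(
        F.log_softmax(logits, dim=-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)
    return (per_token_logps * loss_mask).sum(-1)
```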

claralp (Contributor, Author) commented Apr 22, 2024

@kashif we could observe a gradient on the gate weights even when backpropagating via the log-probs obtained from the logits (as in DPO/KTO).
Unless the gradient calculation is explicitly switched off for the 'gate' weights, or they are excluded from the target_modules in the PEFT config (which they are by default), these gating weights do get trained by DPO/KTO.
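
One way to reproduce this observation (a sketch; parameter names assumed from the Hugging Face Mixtral implementation) is to inspect the gate gradients after a backward pass on the preference loss:

```python
# Sketch: after loss.backward() on the DPO/KTO loss, check whether the router
# ("gate") parameters received a non-zero gradient.
for name, param in model.named_parameters():
    if "block_sparse_moe.gate" in name and param.grad is not None:
        print(name, param.grad.abs().max().item())
```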


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
