add aux_loss docs entry

huggingface · Jun 24, 2024 · 9f1f72d
1 parent c6ab93d
commit 9f1f72d

Showing 4 changed files with 32 additions and 0 deletions.
diff --git a/docs/source/cpo_trainer.mdx b/docs/source/cpo_trainer.mdx
@@ -86,6 +86,14 @@ The [RSO](https://arxiv.org/abs/2309.06657) authors propose to use a hinge loss
 
 The [IPO](https://arxiv.org/abs/2310.12036) authors provide a deeper theoretical understanding of the CPO algorithms, identify an issue with overfitting, and propose an alternative loss, which can be used via the `loss_type="ipo"` argument to the trainer. Note that the `beta` parameter is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike CPO, where it is summed only).
 
+### For Mixture of Experts Models: Enabling the auxiliary loss
+
+MoEs are most efficient when the load is distributed roughly evenly across experts.  
+To keep the load balanced during fine-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.  
+
+This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).  
+To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
+
 ## Logging
 
 While training and evaluating we record the following reward metrics:

diff --git a/docs/source/dpo_trainer.mdx b/docs/source/dpo_trainer.mdx
@@ -121,6 +121,14 @@ The [RPO](https://arxiv.org/abs/2404.19733) paper implements an iterative prefer
 
 The [AOT](https://arxiv.org/abs/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size.
 
+### For Mixture of Experts Models: Enabling the auxiliary loss
+
+MoEs are most efficient when the load is distributed roughly evenly across experts.  
+To keep the load balanced during fine-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.  
+
+This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).  
+To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
+
 ## Logging
 
 While training and evaluating we record the following reward metrics:

diff --git a/docs/source/kto_trainer.mdx b/docs/source/kto_trainer.mdx
@@ -92,6 +92,14 @@ Given the binary signal data indicating whether a completion is desirable or und
 The [BCO](https://arxiv.org/abs/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
 The `KTOTrainer` can be switched to this loss via the `loss_type="bco"` argument.
 
+### For Mixture of Experts Models: Enabling the auxiliary loss
+
+MoEs are most efficient when the load is distributed roughly evenly across experts.  
+To keep the load balanced during fine-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.  
+
+This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).  
+To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
+
 ## KTOTrainer
 
 [[autodoc]] KTOTrainer

diff --git a/docs/source/orpo_trainer.md b/docs/source/orpo_trainer.md
@@ -73,6 +73,14 @@ After this one can then call:
 orpo_trainer.train()
 ```
 
+### For Mixture of Experts Models: Enabling the auxiliary loss
+
+MoEs are most efficient when the load is distributed roughly evenly across experts.  
+To keep the load balanced during fine-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.  
+
+This option is enabled by setting `output_router_logits=True` in the model config (e.g. `MixtralConfig`).  
+To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: 0.001).
+
 ## Logging
 
 While training and evaluating we record the following reward metrics:
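
Note on the added sections: as a rough illustration of what they describe, the sketch below shows how a trainer might fold the router's auxiliary loss into the final loss. It is not taken from this commit. The helper `total_loss`, the `SimpleNamespace` stand-ins for a model config, and the numeric loss values are all hypothetical; only the attribute names `output_router_logits` and `router_aux_loss_coef` come from the documentation text above.

```python
from types import SimpleNamespace


def total_loss(preference_loss, aux_loss, config):
    """Return the final loss, adding the scaled auxiliary loss if enabled.

    Hypothetical helper: `config` is any object that may carry the
    attributes `output_router_logits` and `router_aux_loss_coef`.
    """
    # Without output_router_logits=True the auxiliary loss is ignored.
    if not getattr(config, "output_router_logits", False):
        return preference_loss
    # Fall back to the documented default of 0.001 when no coefficient is set.
    coef = getattr(config, "router_aux_loss_coef", 0.001)
    return preference_loss + coef * aux_loss


# Stand-ins for model configs (assumed values, for illustration only).
moe_config = SimpleNamespace(output_router_logits=True, router_aux_loss_coef=0.001)
dense_config = SimpleNamespace()  # no router settings at all

with_aux = total_loss(0.69, 2.0, moe_config)
without_aux = total_loss(0.69, 2.0, dense_config)
```

With these made-up numbers the auxiliary term adds only 0.001 * 2.0 to the loss, which suggests why the coefficient may need tuning if load balancing should carry more weight during fine-tuning.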