PPO / Reinforce Trainers #1540

Merged: 59 commits merged into huggingface:main on May 22, 2024

Conversation

Contributor

@vwxyzjn vwxyzjn commented Apr 15, 2024

This PR adds support for the REINFORCE / RLOO trainers from https://arxiv.org/pdf/2402.14740.pdf.

Note that the REINFORCE loss is a special case of the PPO loss, as shown below:

[image]

It matches the REINFORCE loss presented in the Cohere paper (where PPO uses the advantage estimate Â, REINFORCE uses the full RLHF reward R(y, x)):
[image]
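
To spell out the reduction (a sketch of the argument behind the screenshots above, in standard notation rather than the PR's exact symbols): with a single policy update per batch of on-policy samples, the probability ratio is 1 and the clip is inactive, so the gradient of the PPO objective collapses to the REINFORCE gradient once the per-token advantage is replaced by the sequence-level RLHF reward:

L^{\mathrm{PPO}}(\theta) = \mathbb{E}\Big[\min\big(r(\theta)\,\hat{A},\ \mathrm{clip}(r(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}\big)\Big],
\qquad r(\theta) = \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\mathrm{old}}}(y \mid x)}

\nabla_\theta L^{\mathrm{PPO}}\Big|_{\theta = \theta_{\mathrm{old}}} = \mathbb{E}\big[\hat{A}\,\nabla_\theta \log \pi_\theta(y \mid x)\big]
\;\longrightarrow\;
\mathbb{E}\big[R(y, x)\,\nabla_\theta \log \pi_\theta(y \mid x)\big] \quad \text{when } \hat{A} \to R(y, x),

which is exactly the REINFORCE policy gradient.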

We add the following files:

  • trl/trainer/ppov2_trainer.py
  • trl/trainer/ppov2_bandit_rloo_trainer.py
    • a PPO variant that 1) models the whole completion as a single (joint) action and 2) uses the RLOO loss, which does not require a value network
    • I copied this file directly from ppov2_trainer.py, so feel free to do a file diff to see the changes (e.g., the following diff shows how the RLOO loss is implemented; a rough sketch of the idea also follows this list)
[image]
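
For readers unfamiliar with RLOO, a minimal sketch of the leave-one-out baseline follows (illustrative only; the function name and tensor shapes are assumptions, not the trainer's actual code):

import torch

def rloo_loss(rewards: torch.Tensor, logprobs: torch.Tensor) -> torch.Tensor:
    # rewards:  (num_prompts, k) RLHF rewards for k sampled completions per prompt
    # logprobs: (num_prompts, k) summed log-probabilities of each completion under the policy
    k = rewards.shape[1]
    # baseline for sample i: mean reward of the other k - 1 completions for the same prompt
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    advantages = rewards - baseline
    # REINFORCE-style policy-gradient loss with the leave-one-out baseline
    return -(advantages.detach() * logprobs).mean()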

Two more examples show how they work with dummy reward models:

  • examples/scripts/minimal/ppo.py: a preliminary experiment shows the RLHF reward going up, so from an optimization standpoint it works as intended
[image]
  • examples/scripts/minimal/ppo_bandit_rloo.py: a preliminary experiment shows the RLHF reward going up, so from an optimization standpoint it works as intended; however, the KL more or less exploded, so we may need a larger beta for stronger regularization (a sketch of the KL penalty follows this list)
[image]
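
For context, the beta mentioned above is the coefficient of the per-token KL penalty that keeps the policy close to the reference model. A minimal sketch of that reward shaping (variable names and shapes are assumptions, not the trainer's exact code):

import torch

def kl_penalized_rewards(scores: torch.Tensor, logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor, beta: float) -> torch.Tensor:
    # scores:       (batch,) reward-model score for each completion
    # logprobs:     (batch, seq_len) per-token log-probs under the current policy
    # ref_logprobs: (batch, seq_len) per-token log-probs under the frozen reference model
    kl = logprobs - ref_logprobs        # per-token KL estimate
    rewards = -beta * kl                # KL penalty applied at every token
    rewards[:, -1] += scores            # reward-model score added at the final token
    return rewards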

@vwxyzjn vwxyzjn requested a review from lewtun April 15, 2024 16:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lapp0 lapp0 left a comment

Thanks so much for this contribution. I've been hoping to experiment with REINFORCE on transformers for a while now, but didn't have the time to roll my own implementation.

This is a great foundation in terms of functionality. I'll be playing around with it soon.

I think we should reduce the repetition and use inheritance from existing classes so we can take advantage of the great infrastructure built out by huggingface/transformers and huggingface/trl.

Happy to help if you're interested in collaborating, let me know.

Review threads (resolved) on: trl/trainer/ppov2_bandit_rloo_trainer.py, examples/scripts/minimal/ppo_bandit_rloo_large.py
Member

@lewtun lewtun left a comment

Epic work on adding these new RL trainers @vwxyzjn! I've left some high-level feedback on the RLOO trainer for now and will do a more fine-grained review once we've iterated a bit on the design.

Overall it looks super clean.

Review threads (resolved) on: examples/scripts/minimal/ppo.py, examples/scripts/minimal/ppo_bandit_rloo.py, trl/trainer/ppov2_bandit_rloo_trainer.py
Contributor Author

vwxyzjn commented Apr 23, 2024

Hi @lapp0, thanks for the review! I will look into these comments more closely. I started running some experiments and noticed that the KL of RLOO was orders of magnitude higher than that of the new PPO trainer. I'm not exactly sure of the reason yet, but I will investigate further.

[image]

The PPO / vanilla PG trainer actually seems quite stable now, with the RLHF reward going up and the model getting good scores and reasonable completions. There were some implementation details I found particularly helpful, such as truncating at the EOS token (i.e., --truncate_token eos), and I suspect the same technique could work nicely with RLOO.
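
For illustration, truncating at the EOS token amounts to masking out everything the model sampled after its first EOS in each completion. A minimal sketch of the idea (the function name, shapes, and padding convention are assumptions, not the PR's exact implementation):

import torch

def truncate_at_eos(responses: torch.Tensor, eos_token_id: int, pad_token_id: int) -> torch.Tensor:
    # responses: (batch, seq_len) sampled completion token ids
    is_eos = responses == eos_token_id
    first_eos = torch.where(
        is_eos.any(dim=1),
        is_eos.int().argmax(dim=1),                            # index of the first EOS in each row
        torch.full_like(responses[:, 0], responses.size(1)),   # no EOS: keep the whole row
    )
    positions = torch.arange(responses.size(1), device=responses.device).unsqueeze(0)
    # replace every token after the first EOS with padding
    return responses.masked_fill(positions > first_eos.unsqueeze(1), pad_token_id)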

Right now I am focused on a Zephyr PPO / vanilla PG recipe for the next couple of days, and I will look into RLOO right after.

[image] [image] [image]

@lapp0 lapp0 left a comment

I have a WIP fork. The main difference is that it lets transformers.Trainer set everything up (batch creation, accelerate / deepspeed, etc.) and instead overrides training_step.

https://github.com/lapp0/trl/blob/onpolicy/trl/trainer/rloo_trainer.py

The main behavioral difference is that it generates once per batch and runs for num_train_epochs, rather than generating once per update and running for num_train_epochs * num_updates. Have you experimented with updating once per batch, and if so, does this harm stability? Is it important that I retain the ability to generate once and then run multiple update epochs on the model outputs from that single generation?

Now that you mention it, it's possible that generating once per batch instead of once per update would improve the KL.

Review threads on: trl/trainer/ppov2_bandit_rloo_trainer.py, trl/trainer/ppov2_trainer.py
@lapp0 lapp0 mentioned this pull request Apr 25, 2024
Contributor Author

@vwxyzjn vwxyzjn left a comment

@lapp0 @lewtun thanks so much for the review! I put down some comments and TODO items.

Review threads on: examples/scripts/minimal/ppo.py, trl/trainer/ppov2_bandit_rloo_trainer.py, trl/trainer/ppov2_trainer.py
Contributor Author

vwxyzjn commented Apr 25, 2024

After some refactoring / bug fixes, the new RLOO also seems much more stable. I will report back when I have newer results.
[image]

Review thread (resolved) on trl/trainer/ppov2_trainer.py
Contributor

@younesbelkada younesbelkada left a comment

Thanks a lot for this huge work! I left some comments! I think the new classes should also be exposed in TRL's main init - LMK wdyt about my suggestions below 🙏

Comment on lines 123 to 156

def masked_mean(values, mask, axis=None):
    """Compute mean of tensor with masked values."""
    if axis is not None:
        return (values * mask).sum(axis=axis) / mask.sum(axis=axis)
    else:
        return (values * mask).sum() / mask.sum()


def masked_var(values, mask, unbiased=True):
    """Compute variance of tensor with masked values."""
    mean = masked_mean(values, mask)
    centered_values = values - mean
    variance = masked_mean(centered_values**2, mask)
    if unbiased:
        mask_sum = mask.sum()
        if mask_sum == 0:
            raise ValueError(
                "The sum of the mask is zero, which can happen when `mini_batch_size=1`; "
                "try increasing the `mini_batch_size` or `gradient_accumulation_steps`"
            )
        # note that if mask_sum == 1, then there is a division by zero issue
        # to avoid it you just need to use a larger minibatch_size
        bessel_correction = mask_sum / (mask_sum - 1)
        variance = variance * bessel_correction
    return variance


def masked_whiten(values, mask, shift_mean=True):
    """Whiten values with masked values."""
    mean, var = masked_mean(values, mask), masked_var(values, mask, False)
    whitened = (values - mean) * torch.rsqrt(var + 1e-8)
    if not shift_mean:
        whitened += mean
    return whitened
Contributor

Those look the same as the ones in trl/core.py (line 152 at 3b4c249):

def masked_mean(values: torch.Tensor, mask: torch.Tensor, axis: Optional[bool] = None) -> torch.Tensor:

Can't you re-use them from trl.core?

def get_reward(model, query_responses, tokenizer, context_length):

Contributor

Can you move this method to trl.core?


def get_reward(model, query_responses, tokenizer, context_length):
    attention_mask = query_responses != tokenizer.pad_token_id
    # position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum

Contributor

Suggested change (remove the commented-out line):
-    # position_ids = attention_mask.cumsum(1) - attention_mask.long()  # exclusive cumsum

Review threads on trl/trainer/rloo_trainer.py
Contributor

Should we move all PPO-related minimal scripts under a new ppo/ dir and the RLOO ones under an rloo/ dir? What do you think?

Member

@lewtun lewtun left a comment

Thanks for iterating on this epic PR @vwxyzjn! Overall it's looking quite close to being finished, and I think the main remaining points to address are splitting the configs into their own modules and seeing if we can hide config variables like world_size from the end user.

parser = HfArgumentParser((PPOConfig, ModelConfig))
config, model_config = parser.parse_args_into_dataclasses()
# remove output_dir if exists
shutil.rmtree(config.output_dir, ignore_errors=True)
Member

FYI you can set overwrite_output_dir in PPOConfig (via TrainingArguments)

Contributor Author

I gave it a quick test but it does not seem to remove the output_dir.

[image]

A quick search suggests that the removal logic is no longer there.

[image]

@@ -150,7 +150,7 @@ def unwrap_model_for_generation(
    if accelerator.state.deepspeed_plugin is not None and accelerator.state.deepspeed_plugin.zero_stage == 3:
        with deepspeed.zero.GatheredParameters(model.parameters()):
            remove_hooks(model)
-            yield model
+            yield accelerator.unwrap_model(model)
Member

I believe models wrapped with the DeepSpeedEngine can still generate, so I'm curious why this is needed

Contributor Author

Ah, it causes issues for RLOO; I was getting errors like `DeepSpeedEngine has no attribute 'generate'`, so we still need to unwrap it.
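
In other words, generation has to go through the unwrapped Hugging Face model rather than the DeepSpeed engine wrapper. Roughly (an illustrative fragment; queries and generation_config are assumed to be defined elsewhere):

with unwrap_model_for_generation(model, accelerator) as unwrapped_model:
    # the DeepSpeedEngine wrapper has no `generate`, so generate from the unwrapped model
    query_responses = unwrapped_model.generate(queries, generation_config=generation_config)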

"""Whether to use deepspeed to train the model"""

# various batch sizes
world_size: Optional[int] = None
Member

I see, but why do we only seem to need this for the RL trainers and not the other ones like SFTTrainer? In general, I'd like to avoid exposing this distributed stuff to the user if we can, because it might not be clear whether they should set the value manually or let accelerate handle it for them.

Review threads on: trl/trainer/ppov2_trainer.py, trl/trainer/rloo_trainer.py
Contributor Author

vwxyzjn commented May 15, 2024

Thank you @lewtun @younesbelkada @lapp0 for the review. I have addressed most of the concerns and also added some docs and benchmarks. Let me know if there is anything else needed :D

Contributor

@younesbelkada younesbelkada left a comment

Huge work! Thanks @vwxyzjn! Good for me to merge once @lewtun is happy with the latest changes and CI is green! 🚀


lapp0 commented May 18, 2024

Great work @vwxyzjn! Really impressed with the vLLM integration along with the other components you've introduced here.

I'll be working on a follow-up PR for quantized training using ppo_v2 once Unsloth's numerical stability issue is resolved, and will hopefully incorporate a few structural changes as well, so I don't have any further comments on structure right now.

Did any of your RLOO runs result in improved benchmarks, or at least improved score metrics? I was able to reproduce improving scores with ppov2 in my refactor of your branch with BnB / PEFT support, but I never managed to do the same with RLOO.

PPOv2 metrics:

[image]

Contributor Author

vwxyzjn commented May 21, 2024

@lapp0 Very nice to hear about your great results with PPOv2 and PEFT! I was able to get good results with 1B RLOO on TL;DR summarization. See https://moon-ci-docs.huggingface.co/docs/trl/pr_1540/en/rloo_trainer#benchmark-experiments.

@vwxyzjn vwxyzjn merged commit 13454d2 into huggingface:main May 22, 2024
9 checks passed

lapp0 commented May 22, 2024

Great work @vwxyzjn, really exciting research and implementation you have put together. Feel free to ping me on any other PRs.

qgallouedec added a commit that referenced this pull request Jul 15, 2024
commit 9e9dc96
Author: Maxim Kopecki <[email protected]>
Date:   Wed Jul 10 19:11:13 2024 +0200

    Added missing token kwarg in Peft model loading (#1825)

commit 7ddef5c
Author: Quentin Gallouédec <[email protected]>
Date:   Wed Jul 10 18:26:11 2024 +0200

    Make use of `trust_remote_code` consistent (#1806)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit a9cddf8
Author: Adnan Khan <[email protected]>
Date:   Wed Jul 10 11:25:07 2024 -0400

    Delete unused benchmark.yml workflow. (#1822)

commit 2860ce5
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jul 9 09:22:52 2024 +0200

    DPO Llava 1.5 and PaliGemma support (#1797)

    * llava support dpo

    * add_special_tokens=False only when possible

    * format

    * pali gemma

    * refactor size

    * remove image resize

    ---------

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 30e33bd
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jul 9 05:37:12 2024 +0200

    upgrade gh actions (#1818)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit d5a0d2d
Author: Costa Huang <[email protected]>
Date:   Mon Jul 8 11:12:41 2024 -0400

    Set dev version (#1817)

commit 314e8eb
Author: Puneet Singh Bhooi <[email protected]>
Date:   Mon Jul 8 19:11:36 2024 +0530

    fix broken url in `docs\source\index.mdx` (#1813)

commit e107920
Author: Costa Huang <[email protected]>
Date:   Mon Jul 8 09:38:09 2024 -0400

    0.9.6 release (#1816)

commit 78045de
Author: Alvaro Bartolome <[email protected]>
Date:   Mon Jul 8 01:59:26 2024 +0200

    Fix `TRL_USE_RICH` environment variable handling (#1808)

    * Add `strtobool` custom implementation from `distutils`

    * Fix `TRL_USE_RICH` handling via `strtobool`

    * Run `make precommit`

commit 747612f
Author: Alvaro Bartolome <[email protected]>
Date:   Fri Jul 5 16:28:59 2024 +0200

    Fix `torch_dtype` handling in `{DPO,SFT}Trainer` when provided via CLI (#1807)

    * Fix `torch_dtype` handling through CLI

    The `torch_dtype` is not properly handled when provided via the TRL CLI
    since it's provided initially as a string, but is then casted to
    `torch.dtype` before providing it to the `{DPO,SFT}Trainer`, which means
    that those trainers should handle the scenario where `torch_dtype` is a
    `torch.dtype` too.

    * Add `torch_dtype` tests in `test_{dpo,sft}_trainer.py`

    * Forward contribution credits

    * Run `make precommit`

    ---------

    Co-authored-by: Tash Srivastava <[email protected]>

commit 9e3a35b
Author: Michael <[email protected]>
Date:   Fri Jul 5 07:29:48 2024 -0400

    Remove extra print in reward_trainer.py (#1799)

    `print_rich_table` is called twice and the first call doesn't restrict to `num_print_samples`. Remove the first, extra call

commit 4402b36
Author: Quentin Gallouédec <[email protected]>
Date:   Thu Jul 4 14:29:25 2024 +0200

    clean examples (#1791)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 78f8228
Author: Noah Tye <[email protected]>
Date:   Wed Jul 3 11:10:50 2024 -0700

    Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig (#1794)

    * Preserve token fields when converting TrainingArguments to SFTConfig

    TrainingArguments.to_dict() redacts token fields, so we have to
    individually copy them over when converting to SFTConfig to avoid
    breaking push_to_hub functionality.

    Also adds a test.

    * run precommit

    * one-line args_as_dict definition per suggestion from kashif

    * generalize token copying to match TrainingArguments behavior

    * unwrap |= on dict, to support python 3.8

    * use .update instead of |= or for-loop

commit b6af2ed
Author: Kashif Rasul <[email protected]>
Date:   Wed Jul 3 08:29:16 2024 +0200

    add model_init_kwargs to training_args (#1787)

commit cd85b14
Author: Tommaso Buonocore <[email protected]>
Date:   Sat Jun 29 15:35:48 2024 +0200

    Fixed typo in SFT trainer docs (#1788)

    'STFConfig' instead of 'SFTConfig' appears multiple times in the doc, causing error when running the code snippets.

commit a57544f
Author: Kashif Rasul <[email protected]>
Date:   Thu Jun 27 15:47:58 2024 +0200

    fix docs and examples (#1780)

commit b68ff96
Author: Quentin Gallouédec <[email protected]>
Date:   Wed Jun 26 16:26:37 2024 +0200

    Visual DPO (#1647)

    * Remove extra whitespaces

    * idefics

    * vdpo

    * sft idefics

    * pad with test

    * use prompt instead of tokenizer

    * rm name main

    * support vlm in tokenize row

    * temp fix for regex in lora_target_module

    * format

    * vdpo

    * tmp float16 hard code

    * concatenated_forward support for vision

    * style and new command line

    * all-linear

    * format

    * delete old examples

    * get image

    * upcast

    * new test

    * modified test

    * new strat for tokenizer

    * rm token transfer

    * integrate vision in dpo example

    * format

    * add FDivergenceType back

    * precommit

    * pillow test dep

    * optional prompt

    * `evaluation_strategy` to `eval_strategy`

    * revert vsft change (oos)

    * update test

    * test

    * comment and support more in process

    * update process

    * update doc for vdpo

    * caution about limited support

    * Update docs/source/dpo_trainer.mdx

    Co-authored-by: Kashif Rasul <[email protected]>

    * revert DPO example changes

    * cleaner way to check if a model is vision

    * comment

    * update vdpo example

    * rename

    ---------

    Co-authored-by: Quentin Gallouédec <[email protected]>
    Co-authored-by: Kashif Rasul <[email protected]>

commit c8c01cc
Author: Mubin Manasia <[email protected]>
Date:   Wed Jun 26 03:23:36 2024 -0600

    Fix Documentation Overflow Issues for Long URLs in SFTConfig (#1774)

    * Update sft_config.py

    * Update sft_config.py

commit 3479606
Author: Costa Huang <[email protected]>
Date:   Wed Jun 26 03:18:22 2024 -0400

    Remove the leading space in the tldr preference dataset (#1773)

commit 7965b78
Author: Haozhe Ji <[email protected]>
Date:   Tue Jun 25 22:47:32 2024 +0800

    add Efficient Exact Optimization (EXO) (#1735)

    * add exo

    * fix a detail

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 56bd1bb
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jun 25 16:14:26 2024 +0200

    `evaluation_strategy` to `eval_strategy` (#1771)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 94d53e6
Author: Clara Pohland <[email protected]>
Date:   Mon Jun 24 21:27:00 2024 +0200

    MoE Models: option to add load balancing loss (#1765)

    * KTO: add aux loss

    * use router_aux_loss_coef in KtoTrainer when aux_loss enabled

    * align optional aux_loss in DPO, KTO, CPO, ORPO

    * precommit changes

    * fix KL forward kwargs

    * add aux_loss doku entry

    * apply docs suggestions

    ---------

    Co-authored-by: Clara Luise Pohland <[email protected]>

commit b5be100
Author: Mihir Prabhudesai <[email protected]>
Date:   Mon Jun 24 12:05:44 2024 -0400

    Added Reward Backpropogation Support  (#1585)

    * added alignprop template

    * added alignprop support

    * Update alignprop_trainer.mdx

    * Update alignprop_trainer.mdx

    * added better why statement

    * fixed inference code

    * changed self to pipeline

    * removed aesthetic classifier

    * added aesthetic to auxiliary models

    * added unseen prompt logging

    * removed unseen prompt log

    * fixed minor

    * remove not needed import in trl/__init__.py

    Co-authored-by: Younes Belkada <[email protected]>

    * fixed styling

    * updated _toctree

    ---------

    Co-authored-by: Younes Belkada <[email protected]>

commit 6e1652b
Author: Haoran Xu <[email protected]>
Date:   Sun Jun 23 09:54:30 2024 -0700

    Add CPO-SimPO method (#1760)

    * enable cpo-simpo

    * highlight SimPO and CPO-SimPO

    * add test for cpo_alpha

    * formatting

    * Update docs/source/cpo_trainer.mdx

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 65374c6
Author: Costa Huang <[email protected]>
Date:   Fri Jun 21 11:20:54 2024 -0400

    New sentiment and descriptiveness dataset (#1757)

    * push changes

    * handle edge cases where the chosen and the rejected are the same

commit 9956091
Author: Juyoung Suk <[email protected]>
Date:   Fri Jun 21 18:01:08 2024 +0900

    Add dataset_text_field in examples/scripts/sft.py (#1758)

commit 34d273f
Author: Costa Huang <[email protected]>
Date:   Thu Jun 20 13:16:43 2024 -0400

    Support num_train_epochs (#1743)

    * add a test case for num_train_epochs

    * fix ci

    * quick change

    * disable push to hub

    * debug windows ci

    * try another fix

    * skip subprocess tests on windows

commit 3bf9449
Author: Mert Sayar <[email protected]>
Date:   Thu Jun 20 18:22:20 2024 +0300

    Fix masking of response tokens (#1718)

    Current handling of `response_masks` inside `batch_forward_pass`
    function does not take padding into consideration which results with
    shape unmatch during masking. Since response mask is a mask tensor of
    response tokens, response tokens should not be concatenated with a
    `torch.zeros(query_length)` and masking operation should be done without
    slicing.

    Remove the concatenation of the response mask, remove the slicing from
    the response mask since response mask already has the length of `end -
    start + 1`, which is equal to length of `masks[j, start:end]`.

commit ba6abee
Author: idanshen <[email protected]>
Date:   Thu Jun 20 09:14:16 2024 -0400

    Support for returning past_key_values from the model (#1742)

    * add support for returning past_key_values from the model

    * change order of  keys

commit a57e759
Author: 1485840691 <[email protected]>
Date:   Wed Jun 19 18:02:51 2024 +0800

    Integrate f-divergence to DPO (Follow up) (#1610)

    * Step 1: update ppo_trainer and hello_world example

    * Step 2: Refine comments and add parameter type

    * Step 2: Add missing parameter comments

    * Step 1: Organize ptx loss into a function and add ptx_loss to train_stats

    * Step 1 updates: add comment to ptx_loss function, fix a bug and add warning message

    * Step 2: 1) Add ppo_ptx trainig example as ppo; 2) separate pretrain data fetch and iterate

    * Step 2: Remove loss from columns_to_log in ppo_ptx example

    * Remove data set revision in load imbd dataset

    * Run pre-commit and fix format issues

    * Initial draft of f-divergence fn

    * Update f-divergence to avoid overflow

    * fix test errors and comments

    * Add Unit tests for dpo loss with alpha and js div f

    * Adjust format

    * Fix test error

    * Reverse this update

    * Add test cases

    * Reverse un-needed updates

    * Update code style

    * Try to fix code fmt error

    * remove extra end line

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit ae23d40
Author: Shihyueh Hsu <[email protected]>
Date:   Tue Jun 18 22:07:24 2024 +0800

    change the `process` function in the example of DPO (#1753)

    * change the `process` function in the example of DPO

    * fix

commit 83b367b
Author: Younes Belkada <[email protected]>
Date:   Tue Jun 18 11:31:17 2024 +0200

    CI / `KTOTrainer`: Remove old tests (#1750)

    * remove old tests

    * remove datasets

    * Update test_dpo_trainer.py

    * Update test_dpo_trainer.py

commit d1ed730
Author: Michael <[email protected]>
Date:   Mon Jun 17 10:50:21 2024 -0400

    prepare deepspeed accomodate fp16 and bf16 (#1728)

    * prepare deepspeed accomodate fp16 and bf16

    * precommit

commit 8f8e95e
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:49:00 2024 +0200

    CPO / DPO: Fix red CI (#1749)

    * fix red CI

    * precommit

commit 4e23d95
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:41:36 2024 +0200

    fix red CI

commit 50c4620
Author: Kawin <[email protected]>
Date:   Mon Jun 17 07:14:44 2024 -0700

    small KTO fixes (#1734)

    * add warning for imbalanced data

    * update documentation

    * update script commands to be same as in dpo

    * use batch_size KL examples and batch_size target examples to calculate batch_size losses

    * fix deepspeed issue

    * speed up forward with no_grad for KL

    * add some removed metrics

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    add reference to paper

    Co-authored-by: lewtun <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * add more detailed comments

    * convert assert to ValueError

    * Update kto_trainer.py

    * precommit formatting

    * remove nans in metrics by gathering across machines

    * fix formatting

    * fix choice of mismatched examples for KL term

    * describe weights

    * fix hanging issue in distributed training

    * linting

    * move metrics to cpu

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: lewtun <[email protected]>

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * remove kto_pair

    * speed up data processing

    * move bco code inside

    * raise error for kto_pair argument

    * fix formatting

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>
    Co-authored-by: lewtun <[email protected]>
    Co-authored-by: Winnie Xu <[email protected]>

commit 6105d03
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:01:06 2024 +0200

    `TrlParser`: Add ignore extra args option (#1748)

    * add ignore extra args option

    * Update trl/commands/cli_utils.py

commit e247bbd
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 15:16:07 2024 +0200

    CI / core: Pin `numpy` to `!=2.0.0` for CI and to users (#1747)

    * Update setup.py

    * Update setup.py

    * Update setup.py

    * Update test_best_of_n_sampler.py

    dummy commit

    * pin numpy

    * Update tests/test_best_of_n_sampler.py

    * Update setup.py

commit 3d04496
Author: Michael <[email protected]>
Date:   Mon Jun 17 08:43:33 2024 -0400

    better trl parser with yaml config (#1739)

    * working trl parser with config

    correctly overrides yaml config with command line arguments
    adds return_remaining_strings
    when return_remaining_strings is False, raises error if yaml contains
    extra args that are not in the dataclasses
    simpler and cleaner than previous yaml parsing and merging
    addresses #1733

    * lowercase trlparser

commit 2d244f8
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 11:56:13 2024 +0200

    Workflow: Notify tests results on slack channel (#1744)

    * Update tests-main.yml

    * Update docker-build.yml

commit f5168fd
Author: Igor Melnyk <[email protected]>
Date:   Wed Jun 12 05:54:54 2024 -0400

    adds AOT (#1701)

    * adds AOT

    * Applied format changes

    * added docs and tests

    ---------

    Co-authored-by: Igor Melnyk <[email protected]>

commit 79686e1
Author: jetlime <[email protected]>
Date:   Wed Jun 12 00:35:31 2024 +1000

    ktotrainer: Refuse datasets which contain only one class of labels (#1724)

    * ktotrainer: refuse dataset which contain only one class of labels

    * ktotrainer: document new dataset constraint

commit 34ebc4c
Author: Luc Georges <[email protected]>
Date:   Mon Jun 10 11:17:54 2024 +0200

    feat(ci): add trufflehog secrets detection (#1721)

    * feat(ci): add trufflehog secrets detection

    * fix(ci): remove unnecessary permissions

commit 1d84e2b
Author: Michael <[email protected]>
Date:   Fri Jun 7 11:42:08 2024 +0200

    Fix default padding_value in dpo_config.py (#1692)

    dpo_config default padding value should be None, not 0, otherwise it by default overrides the padding value of any tokenizer to 0

commit 2f71b8b
Author: Michael <[email protected]>
Date:   Fri Jun 7 10:37:27 2024 +0200

    fix yaml parser for derived config classes (#1713)

    fixes #1712
    reformatted cli_utils with ruff

commit 5bcb8ad
Author: Kashif Rasul <[email protected]>
Date:   Fri Jun 7 08:48:17 2024 +0100

    RDPO fix nll loss (#1705)

commit b8b972f
Author: Haoran Xu <[email protected]>
Date:   Thu Jun 6 14:06:47 2024 -0700

    Add a variant of CPO, SimPO (#1703)

    * add a variant of cpo: simpo

    * correct cpo-simpo loss

    * avoid 0 int error in logging

    * add simpo description

    * Update trl/trainer/cpo_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * fix formatting

    * add test for simpo

    * Update docs/source/cpo_trainer.mdx

    Co-authored-by: Kashif Rasul <[email protected]>

    * add a docstring for simpogamma

    * move simpo description to the above docstring

    * change simpo description in the doc

    * formatting

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 3eb9ccb
Author: Younes Belkada <[email protected]>
Date:   Thu Jun 6 19:33:20 2024 +0200

    set dev version (#1710)

    * Update setup.py

    * Update __init__.py

commit 974b0d3
Author: Costa Huang <[email protected]>
Date:   Thu Jun 6 10:13:00 2024 -0400

    0.9.4 release (#1708)

commit 39a7d1c
Author: Younes Belkada <[email protected]>
Date:   Thu Jun 6 15:50:17 2024 +0200

    SFTTrainer: Fix backward Compatibility issue with `TrainingArguments` (#1707)

    * fix BC

    * fixup

commit 0bdc638
Author: Guilherme Freire <[email protected]>
Date:   Thu Jun 6 14:42:58 2024 +0100

    Fixed doc string and docs for the SFTConfig update (#1706)

commit 275d33b
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 14:34:59 2024 -0400

    0.9.3 release (#1699)

commit c0819ee
Author: Younes Belkada <[email protected]>
Date:   Wed Jun 5 17:29:03 2024 +0200

    Update sft_trainer.py (#1698)

commit a03e7cc
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 11:00:19 2024 -0400

    Release 0.9.2 (#1697)

    * Release: 0.9.0

    * Release

commit a13cb89
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 10:20:54 2024 -0400

    Quick fix on GPT4-eval (#1696)

    * quick fix

    * precommit

commit 84156f1
Author: Quentin Gallouédec <[email protected]>
Date:   Mon Jun 3 20:09:05 2024 +0200

    Fix typo in DPOTrainer's warnings (#1688)

commit 4eb0b90
Author: Alex Brooks <[email protected]>
Date:   Mon Jun 3 10:24:32 2024 -0600

    Skip packing validation (#1673)

    * Add test for skipping preproc if packing=True

    Signed-off-by: Alex-Brooks <[email protected]>

    * Allow skipping of validation for packing=True

    Signed-off-by: Alex-Brooks <[email protected]>

    * Use dummy dataset in no packing preproc test

    Signed-off-by: Alex-Brooks <[email protected]>

    ---------

    Signed-off-by: Alex-Brooks <[email protected]>

commit 6c203f9
Author: Alexey Rozhkov <[email protected]>
Date:   Mon Jun 3 10:16:22 2024 +0100

    Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig (#1690)

    * Don't override optimize_device_cache when optimize_cuda_cache is not provided
    Raise an exception when both optimize_cuda_cache and optimize_device_cache are set

    * Minor fix

commit f18253b
Author: Kashif Rasul <[email protected]>
Date:   Mon Jun 3 09:43:02 2024 +0100

    intial RPO loss (#1686)

    * intial RPO loss

    * fix sign

    * clean up

commit 151a452
Author: Samuel <[email protected]>
Date:   Wed May 29 20:29:38 2024 +0200

    Fix max completion length (#1588)

commit 488b502
Author: Younes Belkada <[email protected]>
Date:   Wed May 29 20:19:26 2024 +0200

    fix (#1678)

commit 3c0a10b
Author: Wang, Yi <[email protected]>
Date:   Mon May 27 20:52:20 2024 +0800

    fix dataset load error (#1670)

    Signed-off-by: Wang, Yi <[email protected]>

commit b031adf
Author: Younes Belkada <[email protected]>
Date:   Fri May 24 15:20:16 2024 +0200

    FIX / PPO: Fix `enable_input_require_grads` issues with PPO models (#1664)

    * Update modeling_base.py

    * Update ppo_config.py

    * Update ppo_trainer.py

    * style

commit e7cb597
Author: Costa Huang <[email protected]>
Date:   Thu May 23 11:37:16 2024 -0400

    Fix ppov2 test case (#1661)

    * Fix PPOv2 / RLOO refactor's stuff

    * update terminology to use stop token

commit bc8dfbf
Author: Kashif Rasul <[email protected]>
Date:   Thu May 23 15:28:04 2024 +0200

    update eval_strategy (#1662)

commit e4ed7a3
Author: Sourab Mangrulkar <[email protected]>
Date:   Thu May 23 18:34:22 2024 +0530

    do not upcast adapters when using FSDP+QLoRA (#1654)

commit 9a7efbd
Author: syrn1k <[email protected]>
Date:   Thu May 23 15:58:49 2024 +0300

    🤫 TR-DPO implementation (#1593)

    * 🤫 TR-DPO implementation baseline

    * fix comments

    * docs

    * fix linters

    * test added

    * move configs to DPOConfig

    * fix typo

    * add docs

    * fix import

    * use state.global_step

    * fix order of arguments

    * make sure plugins are not none

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <[email protected]>

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <[email protected]>

    * checking that reference model weights have changed

    * sync_target_model as staticmethod

    * set reference model

    ---------

    Co-authored-by: Nikita Surnachev <[email protected]>
    Co-authored-by: Kashif Rasul <[email protected]>
    Co-authored-by: Benjamin Bossan <[email protected]>

commit b344bce
Author: Anush Kini <[email protected]>
Date:   Thu May 23 18:27:25 2024 +0530

    [DPO] Add 'robust' loss_type (#1653)

    * Initial commit

    * pre-commit fix

    * Minor change to comments

    * Added some documentation on how to use Robust DPO

commit 35e12dc
Author: Nicolinho <[email protected]>
Date:   Thu May 23 14:36:15 2024 +0200

    Fix inheritance order in PPOv2Config (#1659)

    * fix inheritance order in PPOv2Config

    * fix inheritance order in rloo_config

commit 1da6be1
Author: Ali Bakly <[email protected]>
Date:   Thu May 23 14:10:29 2024 +0200

    docs: correct cDPO usage in DPOTrainer (#1655)

commit e249cd8
Author: Younes Belkada <[email protected]>
Date:   Thu May 23 14:10:05 2024 +0200

    add support for training collator (#1658)

commit a02513c
Author: Zach Mueller <[email protected]>
Date:   Thu May 23 06:48:00 2024 -0400

    Apply deprecated `evaluation_strategy` (#1559)

    * Deprecate

    * Update tests/test_dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 13454d2
Author: Costa Huang <[email protected]>
Date:   Wed May 22 08:31:10 2024 -0400

    PPO / Reinforce Trainers (#1540)

    * Add ppov2 trainer

    * make eos trick optional, remove unused args

    * quick fix

    * precommit

    * update debugging script

    * fix out of bound `drop_last=True`; use built-in scheduler

    * Add PPO examples

    * push changes

    * quick change

    * quick change

    * various bug fixes

    * remove unnecessary grad accumulation setting

    * push new changes

    * fix DS3 model saving

    * update ppo.py

    * refactor

    * quick change

    * refactor

    * update ppo trainer

    * refactor

    * quick test

    * add ds2 /ds3 7 processes config

    * add vllm trainer

    * quick change

    * experiment with reward normalization

    * push changes

    * quick push

    * push changes

    * push various changes

    * refactor to use ModelConfig

    * quick change

    * refactor

    * refactor

    * Simplify DS logic

    * quick update

    * remove unnecessary files

    * precommit

    * deepspeed fix; handle edge case when eos_token_id = 0

    * add PPO tldr example

    * add TL;DR example

    * fix undefined var

    * utilize all samples in rloo

    * quick setting

    * remove the unnecessary `value_model`

    * use exact_div

    * allow saving the deepspeed model

    * refactor

    * remove dead code

    * Use some shared utilities

    * add some end-to-end test cases

    * add PPOv2 docs and RLOO docs / tests

    * update docs

    * quikc push

    * fix ci

    * fix type annotation for ci

    * quick update

    * update trainer docs

commit 99f2c94
Author: Sourab Mangrulkar <[email protected]>
Date:   Wed May 15 19:55:46 2024 +0530

    don't cast the trainable lora layers to half precision (#1644)

    * don't cast the trainable lora layers to half precision

    * quality

commit 6401d08
Author: Wing Lian <[email protected]>
Date:   Tue May 14 09:41:07 2024 -0400

    Pairwise Noise Contrastive Alignment (#1632)

    * add NCA paired preference loss

    * chore: lint

    * set more lenient tolerance for integration tests

    * Update tests/test_dpo_trainer.py

    * skip test

    * fix

    ---------

    Co-authored-by: Younes Belkada <[email protected]>
    Co-authored-by: younesbelkada <[email protected]>

commit d632a5b
Author: bartoszzuk <[email protected]>
Date:   Tue May 14 12:25:54 2024 +0200

    Fixed wrong logs prefixes in KTOTrainer (#1641)

    * Fixed wrong logs prefixes in KTOTrainer

    * Pre-commit formating

commit 5aeb752
Author: Tiezhen WANG <[email protected]>
Date:   Fri May 10 23:19:15 2024 +0800

    Update sft_llama2.py to work with the latest API (#1637)

    * Update sft_llama2.py to work with the latest API

    SFTTrainer now takes a STFConfig argument

    * Update dpo_llama2.py

    * precommit

commit b8b8978
Author: Ilya Gusev <[email protected]>
Date:   Fri May 10 15:43:13 2024 +0200

    [ORPO] Correct label mask for pad tokens (#1625)

    * [ORPO] Correct label mask for pad tokens

    Recent [fix](57aebe9) for calculating NLL loss for a whole sequence introduced a bug. When input_ids are copied to labels, pad tokens are not masked.

    This PR aims to path this by masking labels based on the attention mask.

    * -100 -> label_pad_token_id

    Co-authored-by: Kashif Rasul <[email protected]>

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 8799952
Author: Costa Huang <[email protected]>
Date:   Fri May 10 09:32:20 2024 -0400

    visualize rm prediction (#1636)

    * visualize rm prediction

    * quick update

    * quick check

    * quick fix

    * update eval steps

commit 3b4c249
Author: Xiao Yu <[email protected]>
Date:   Fri May 3 18:19:35 2024 -0400

    fixed adding bos and eos token unconditionally (#1591)

    * fixed adding bos and eos token unconditionally

    * fixed typo of tokenizer -> self.tokenizer. Also added update to ORPO

    * fixed code quality, and added BOS/EOS fix to KTO

    * code reformatting with pre-commit run --all-files

    * bug fix: check input id length before checking for EOS/BOS

commit 0347f58
Author: lewtun <[email protected]>
Date:   Fri May 3 15:59:59 2024 +0200

    Fix ZeRO-3 generation context manager (#1617)

qgallouedec added a commit that referenced this pull request Jul 18, 2024
* Add WinRateCallback

* Enable PairRM

* Refactor

* Streamline

* Add HF judge

* Add base judge

* Use better prompt

* Clean

* Add max tokens

* Use logging

* Add batched inference

* Squashed commit of the following:

commit 9e9dc96
Author: Maxim Kopecki <[email protected]>
Date:   Wed Jul 10 19:11:13 2024 +0200

    Added missing token kwarg in Peft model loading (#1825)

commit 7ddef5c
Author: Quentin Gallouédec <[email protected]>
Date:   Wed Jul 10 18:26:11 2024 +0200

    Make use of `trust_remote_code` consistent (#1806)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit a9cddf8
Author: Adnan Khan <[email protected]>
Date:   Wed Jul 10 11:25:07 2024 -0400

    Delete unused benchmark.yml workflow. (#1822)

commit 2860ce5
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jul 9 09:22:52 2024 +0200

    DPO Llava 1.5 and PaliGemma support (#1797)

    * llava support dpo

    * add_special_tokens=False only when possible

    * format

    * pali gemma

    * refactor size

    * remove image resize

    ---------

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 30e33bd
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jul 9 05:37:12 2024 +0200

    upgrade gh actions (#1818)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit d5a0d2d
Author: Costa Huang <[email protected]>
Date:   Mon Jul 8 11:12:41 2024 -0400

    Set dev version (#1817)

commit 314e8eb
Author: Puneet Singh Bhooi <[email protected]>
Date:   Mon Jul 8 19:11:36 2024 +0530

    fix broken url in `docs\source\index.mdx` (#1813)

commit e107920
Author: Costa Huang <[email protected]>
Date:   Mon Jul 8 09:38:09 2024 -0400

    0.9.6 release (#1816)

commit 78045de
Author: Alvaro Bartolome <[email protected]>
Date:   Mon Jul 8 01:59:26 2024 +0200

    Fix `TRL_USE_RICH` environment variable handling (#1808)

    * Add `strtobool` custom implementation from `distutils`

    * Fix `TRL_USE_RICH` handling via `strtobool`

    * Run `make precommit`

commit 747612f
Author: Alvaro Bartolome <[email protected]>
Date:   Fri Jul 5 16:28:59 2024 +0200

    Fix `torch_dtype` handling in `{DPO,SFT}Trainer` when provided via CLI (#1807)

    * Fix `torch_dtype` handling through CLI

    The `torch_dtype` is not properly handled when provided via the TRL CLI
    since it's provided initially as a string, but is then casted to
    `torch.dtype` before providing it to the `{DPO,SFT}Trainer`, which means
    that those trainers should handle the scenario where `torch_dtype` is a
    `torch.dtype` too.

    * Add `torch_dtype` tests in `test_{dpo,sft}_trainer.py`

    * Forward contribution credits

    * Run `make precommit`

    ---------

    Co-authored-by: Tash Srivastava <[email protected]>

commit 9e3a35b
Author: Michael <[email protected]>
Date:   Fri Jul 5 07:29:48 2024 -0400

    Remove extra print in reward_trainer.py (#1799)

    `print_rich_table` is called twice and the first call doesn't restrict to `num_print_samples`. Remove the first, extra call

commit 4402b36
Author: Quentin Gallouédec <[email protected]>
Date:   Thu Jul 4 14:29:25 2024 +0200

    clean examples (#1791)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 78f8228
Author: Noah Tye <[email protected]>
Date:   Wed Jul 3 11:10:50 2024 -0700

    Bugfix: Preserve token fields when converting TrainingArguments to SFTConfig (#1794)

    * Preserve token fields when converting TrainingArguments to SFTConfig

    TrainingArguments.to_dict() redacts token fields, so we have to
    individually copy them over when converting to SFTConfig to avoid
    breaking push_to_hub functionality.

    Also adds a test.

    * run precommit

    * one-line args_as_dict definition per suggestion from kashif

    * generalize token copying to match TrainingArguments behavior

    * unwrap |= on dict, to support python 3.8

    * use .update instead of |= or for-loop

commit b6af2ed
Author: Kashif Rasul <[email protected]>
Date:   Wed Jul 3 08:29:16 2024 +0200

    add model_init_kwargs to training_args (#1787)

commit cd85b14
Author: Tommaso Buonocore <[email protected]>
Date:   Sat Jun 29 15:35:48 2024 +0200

    Fixed typo in SFT trainer docs (#1788)

    'STFConfig' instead of 'SFTConfig' appears multiple times in the doc, causing error when running the code snippets.

commit a57544f
Author: Kashif Rasul <[email protected]>
Date:   Thu Jun 27 15:47:58 2024 +0200

    fix docs and examples (#1780)

commit b68ff96
Author: Quentin Gallouédec <[email protected]>
Date:   Wed Jun 26 16:26:37 2024 +0200

    Visual DPO (#1647)

    * Remove extra whitespaces

    * idefics

    * vdpo

    * sft idefics

    * pad with test

    * use prompt instead of tokenizer

    * rm name main

    * support vlm in tokenize row

    * temp fix for regex in lora_target_module

    * format

    * vdpo

    * tmp float16 hard code

    * concatenated_forward support for vision

    * style and new command line

    * all-linear

    * format

    * delete old examples

    * get image

    * upcast

    * new test

    * modified test

    * new strat for tokenizer

    * rm token transfer

    * integrate vision in dpo example

    * format

    * add FDivergenceType back

    * precommit

    * pillow test dep

    * optional prompt

    * `evaluation_strategy` to `eval_strategy`

    * revert vsft change (oos)

    * update test

    * test

    * comment and support more in process

    * update process

    * update doc for vdpo

    * caution about limited support

    * Update docs/source/dpo_trainer.mdx

    Co-authored-by: Kashif Rasul <[email protected]>

    * revert DPO example changes

    * cleaner way to check if a model is vision

    * comment

    * update vdpo example

    * rename

    ---------

    Co-authored-by: Quentin Gallouédec <[email protected]>
    Co-authored-by: Kashif Rasul <[email protected]>

commit c8c01cc
Author: Mubin Manasia <[email protected]>
Date:   Wed Jun 26 03:23:36 2024 -0600

    Fix Documentation Overflow Issues for Long URLs in SFTConfig (#1774)

    * Update sft_config.py

    * Update sft_config.py

commit 3479606
Author: Costa Huang <[email protected]>
Date:   Wed Jun 26 03:18:22 2024 -0400

    Remove the leading space in the tldr preference dataset (#1773)

commit 7965b78
Author: Haozhe Ji <[email protected]>
Date:   Tue Jun 25 22:47:32 2024 +0800

    add Efficient Exact Optimization (EXO) (#1735)

    * add exo

    * fix a detail

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    * Update trl/trainer/dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 56bd1bb
Author: Quentin Gallouédec <[email protected]>
Date:   Tue Jun 25 16:14:26 2024 +0200

    `evaluation_strategy` to `eval_strategy` (#1771)

    Co-authored-by: Quentin Gallouédec <[email protected]>

commit 94d53e6
Author: Clara Pohland <[email protected]>
Date:   Mon Jun 24 21:27:00 2024 +0200

    MoE Models: option to add load balancing loss (#1765)

    * KTO: add aux loss

    * use router_aux_loss_coef in KtoTrainer when aux_loss enabled

    * align optional aux_loss in DPO, KTO, CPO, ORPO

    * precommit changes

    * fix KL forward kwargs

    * add aux_loss doku entry

    * apply docs suggestions

    ---------

    Co-authored-by: Clara Luise Pohland <[email protected]>

commit b5be100
Author: Mihir Prabhudesai <[email protected]>
Date:   Mon Jun 24 12:05:44 2024 -0400

    Added Reward Backpropogation Support  (#1585)

    * added alignprop template

    * added alignprop support

    * Update alignprop_trainer.mdx

    * Update alignprop_trainer.mdx

    * added better why statement

    * fixed inference code

    * changed self to pipeline

    * removed aesthetic classifier

    * added aesthetic to auxiliary models

    * added unseen prompt logging

    * removed unseen prompt log

    * fixed minor

    * remove not needed import in trl/__init__.py

    Co-authored-by: Younes Belkada <[email protected]>

    * fixed styling

    * updated _toctree

    ---------

    Co-authored-by: Younes Belkada <[email protected]>

commit 6e1652b
Author: Haoran Xu <[email protected]>
Date:   Sun Jun 23 09:54:30 2024 -0700

    Add CPO-SimPO method (#1760)

    * enable cpo-simpo

    * highlight SimPO and CPO-SimPO

    * add test for cpo_alpha

    * formatting

    * Update docs/source/cpo_trainer.mdx

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 65374c6
Author: Costa Huang <[email protected]>
Date:   Fri Jun 21 11:20:54 2024 -0400

    New sentiment and descriptiveness dataset (#1757)

    * push changes

    * handle edge cases where the chosen and the rejected are the same

commit 9956091
Author: Juyoung Suk <[email protected]>
Date:   Fri Jun 21 18:01:08 2024 +0900

    Add dataset_text_field in examples/scripts/sft.py (#1758)

commit 34d273f
Author: Costa Huang <[email protected]>
Date:   Thu Jun 20 13:16:43 2024 -0400

    Support num_train_epochs (#1743)

    * add a test case for num_train_epochs

    * fix ci

    * quick change

    * disable push to hub

    * debug windows ci

    * try another fix

    * skip subprocess tests on windows

commit 3bf9449
Author: Mert Sayar <[email protected]>
Date:   Thu Jun 20 18:22:20 2024 +0300

    Fix masking of response tokens (#1718)

    Current handling of `response_masks` inside `batch_forward_pass`
    function does not take padding into consideration which results with
    shape unmatch during masking. Since response mask is a mask tensor of
    response tokens, response tokens should not be concatenated with a
    `torch.zeros(query_length)` and masking operation should be done without
    slicing.

    Remove the concatenation of the response mask, remove the slicing from
    the response mask since response mask already has the length of `end -
    start + 1`, which is equal to length of `masks[j, start:end]`.

commit ba6abee
Author: idanshen <[email protected]>
Date:   Thu Jun 20 09:14:16 2024 -0400

    Support for returning past_key_values from the model (#1742)

    * add support for returning past_key_values from the model

    * change order of  keys

commit a57e759
Author: 1485840691 <[email protected]>
Date:   Wed Jun 19 18:02:51 2024 +0800

    Integrate f-divergence to DPO (Follow up) (#1610)

    * Step 1: update ppo_trainer and hello_world example

    * Step 2: Refine comments and add parameter type

    * Step 2: Add missing parameter comments

    * Step 1: Organize ptx loss into a function and add ptx_loss to train_stats

    * Step 1 updates: add comment to ptx_loss function, fix a bug and add warning message

    * Step 2: 1) Add ppo_ptx trainig example as ppo; 2) separate pretrain data fetch and iterate

    * Step 2: Remove loss from columns_to_log in ppo_ptx example

    * Remove data set revision in load imbd dataset

    * Run pre-commit and fix format issues

    * Initial draft of f-divergence fn

    * Update f-divergence to avoid overflow

    * fix test errors and comments

    * Add Unit tests for dpo loss with alpha and js div f

    * Adjust format

    * Fix test error

    * Reverse this update

    * Add test cases

    * Reverse un-needed updates

    * Update code style

    * Try to fix code fmt error

    * remove extra end line

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit ae23d40
Author: Shihyueh Hsu <[email protected]>
Date:   Tue Jun 18 22:07:24 2024 +0800

    change the `process` function in the example of DPO (#1753)

    * change the `process` function in the example of DPO

    * fix

commit 83b367b
Author: Younes Belkada <[email protected]>
Date:   Tue Jun 18 11:31:17 2024 +0200

    CI / `KTOTrainer`: Remove old tests (#1750)

    * remove old tests

    * remove datasets

    * Update test_dpo_trainer.py

    * Update test_dpo_trainer.py

commit d1ed730
Author: Michael <[email protected]>
Date:   Mon Jun 17 10:50:21 2024 -0400

    prepare deepspeed accomodate fp16 and bf16 (#1728)

    * prepare deepspeed accomodate fp16 and bf16

    * precommit

commit 8f8e95e
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:49:00 2024 +0200

    CPO / DPO: Fix red CI (#1749)

    * fix red CI

    * precommit

commit 4e23d95
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:41:36 2024 +0200

    fix red CI

commit 50c4620
Author: Kawin <[email protected]>
Date:   Mon Jun 17 07:14:44 2024 -0700

    small KTO fixes (#1734)

    * add warning for imbalanced data

    * update documentation

    * update script commands to be same as in dpo

    * use batch_size KL examples and batch_size target examples to calculate batch_size losses

    * fix deepspeed issue

    * speed up forward with no_grad for KL

    * add some removed metrics

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    add reference to paper

    Co-authored-by: lewtun <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * add more detailed comments

    * convert assert to ValueError

    * Update kto_trainer.py

    * precommit formatting

    * remove nans in metrics by gathering across machines

    * fix formatting

    * fix choice of mismatched examples for KL term

    * describe weights

    * fix hanging issue in distributed training

    * linting

    * move metrics to cpu

    * Update trl/trainer/kto_trainer.py

    Co-authored-by: lewtun <[email protected]>

    * Update trl/trainer/kto_trainer.py

    * Update trl/trainer/kto_trainer.py

    * remove kto_pair

    * speed up data processing

    * move bco code inside

    * raise error for kto_pair argument

    * fix formatting

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>
    Co-authored-by: lewtun <[email protected]>
    Co-authored-by: Winnie Xu <[email protected]>
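
As a rough illustration of the KL handling mentioned in the KTO commit above, the sketch below pairs each prompt with a mismatched completion (rolled by one position) and runs the KL forward pass under `no_grad`; the `compute_logps` helper and the roll-by-one pairing are assumptions for this sketch, not the trainer's exact code:

```python
import torch

def compute_logps(prompts: torch.Tensor, completions: torch.Tensor) -> torch.Tensor:
    # Stand-in for a model forward pass returning one log-probability per example.
    return -torch.rand(prompts.shape[0])

prompts = torch.arange(4).unsqueeze(1)      # batch of 4 dummy prompts
completions = torch.arange(4).unsqueeze(1)  # their matching dummy completions

# Mismatched pairs for the KL reference point: roll the completions by one so
# every prompt is paired with a completion taken from a different example.
kl_completions = torch.roll(completions, shifts=1, dims=0)

# The KL pass needs no gradients, so running it under no_grad saves time and memory.
with torch.no_grad():
    kl_logps = compute_logps(prompts, kl_completions)

policy_logps = compute_logps(prompts, completions)
print(policy_logps.shape, kl_logps.shape)  # batch_size losses from batch_size KL examples
```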

commit 6105d03
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 16:01:06 2024 +0200

    `TrlParser`: Add ignore extra args option (#1748)

    * add ignore extra args option

    * Update trl/commands/cli_utils.py

commit e247bbd
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 15:16:07 2024 +0200

    CI / core: Pin `numpy` to `!=2.0.0` for CI and to users (#1747)

    * Update setup.py

    * Update setup.py

    * Update setup.py

    * Update test_best_of_n_sampler.py

    dummy commit

    * pin numpy

    * Update tests/test_best_of_n_sampler.py

    * Update setup.py

commit 3d04496
Author: Michael <[email protected]>
Date:   Mon Jun 17 08:43:33 2024 -0400

    better trl parser with yaml config (#1739)

    * working trl parser with config

    correctly overrides yaml config with command line arguments
    adds return_remaining_strings
    when return_remaining_strings is False, raises error if yaml contains
    extra args that are not in the dataclasses
    simpler and cleaner than previous yaml parsing and merging
    addresses #1733

    * lowercase trlparser
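
A minimal sketch of the override behaviour described in this commit, written with plain `argparse` and `dataclasses` rather than the actual `TrlParser` internals; the `parse` helper and its arguments are purely illustrative:

```python
import argparse
import dataclasses

@dataclasses.dataclass
class TrainArgs:
    learning_rate: float = 1e-5
    num_epochs: int = 1

def parse(yaml_config: dict, cli_args: list[str], return_remaining_strings: bool = False) -> TrainArgs:
    known = {f.name for f in dataclasses.fields(TrainArgs)}
    extra = set(yaml_config) - known
    if extra and not return_remaining_strings:
        # Unknown YAML keys are an error unless the caller asks for them back.
        raise ValueError(f"Config contains keys not in the dataclass: {sorted(extra)}")
    parser = argparse.ArgumentParser()
    for field in dataclasses.fields(TrainArgs):
        # The YAML value becomes the default; an explicit CLI flag overrides it.
        parser.add_argument(f"--{field.name}", type=field.type,
                            default=yaml_config.get(field.name, field.default))
    namespace, _remaining = parser.parse_known_args(cli_args)
    return TrainArgs(**{k: v for k, v in vars(namespace).items() if k in known})

# CLI beats YAML: learning_rate comes from the flag, num_epochs from the "YAML" config.
print(parse({"learning_rate": 2e-5, "num_epochs": 3}, ["--learning_rate", "5e-5"]))
```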

commit 2d244f8
Author: Younes Belkada <[email protected]>
Date:   Mon Jun 17 11:56:13 2024 +0200

    Workflow: Notify tests results on slack channel (#1744)

    * Update tests-main.yml

    * Update docker-build.yml

commit f5168fd
Author: Igor Melnyk <[email protected]>
Date:   Wed Jun 12 05:54:54 2024 -0400

    adds AOT (#1701)

    * adds AOT

    * Applied format changes

    * added docs and tests

    ---------

    Co-authored-by: Igor Melnyk <[email protected]>

commit 79686e1
Author: jetlime <[email protected]>
Date:   Wed Jun 12 00:35:31 2024 +1000

    ktotrainer: Refuse datasets which contain only one class of labels (#1724)

    * ktotrainer: refuse datasets which contain only one class of labels

    * ktotrainer: document new dataset constraint
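
A minimal sketch of the dataset constraint this commit adds; the `check_kto_labels` helper is hypothetical, not the actual KTOTrainer code:

```python
def check_kto_labels(labels: list[bool]) -> None:
    # KTO needs both desirable (True) and undesirable (False) examples;
    # a dataset with only one class of labels is refused.
    if len(set(labels)) < 2:
        raise ValueError(
            "The KTO dataset must contain both desirable and undesirable examples, "
            f"but only found labels: {set(labels)}"
        )

check_kto_labels([True, False, True])  # passes; an all-True list would raise
```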

commit 34ebc4c
Author: Luc Georges <[email protected]>
Date:   Mon Jun 10 11:17:54 2024 +0200

    feat(ci): add trufflehog secrets detection (#1721)

    * feat(ci): add trufflehog secrets detection

    * fix(ci): remove unnecessary permissions

commit 1d84e2b
Author: Michael <[email protected]>
Date:   Fri Jun 7 11:42:08 2024 +0200

    Fix default padding_value in dpo_config.py (#1692)

    The dpo_config default padding value should be None, not 0; otherwise it overrides the padding value of any tokenizer with 0 by default
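
In other words, the padding value should fall back to the tokenizer only when it is left unset. A hedged sketch of that fallback (the function name is illustrative):

```python
from typing import Optional

def resolve_padding_value(config_padding_value: Optional[int], tokenizer_pad_token_id: int) -> int:
    # Only fall back to the tokenizer's pad token when the config value is
    # genuinely unset (None); a default of 0 would silently override it.
    if config_padding_value is not None:
        return config_padding_value
    return tokenizer_pad_token_id

print(resolve_padding_value(None, 32000))  # -> 32000, the tokenizer's pad token
print(resolve_padding_value(0, 32000))     # -> 0, an explicit user choice
```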

commit 2f71b8b
Author: Michael <[email protected]>
Date:   Fri Jun 7 10:37:27 2024 +0200

    fix yaml parser for derived config classes (#1713)

    fixes #1712
    reformatted cli_utils with ruff

commit 5bcb8ad
Author: Kashif Rasul <[email protected]>
Date:   Fri Jun 7 08:48:17 2024 +0100

    RDPO fix nll loss (#1705)

commit b8b972f
Author: Haoran Xu <[email protected]>
Date:   Thu Jun 6 14:06:47 2024 -0700

    Add a variant of CPO, SimPO (#1703)

    * add a variant of cpo: simpo

    * correct cpo-simpo loss

    * avoid 0 int error in logging

    * add simpo description

    * Update trl/trainer/cpo_trainer.py

    Co-authored-by: Kashif Rasul <[email protected]>

    * fix formatting

    * add test for simpo

    * Update docs/source/cpo_trainer.mdx

    Co-authored-by: Kashif Rasul <[email protected]>

    * add a docstring for simpogamma

    * move simpo description to the above docstring

    * change simpo description in the doc

    * formatting

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 3eb9ccb
Author: Younes Belkada <[email protected]>
Date:   Thu Jun 6 19:33:20 2024 +0200

    set dev version (#1710)

    * Update setup.py

    * Update __init__.py

commit 974b0d3
Author: Costa Huang <[email protected]>
Date:   Thu Jun 6 10:13:00 2024 -0400

    0.9.4 release (#1708)

commit 39a7d1c
Author: Younes Belkada <[email protected]>
Date:   Thu Jun 6 15:50:17 2024 +0200

    SFTTrainer: Fix backward Compatibility issue with `TrainingArguments` (#1707)

    * fix BC

    * fixup

commit 0bdc638
Author: Guilherme Freire <[email protected]>
Date:   Thu Jun 6 14:42:58 2024 +0100

    Fixed doc string and docs for the SFTConfig update (#1706)

commit 275d33b
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 14:34:59 2024 -0400

    0.9.3 release (#1699)

commit c0819ee
Author: Younes Belkada <[email protected]>
Date:   Wed Jun 5 17:29:03 2024 +0200

    Update sft_trainer.py (#1698)

commit a03e7cc
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 11:00:19 2024 -0400

    Release 0.9.2 (#1697)

    * Release: 0.9.0

    * Release

commit a13cb89
Author: Costa Huang <[email protected]>
Date:   Wed Jun 5 10:20:54 2024 -0400

    Quick fix on GPT4-eval (#1696)

    * quick fix

    * precommit

commit 84156f1
Author: Quentin Gallouédec <[email protected]>
Date:   Mon Jun 3 20:09:05 2024 +0200

    Fix typo in DPOTrainer's warnings (#1688)

commit 4eb0b90
Author: Alex Brooks <[email protected]>
Date:   Mon Jun 3 10:24:32 2024 -0600

    Skip packing validation (#1673)

    * Add test for skipping preproc if packing=True

    Signed-off-by: Alex-Brooks <[email protected]>

    * Allow skipping of validation for packing=True

    Signed-off-by: Alex-Brooks <[email protected]>

    * Use dummy dataset in no packing preproc test

    Signed-off-by: Alex-Brooks <[email protected]>

    ---------

    Signed-off-by: Alex-Brooks <[email protected]>

commit 6c203f9
Author: Alexey Rozhkov <[email protected]>
Date:   Mon Jun 3 10:16:22 2024 +0100

    Fix overriding optimize_device_cache with optimize_cuda_cache in PPOConfig (#1690)

    * Don't override optimize_device_cache when optimize_cuda_cache is not provided
    Raise an exception when both optimize_cuda_cache and optimize_device_cache are set

    * Minor fix
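
A sketch of the intended precedence between the two flags; `resolve_cache_flag` is illustrative, not the exact PPOConfig code:

```python
from typing import Optional

def resolve_cache_flag(optimize_device_cache: bool, optimize_cuda_cache: Optional[bool]) -> bool:
    # optimize_cuda_cache is the deprecated alias: when it is not provided it
    # must not silently override optimize_device_cache, and setting both is an error.
    if optimize_cuda_cache is None:
        return optimize_device_cache
    if optimize_device_cache:
        raise ValueError(
            "Both `optimize_device_cache` and the deprecated `optimize_cuda_cache` were set; "
            "use only `optimize_device_cache`."
        )
    return optimize_cuda_cache

print(resolve_cache_flag(True, None))  # keeps optimize_device_cache=True
```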

commit f18253b
Author: Kashif Rasul <[email protected]>
Date:   Mon Jun 3 09:43:02 2024 +0100

    initial RPO loss (#1686)

    * initial RPO loss

    * fix sign

    * clean up

commit 151a452
Author: Samuel <[email protected]>
Date:   Wed May 29 20:29:38 2024 +0200

    Fix max completion length (#1588)

commit 488b502
Author: Younes Belkada <[email protected]>
Date:   Wed May 29 20:19:26 2024 +0200

    fix (#1678)

commit 3c0a10b
Author: Wang, Yi <[email protected]>
Date:   Mon May 27 20:52:20 2024 +0800

    fix dataset load error (#1670)

    Signed-off-by: Wang, Yi <[email protected]>

commit b031adf
Author: Younes Belkada <[email protected]>
Date:   Fri May 24 15:20:16 2024 +0200

    FIX / PPO: Fix `enable_input_require_grads` issues with PPO models (#1664)

    * Update modeling_base.py

    * Update ppo_config.py

    * Update ppo_trainer.py

    * style

commit e7cb597
Author: Costa Huang <[email protected]>
Date:   Thu May 23 11:37:16 2024 -0400

    Fix ppov2 test case (#1661)

    * Fix PPOv2 / RLOO refactor's stuff

    * update terminology to use stop token

commit bc8dfbf
Author: Kashif Rasul <[email protected]>
Date:   Thu May 23 15:28:04 2024 +0200

    update eval_strategy (#1662)

commit e4ed7a3
Author: Sourab Mangrulkar <[email protected]>
Date:   Thu May 23 18:34:22 2024 +0530

    do not upcast adapters when using FSDP+QLoRA (#1654)

commit 9a7efbd
Author: syrn1k <[email protected]>
Date:   Thu May 23 15:58:49 2024 +0300

    🤫 TR-DPO implementation (#1593)

    * 🤫 TR-DPO implementation baseline

    * fix comments

    * docs

    * fix linters

    * test added

    * move configs to DPOConfig

    * fix typo

    * add docs

    * fix import

    * use state.global_step

    * fix order of arguments

    * make sure plugins are not none

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <[email protected]>

    * Update trl/trainer/utils.py

    Co-authored-by: Benjamin Bossan <[email protected]>

    * checking that reference model weights have changed

    * sync_target_model as staticmethod

    * set reference model

    ---------

    Co-authored-by: Nikita Surnachev <[email protected]>
    Co-authored-by: Kashif Rasul <[email protected]>
    Co-authored-by: Benjamin Bossan <[email protected]>
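
A rough sketch of the reference-model sync that TR-DPO performs periodically during training, as a soft update mixing policy and reference weights; the function and parameter names here are assumptions, not the trainer's exact implementation:

```python
import torch

@torch.no_grad()
def sync_target_model(policy: torch.nn.Module, ref_model: torch.nn.Module, alpha: float) -> None:
    # Soft update: ref <- alpha * policy + (1 - alpha) * ref.
    # alpha = 1.0 reduces to a hard copy of the policy weights.
    for ref_param, policy_param in zip(ref_model.parameters(), policy.parameters()):
        ref_param.data.mul_(1.0 - alpha).add_(policy_param.data, alpha=alpha)

policy = torch.nn.Linear(4, 4)
ref_model = torch.nn.Linear(4, 4)
sync_target_model(policy, ref_model, alpha=0.5)
```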

commit b344bce
Author: Anush Kini <[email protected]>
Date:   Thu May 23 18:27:25 2024 +0530

    [DPO] Add 'robust' loss_type (#1653)

    * Initial commit

    * pre-commit fix

    * Minor change to comments

    * Added some documentation on how to use Robust DPO

commit 35e12dc
Author: Nicolinho <[email protected]>
Date:   Thu May 23 14:36:15 2024 +0200

    Fix inheritance order in PPOv2Config (#1659)

    * fix inheritance order in PPOv2Config

    * fix inheritance order in rloo_config

commit 1da6be1
Author: Ali Bakly <[email protected]>
Date:   Thu May 23 14:10:29 2024 +0200

    docs: correct cDPO usage in DPOTrainer (#1655)

commit e249cd8
Author: Younes Belkada <[email protected]>
Date:   Thu May 23 14:10:05 2024 +0200

    add support for training collator (#1658)

commit a02513c
Author: Zach Mueller <[email protected]>
Date:   Thu May 23 06:48:00 2024 -0400

    Apply deprecated `evaluation_strategy` (#1559)

    * Deprecate

    * Update tests/test_dpo_trainer.py

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>

commit 13454d2
Author: Costa Huang <[email protected]>
Date:   Wed May 22 08:31:10 2024 -0400

    PPO / Reinforce Trainers (#1540)

    * Add ppov2 trainer

    * make eos trick optional, remove unused args

    * quick fix

    * precommit

    * update debugging script

    * fix out of bound `drop_last=True`; use built-in scheduler

    * Add PPO examples

    * push changes

    * quick change

    * quick change

    * various bug fixes

    * remove unnecessary grad accumulation setting

    * push new changes

    * fix DS3 model saving

    * update ppo.py

    * refactor

    * quick change

    * refactor

    * update ppo trainer

    * refactor

    * quick test

    * add ds2 /ds3 7 processes config

    * add vllm trainer

    * quick change

    * experiment with reward normalization

    * push changes

    * quick push

    * push changes

    * push various changes

    * refactor to use ModelConfig

    * quick change

    * refactor

    * refactor

    * Simplify DS logic

    * quick update

    * remove unnecessary files

    * precommit

    * deepspeed fix; handle edge case when eos_token_id = 0

    * add PPO tldr example

    * add TL;DR example

    * fix undefined var

    * utilize all samples in rloo

    * quick setting

    * remove the unnecessary `value_model`

    * use exact_div

    * allow saving the deepspeed model

    * refactor

    * remove dead code

    * Use some shared utilities

    * add some end-to-end test cases

    * add PPOv2 docs and RLOO docs / tests

    * update docs

    * quick push

    * fix ci

    * fix type annotation for ci

    * quick update

    * update trainer docs
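
The `exact_div` mentioned in the commit above is, roughly, an integer division that raises instead of rounding; a hedged sketch:

```python
def exact_div(a: int, b: int, custom_error_message: str = "") -> int:
    # Integer division that raises instead of silently truncating when
    # `a` is not an exact multiple of `b` (e.g. batch vs. mini-batch sizes).
    q = a // b
    if a != q * b:
        raise ValueError(f"{custom_error_message}, inexact division: {a} / {b} = {a / b}")
    return q

print(exact_div(64, 8))  # 8
# exact_div(64, 7) would raise a ValueError
```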

commit 99f2c94
Author: Sourab Mangrulkar <[email protected]>
Date:   Wed May 15 19:55:46 2024 +0530

    don't cast the trainable lora layers to half precision (#1644)

    * don't cast the trainable lora layers to half precision

    * quality

commit 6401d08
Author: Wing Lian <[email protected]>
Date:   Tue May 14 09:41:07 2024 -0400

    Pairwise Noise Contrastive Alignment (#1632)

    * add NCA paired preference loss

    * chore: lint

    * set more lenient tolerance for integration tests

    * Update tests/test_dpo_trainer.py

    * skip test

    * fix

    ---------

    Co-authored-by: Younes Belkada <[email protected]>
    Co-authored-by: younesbelkada <[email protected]>

commit d632a5b
Author: bartoszzuk <[email protected]>
Date:   Tue May 14 12:25:54 2024 +0200

    Fixed wrong logs prefixes in KTOTrainer (#1641)

    * Fixed wrong logs prefixes in KTOTrainer

    * Pre-commit formatting

commit 5aeb752
Author: Tiezhen WANG <[email protected]>
Date:   Fri May 10 23:19:15 2024 +0800

    Update sft_llama2.py to work with the latest API (#1637)

    * Update sft_llama2.py to work with the latest API

    SFTTrainer now takes an SFTConfig argument

    * Update dpo_llama2.py

    * precommit

commit b8b8978
Author: Ilya Gusev <[email protected]>
Date:   Fri May 10 15:43:13 2024 +0200

    [ORPO] Correct label mask for pad tokens (#1625)

    * [ORPO] Correct label mask for pad tokens

    Recent [fix](57aebe9) for calculating NLL loss for a whole sequence introduced a bug. When input_ids are copied to labels, pad tokens are not masked.

    This PR aims to patch this by masking labels based on the attention mask.

    * -100 -> label_pad_token_id

    Co-authored-by: Kashif Rasul <[email protected]>

    ---------

    Co-authored-by: Kashif Rasul <[email protected]>
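
A minimal sketch of the fix with illustrative tensors (`label_pad_token_id` is typically -100):

```python
import torch

label_pad_token_id = -100
input_ids = torch.tensor([[5, 6, 7, 0, 0]])       # 0 is the pad token here
attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

# When input_ids are copied to labels, pad positions must be masked out so
# they do not contribute to the NLL loss.
labels = input_ids.clone()
labels[attention_mask == 0] = label_pad_token_id
print(labels)  # tensor([[   5,    6,    7, -100, -100]])
```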

commit 8799952
Author: Costa Huang <[email protected]>
Date:   Fri May 10 09:32:20 2024 -0400

    visualize rm prediction (#1636)

    * visualize rm prediction

    * quick update

    * quick check

    * quick fix

    * update eval steps

commit 3b4c249
Author: Xiao Yu <[email protected]>
Date:   Fri May 3 18:19:35 2024 -0400

    fixed adding bos and eos token unconditionally (#1591)

    * fixed adding bos and eos token unconditionally

    * fixed typo of tokenizer -> self.tokenizer. Also added update to ORPO

    * fixed code quality, and added BOS/EOS fix to KTO

    * code reformatting with pre-commit run --all-files

    * bug fix: check input id length before checking for EOS/BOS
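
A sketch of the conditional BOS/EOS handling described in these bullets; the token ids and helper name are illustrative:

```python
def add_special_tokens(input_ids: list[int], bos_token_id: int, eos_token_id: int) -> list[int]:
    # Check the length first so an empty sequence does not raise an IndexError,
    # and only add BOS/EOS when they are not already present.
    if len(input_ids) == 0 or input_ids[0] != bos_token_id:
        input_ids = [bos_token_id] + input_ids
    if input_ids[-1] != eos_token_id:
        input_ids = input_ids + [eos_token_id]
    return input_ids

print(add_special_tokens([42, 43], bos_token_id=1, eos_token_id=2))  # [1, 42, 43, 2]
```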

commit 0347f58
Author: lewtun <[email protected]>
Date:   Fri May 3 15:59:59 2024 +0200

    Fix ZeRO-3 generation context manager (#1617)

* judge refactoring and unittest

* format

* init

* doc

* format

* improve doc

* basejudge

* improve doc and add BaseAPIJudge

* Doc

* style

* refactor callback

* remove openai and pairrm judge from test

* doc

* rm dpo online example

* new prompts and completions

* skip hf judge and add hf token

---------

Co-authored-by: Quentin Gallouédec <[email protected]>
Co-authored-by: Quentin Gallouédec <[email protected]>