
[RFC] PPO Performance Optimizations (or: PPOPO) #1425

Closed
SalmanMohammadi opened this issue Aug 28, 2024 · 0 comments · Fixed by #2066
SalmanMohammadi commented Aug 28, 2024

We provide many wonderful examples of using torch.compile in our repo. Integrating compile into the RLHF recipe could significantly improve performance. It'd also be a unique selling point of the recipe.

In general, should we try to compile the largest chunks of code possible, so as to give the compiler as much opportunity to optimize as it can? If so, I'll perhaps re-order the suggestions below as 1) compile generation, 2) compile whole-trajectory generation, 3) compile the loss step.

W.r.t. trajectory generation: it involves the following steps:

"""
        1. Generate responses, and logits corresponding to the responses, using the current policy,
            generating (query, response) pairs.
        2. Estimate logprobs of the generated responses using the current policy.
        3. Estimate values from the generated responses using the current value function.
        4. Replace any tokens in the response after the first stop token (usually EOS token) with padding,
            producing truncated responses.
        5. Run the reward model on the (query, truncated-response) pairs.
        6. Mask out all the invalid values in the trajectory due to padding tokens.
"""

There are two options here:

  1. Refactor step 1), the generation step, to use KV-cache generation with generate_next_token compiled. Then compile steps 2-6 together as one separate function (sketched below).
  2. Compile all steps 1-6 without compiling generate_next_token separately.
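To make option 1 concrete, here's a minimal sketch of the split. The tiny `nn.Sequential` models and the masking/scoring logic below are illustrative stand-ins, not existing recipe code; in the recipe these would be the actual policy/value/reward models and the real trajectory utilities.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is self-contained; the recipe would use its
# actual TransformerDecoder-based policy / value / reward models.
vocab_size, hidden = 128, 32
policy = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
value_model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))
reward_model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))


# Step 1: only the per-token decode step is compiled, so the Python generation
# loop (and KV-cache bookkeeping) stays outside the graph. In the real recipe
# this would be a single-token decode against a KV-cache.
@torch.compile
def generate_next_token(tokens):
    logits = policy(tokens)  # [b, s, vocab]
    return logits[:, -1].argmax(dim=-1, keepdim=True)


# Steps 2-6 compiled together as one function: logprobs, values,
# truncation/masking, and reward scoring all land in a single graph.
@torch.compile
def score_trajectory(query_responses, responses, stop_token_id, pad_id):
    num_response_tokens = responses.shape[1]
    logits = policy(query_responses)[:, -num_response_tokens - 1 : -1]  # step 2
    logprobs = torch.gather(
        torch.log_softmax(logits, dim=-1), 2, responses.unsqueeze(-1)
    ).squeeze(-1)
    values = value_model(query_responses).squeeze(-1)[:, -num_response_tokens:]  # step 3
    # True at every position strictly after the first stop token (steps 4 and 6).
    is_stop = (responses == stop_token_id).long()
    padding_mask = (is_stop.cumsum(dim=1) - is_stop) > 0
    truncated = torch.where(padding_mask, torch.full_like(responses, pad_id), responses)
    queries = query_responses[:, :-num_response_tokens]
    rewards = reward_model(torch.cat([queries, truncated], dim=1)).squeeze(-1)[:, -1]  # step 5
    return logprobs, values, rewards, padding_mask
```

If option 2 wins out, the same `score_trajectory` body would simply absorb the generation loop as well, and we'd compile the whole thing as one function.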

Note on the above: I think in gpt-fast both the prefill step and generate_next_token are compiled, each separately.
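For reference, a rough sketch of that split (from my recollection of gpt-fast, simplified; `model` here is assumed to be a KV-cache-enabled decoder taking an `input_pos` argument): prefill is compiled with dynamic shapes since prompt lengths vary, while the single-token decode step has static shapes and can use the CUDA-graph-backed "reduce-overhead" mode.

```python
import torch

def prefill(model, tokens, input_pos):
    # Process the whole prompt in one forward pass, populating the KV-cache.
    # Prompt lengths vary, so this is compiled with dynamic shapes.
    logits = model(tokens, input_pos=input_pos)
    return logits[:, -1].argmax(dim=-1, keepdim=True)

def decode_one_token(model, token, input_pos):
    # Single-token decode against the KV-cache; shapes are static, so the
    # "reduce-overhead" (CUDA graphs) mode applies cleanly.
    logits = model(token, input_pos=input_pos)
    return logits[:, -1].argmax(dim=-1, keepdim=True)

prefill = torch.compile(prefill, fullgraph=True, dynamic=True)
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```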

The plan (subject to the above):

  1. Let's completely disable compile in the recipe. Currently, compiling our policy and value models causes significant recompilation, since tensors switch between inference and training modes. Obtain a sensible benchmark here vs. default compile settings.
  2. Integrate [RFC] Batched inference 🤝 KV-cache 🤝 compile #1424 into the recipe, and enable KV-cache generation; compare against the above (using whichever global compile config is the fastest).
  3. As above, but with compiled generation. Steps 2 and 3 alone should provide order(s?)-of-magnitude speedups in trajectory generation.
  4. Compile the loss step (a sketch follows after this list).
  5. Compile the entire trajectory generation step?
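For step 4, a minimal sketch of what compiling the loss step could look like, using the standard clipped-surrogate PPO loss plus an MSE value loss. The recipe's actual loss has more moving parts (e.g. value clipping); this only illustrates wrapping the whole loss computation in a single compiled function.

```python
import torch

@torch.compile
def ppo_loss_step(pi_logprobs, old_logprobs, advantages, values, returns, epsilon=0.2):
    # Standard clipped-surrogate PPO policy loss plus an MSE value loss,
    # compiled as one graph so the whole loss computation fuses.
    ratios = torch.exp(pi_logprobs - old_logprobs)
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
    policy_loss = -torch.minimum(ratios * advantages, clipped * advantages).mean()
    value_loss = 0.5 * (values - returns).pow(2).mean()
    return policy_loss + value_loss, policy_loss, value_loss
```

For step 1, running the recipe with `TORCH_LOGS="recompiles"` should make it easy to confirm where the inference/training mode switches are triggering recompilation.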
@pytorch pytorch deleted a comment Aug 28, 2024
@SalmanMohammadi SalmanMohammadi changed the title [RFC] PPO 🤝 Compiled KV-cache-enabled generation 🤝 model compile [RFC] PPO Performance Optimizations (or: PPOPO) Aug 28, 2024
@SalmanMohammadi SalmanMohammadi mentioned this issue Nov 15, 2024