
[RFC] PPO Performance Optimizations (or: PPOPO) #1425

Closed
SalmanMohammadi opened this issue Aug 28, 2024 · 0 comments · Fixed by #2066
SalmanMohammadi commented Aug 28, 2024

We provide many wonderful examples of using torch.compile in our repo. Integrating compile into the RLHF recipe could significantly improve performance. It'd also be a unique selling point of the recipe.

In general, should we try to compile the largest chunks of code possible, so as to give the compiler as much opportunity to optimize as it can? If so, I'll perhaps re-order the suggestions below as 1) compile generation, 2) compile whole-trajectory generation, 3) compile the loss step.

W.r.t. trajectory generation: it involves the following steps:

"""
        1. Generate responses, and logits corresponding to the responses, using the current policy,
            generating (query, response) pairs.
        2. Estimate logprobs of the generated responses using the current policy.
        3. Estimate values from the generated responses using the current value function.
        4. Replace any tokens in the response after the first stop token (usually EOS token) with padding,
            producing truncated responses.
        5. Run the reward model on the (query, truncated-response) pairs.
        6. Mask out all the invalid values in the trajectory due to padding tokens.
"""

There are two options here:

  1. Refactor step 1), the generation step, to use KV-cache generation with generate_next_token compiled. Then compile steps 2-6 together as one separate function (sketched below).
  2. Compile all steps 1-6 without compiling generate_next_token separately.
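To make option 1 concrete, here's a minimal sketch of the split. The tiny `nn.Sequential` models and the masking/scoring logic below are illustrative stand-ins, not existing recipe code; in the recipe these would be the actual policy/value/reward models and the real trajectory utilities.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is self-contained; the recipe would use its
# actual TransformerDecoder-based policy / value / reward models.
vocab_size, hidden = 128, 32
policy = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
value_model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))
reward_model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, 1))


# Step 1: only the per-token decode step is compiled, so the Python generation
# loop (and KV-cache bookkeeping) stays outside the graph. In the real recipe
# this would be a single-token decode against a KV-cache.
@torch.compile
def generate_next_token(tokens):
    logits = policy(tokens)  # [b, s, vocab]
    return logits[:, -1].argmax(dim=-1, keepdim=True)


# Steps 2-6 compiled together as one function: logprobs, values,
# truncation/masking, and reward scoring all land in a single graph.
@torch.compile
def score_trajectory(query_responses, responses, stop_token_id, pad_id):
    num_response_tokens = responses.shape[1]
    logits = policy(query_responses)[:, -num_response_tokens - 1 : -1]  # step 2
    logprobs = torch.gather(
        torch.log_softmax(logits, dim=-1), 2, responses.unsqueeze(-1)
    ).squeeze(-1)
    values = value_model(query_responses).squeeze(-1)[:, -num_response_tokens:]  # step 3
    # True at every position strictly after the first stop token (steps 4 and 6).
    is_stop = (responses == stop_token_id).long()
    padding_mask = (is_stop.cumsum(dim=1) - is_stop) > 0
    truncated = torch.where(padding_mask, torch.full_like(responses, pad_id), responses)
    queries = query_responses[:, :-num_response_tokens]
    rewards = reward_model(torch.cat([queries, truncated], dim=1)).squeeze(-1)[:, -1]  # step 5
    return logprobs, values, rewards, padding_mask
```

If option 2 wins out, the same `score_trajectory` body would simply absorb the generation loop as well, and we'd compile the whole thing as one function.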

Note on the above: I think in gpt-fast both the prefill step and generate_next_token are compiled, each separately.
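For reference, a rough sketch of that split (from my recollection of gpt-fast, simplified; `model` here is assumed to be a KV-cache-enabled decoder taking an `input_pos` argument): prefill is compiled with dynamic shapes since prompt lengths vary, while the single-token decode step has static shapes and can use the CUDA-graph-backed "reduce-overhead" mode.

```python
import torch

def prefill(model, tokens, input_pos):
    # Process the whole prompt in one forward pass, populating the KV-cache.
    # Prompt lengths vary, so this is compiled with dynamic shapes.
    logits = model(tokens, input_pos=input_pos)
    return logits[:, -1].argmax(dim=-1, keepdim=True)

def decode_one_token(model, token, input_pos):
    # Single-token decode against the KV-cache; shapes are static, so the
    # "reduce-overhead" (CUDA graphs) mode applies cleanly.
    logits = model(token, input_pos=input_pos)
    return logits[:, -1].argmax(dim=-1, keepdim=True)

prefill = torch.compile(prefill, fullgraph=True, dynamic=True)
decode_one_token = torch.compile(decode_one_token, mode="reduce-overhead", fullgraph=True)
```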

The plan (subject to the above):

  1. Let's completely disable compile in the recipe. Currently, compiling our policy and value models causes significant recompilation, since tensors switch between inference and training modes. Obtain a sensible benchmark here vs. default compile settings.
  2. Integrate [RFC] Batched inference 🤝 KV-cache 🤝 compile #1424 into the recipe, and enable KV-cache generation; compare against the above (using whichever global compile config is the fastest).
  3. As above, but with compiled generation. Steps 2 and 3 alone should provide order(s?)-of-magnitude speedups in trajectory generation.
  4. Compile the loss step (a sketch follows after this list).
  5. Compile the entire trajectory generation step?
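For step 4, a minimal sketch of what compiling the loss step could look like, using the standard clipped-surrogate PPO loss plus an MSE value loss. The recipe's actual loss has more moving parts (e.g. value clipping); this only illustrates wrapping the whole loss computation in a single compiled function.

```python
import torch

@torch.compile
def ppo_loss_step(pi_logprobs, old_logprobs, advantages, values, returns, epsilon=0.2):
    # Standard clipped-surrogate PPO policy loss plus an MSE value loss,
    # compiled as one graph so the whole loss computation fuses.
    ratios = torch.exp(pi_logprobs - old_logprobs)
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
    policy_loss = -torch.minimum(ratios * advantages, clipped * advantages).mean()
    value_loss = 0.5 * (values - returns).pow(2).mean()
    return policy_loss + value_loss, policy_loss, value_loss
```

For step 1, running the recipe with `TORCH_LOGS="recompiles"` should make it easy to confirm where the inference/training mode switches are triggering recompilation.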
@pytorch pytorch deleted a comment Aug 28, 2024
@SalmanMohammadi SalmanMohammadi changed the title [RFC] PPO 🤝 Compiled KV-cache-enabled generation 🤝 model compile [RFC] PPO Performance Optimizations (or: PPOPO) Aug 28, 2024
@SalmanMohammadi SalmanMohammadi mentioned this issue Nov 15, 2024