We provide many wonderful examples of using torch.compile in our repo. Integrating compile into the RLHF recipe could significantly improve performance. It'd also be a unique selling point of the recipe.
In general, should we try to compile the largest chunks of code possible, so as to give the compiler as much opportunity as possible to optimize? If so, I'll perhaps re-order the suggestions below as 1) compile generation 2) compile whole-trajectory generation 3) compile loss step.
""" 1: Generate responses, and logits corresponding to the responses using the current policy, generating (query, response) pairs. 2. Estimate logprobs of the generated responses using the current policy. 3. Estimate values from the generated responses using the current value function. 4. Replace any tokens in the response after the first stop token (usually EOS token) with padding, producting truncated responses. 5. Run the reward model on the (query, truncated-response) pairs. 6. Mask out all the invalid values in the trajectory due to padding tokens."""
There are two options here:
1. Refactor step 1), the generation step, and use kv-cache generation with generate_next_token compiled. Then compile steps 2-6 as one function, separately from generation.
2. Compile all of steps 1-6, without compiling generate_next_token separately.
Note on the above: I think in gpt-fast both the prefill step and generate_next_token are compiled separately.
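For concreteness, here is a rough sketch of what option 1 could look like, mirroring that gpt-fast split. Every function and argument name below is hypothetical, and the model call signatures are assumptions rather than the recipe's actual API:

```python
import torch

def generate_next_token(model, tokens, input_pos):
    # One kv-cache decode step. Shapes stay static across the decode loop,
    # so fullgraph=True and mode="reduce-overhead" (CUDA graphs) are the
    # natural settings here, as in gpt-fast.
    logits = model(tokens, input_pos=input_pos)
    return torch.argmax(logits[:, -1], dim=-1, keepdim=True)


def score_trajectory(policy, value_model, reward_model, queries, responses, logits, pad_id):
    # Steps 2-6 as a single compiled region: logprobs of the sampled tokens,
    # value estimates, reward scoring, and padding masks. (Truncation after
    # the stop token, step 4, is elided here; see the helper sketched above.)
    logprobs = torch.gather(
        torch.log_softmax(logits, dim=-1), dim=-1, index=responses.unsqueeze(-1)
    ).squeeze(-1)
    sequences = torch.cat([queries, responses], dim=-1)
    values = value_model(sequences)
    rewards = reward_model(sequences)
    padding_mask = responses == pad_id
    return logprobs.masked_fill(padding_mask, 0.0), values, rewards, padding_mask


compiled_next_token = torch.compile(generate_next_token, mode="reduce-overhead", fullgraph=True)
compiled_score_trajectory = torch.compile(score_trajectory)
```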
The plan (subject to the above)
Let's completely disable compile in the recipe. Currently, compiling our policy and value models causes significant recompilation, since tensors are switching between inference and training modes. Obtain a sensible benchmark here vs. the default compile settings.
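Before changing anything, it helps to make those recompiles visible so the baseline numbers can be trusted. A small sketch using standard PyTorch knobs (nothing recipe-specific); switching the policy between no-grad generation and the training step is exactly the kind of guard change this will surface:

```python
import torch

# Print a reason every time dynamo recompiles a region; the same output is
# available by running the recipe with TORCH_LOGS="recompiles".
torch._logging.set_logs(recompiles=True)

# A compiled region falls back to eager once it hits the cache size limit,
# so raising it makes the recompile churn easier to observe end to end.
torch._dynamo.config.cache_size_limit = 16
```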