Currently, when the context becomes full, we keep part of the tokens and recompute the KV cache from scratch.
Instead, try one of the following:

- store a non-RoPEd KV cache, "shift" it when the context is full, and compute the RoPE over the entire cache for every new token, taking the current positions into account
- store a RoPEd KV cache (as we do now), "shift" it when the context is full, and apply an extra shift-RoPE on it (assuming RoPE is "additive")
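The "additive" assumption behind the second option can be checked directly: rotating an already-RoPEd vector by the position delta gives the same result as applying RoPE at the shifted position, since the rotations per frequency compose by adding angles. A minimal NumPy sketch (the `rope` helper here is illustrative, not llama.cpp code):

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to a vector of even dimension at position `pos`.

    Each dimension pair (2i, 2i+1) is rotated by angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    xc = x[0::2] + 1j * x[1::2]                  # view pairs as complex numbers
    xc = xc * np.exp(1j * pos * freqs)           # rotate each pair
    out = np.empty_like(x)
    out[0::2], out[1::2] = xc.real, xc.imag
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Shift-RoPE: re-rotating a cached (already RoPEd) entry by the delta -3
# matches RoPE-ing the raw vector at the shifted position 10 - 3 = 7.
shifted = rope(rope(x, pos=10), pos=-3)
direct = rope(x, pos=7)
assert np.allclose(shifted, direct)
```

This is why the shift can be applied to the RoPEd cache in place, without storing the pre-RoPE values.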
Storing a non-RoPEd KV cache would allow us to implement dynamic NTK or YaRN RoPE scaling, which is the state of the art for context scaling on non-finetuned models. See section 3.3 of this paper.
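For context, dynamic NTK scaling recomputes the RoPE frequencies whenever the sequence outgrows the training context, which only works if the cache can be re-rotated (hence the non-RoPEd storage). A sketch of the base adjustment, assuming the common NTK-aware form with exponent `dim / (dim - 2)` (names and exact formula are illustrative, not llama.cpp's implementation):

```python
def ntk_scaled_base(base: float, head_dim: int, seq_len: int, train_ctx: int) -> float:
    """Grow the RoPE base once the sequence exceeds the training context,
    so the lowest frequencies are interpolated rather than extrapolated.
    Illustrative sketch of dynamic NTK scaling, not a definitive implementation.
    """
    scale = max(seq_len / train_ctx, 1.0)   # recomputed as the sequence grows
    return base * scale ** (head_dim / (head_dim - 2))

# Within the training context the base is unchanged; beyond it, it grows.
assert ntk_scaled_base(10000.0, 128, 2048, 2048) == 10000.0
assert ntk_scaled_base(10000.0, 128, 4096, 2048) > 10000.0
```

Because the effective frequencies change with `seq_len`, previously cached RoPEd entries would be stale; with a non-RoPEd cache they can simply be re-rotated under the new base.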