So llm.c breaks the text into a batch of B sequences for input.
The target is the same text shifted right by one token, forming the next-token prediction task.
So the samples of a batch look like this:
[x₀], [x₀, x₁], ..., [x₀, ..., xₜ₋₁] → [x₁, x₂, ..., xₜ]
[xₜ], [xₜ, xₜ₊₁], ..., [xₜ, ..., x₂ₜ₋₁] → [xₜ₊₁, xₜ₊₂, ..., x₂ₜ]
...
where t is the context length. Once a batch is processed, the pointer advances by B ⋅ t for the next batch.
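Here is a minimal sketch of that scheme (simplified from llm.c's actual dataloader; the function name, token type, and wrap-around handling are illustrative, not the real code):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the current batching: inputs are B*t consecutive tokens,
// targets are the same tokens shifted right by one, and the pointer
// advances by B*t per batch (no overlap between batches).
void next_batch(const uint16_t *tokens, size_t num_tokens, size_t *pos,
                int B, int t, uint16_t *inputs, uint16_t *targets) {
    // a batch needs B*t input tokens plus one extra for the last target
    if (*pos + (size_t)B * t + 1 > num_tokens) *pos = 0;  // wrap around
    for (int i = 0; i < B * t; i++) {
        inputs[i]  = tokens[*pos + i];      // x_pos ... x_{pos+B·t-1}
        targets[i] = tokens[*pos + i + 1];  // same span, shifted right by one
    }
    *pos += (size_t)B * t;  // advance to entirely fresh tokens
}
```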
My suggestion: instead of advancing the pointer by B ⋅ t, slide it by 1 (an overlapping window) to generate additional batches; see the sketch after the examples below. The samples of the next batch then look like:
[x₁], [x₁, x₂], ..., [x₁, ..., xₜ] → [x₂, x₃, ..., xₜ₊₁]
[xₜ₊₁], [xₜ₊₁, xₜ₊₂], ..., [xₜ₊₁, ..., x₂ₜ] → [xₜ₊₂, xₜ₊₃, ..., x₂ₜ₊₁]
...
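In terms of the sketch above, the proposal amounts to changing only the pointer update at the end of `next_batch`:

```c
*pos += 1;  // overlapping window: the next batch starts one token later
```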
Why does this matter?
From B ⋅ t tokens, the number of training samples with k input tokens increases from just B to roughly B ⋅ t.
This is an improvement factor of t 💪
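For illustration (the numbers are arbitrary): with B = 4 and t = 1024, a window of B ⋅ t = 4096 tokens currently yields just 4 training samples of each input length k, whereas with the overlapping window nearly every one of the 4096 token positions starts a sample of length k.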
See here, where Andrej explains the current code.