So llm.c breaks the text into a batch of B sequences for input.
The target is the same text shifted right by one token, forming the next-token prediction task.
So the samples of a batch look like this:
[x₀], [x₀, x₁], ..., [x₀, ..., xₜ₋₁] → [x₁, x₂, ..., xₜ]
[xₜ], [xₜ, xₜ₊₁], ..., [xₜ, ..., x₂ₜ₋₁] → [xₜ₊₁, xₜ₊₂, ..., x₂ₜ]
...
where t is the context length. Once a batch is processed, the pointer advances by B ⋅ t for the next batch.
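Here is a minimal sketch of that scheme (simplified from llm.c's actual dataloader; the function name, token type, and wrap-around handling are illustrative, not the real code):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the current batching: inputs are B*t consecutive tokens,
// targets are the same tokens shifted right by one, and the pointer
// advances by B*t per batch (no overlap between batches).
void next_batch(const uint16_t *tokens, size_t num_tokens, size_t *pos,
                int B, int t, uint16_t *inputs, uint16_t *targets) {
    // a batch needs B*t input tokens plus one extra for the last target
    if (*pos + (size_t)B * t + 1 > num_tokens) *pos = 0;  // wrap around
    for (int i = 0; i < B * t; i++) {
        inputs[i]  = tokens[*pos + i];      // x_pos ... x_{pos+B·t-1}
        targets[i] = tokens[*pos + i + 1];  // same span, shifted right by one
    }
    *pos += (size_t)B * t;  // advance to entirely fresh tokens
}
```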
My suggestion: instead of advancing the pointer by B ⋅ t, slide it by 1 (an overlapping window) to generate additional batches; see the sketch after the examples below. The samples of the next batch then look like:
[x₁], [x₁, x₂], ..., [x₁, ..., xₜ] → [x₂, x₃, ..., xₜ₊₁]
[xₜ₊₁], [xₜ₊₁, xₜ₊₂], ..., [xₜ₊₁, ..., x₂ₜ] → [xₜ₊₂, xₜ₊₃, ..., x₂ₜ₊₁]
...
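In terms of the sketch above, the proposal amounts to changing only the pointer update at the end of `next_batch`:

```c
*pos += 1;  // overlapping window: the next batch starts one token later
```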
Why does this matter?
From B ⋅ t tokens, the number of training samples with k input tokens increases from just B to roughly B ⋅ t.
This is an improvement factor of t 💪
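For illustration (the numbers are arbitrary): with B = 4 and t = 1024, a window of B ⋅ t = 4096 tokens currently yields just 4 training samples of each input length k, whereas with the overlapping window nearly every one of the 4096 token positions starts a sample of length k.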
See here, where Andrej explains the current code.