
Allow loss masking for defined spans of characters #113

Open
wants to merge 25 commits into base: main

Conversation

@sohamparikh (Member) commented Jan 14, 2025

✨ Description

Support loss masking for spans specified in the input data. This PR ensures that loss is not computed on the specified spans. The main use case is instruction tuning data, where we want to avoid training on the prompts.

Closes #109

📝 Changes

List the key changes introduced in this PR:

  • Support character spans as inputs specified in the prepare command (see the illustrative example below)
  • Read the spans during training and apply masks to the cross-entropy loss
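For illustration, a raw document fed to the prepare command could carry the prompt region as character offsets over the text; the field names below are assumptions for the sketch, not the schema fixed by this PR.

# Hypothetical input record for `prepare`: the span marks the prompt characters
# to exclude from the loss, so training only happens on the assistant response.
text = "User: What is the capital of France?\nAssistant: Paris."
document = {
    "text": text,
    # Character span [start, end) over the raw text (field name is illustrative).
    "char_spans": [[0, text.index("Assistant:")]],
}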

🔍 Type of change

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier (Collaborator):

Looks good so far, but can you please add a short description and/or point to an issue?

@sohamparikh changed the title from "convert character spans to token spans" to "Allow loss masking for defined spans of characters" on Jan 24, 2025
for start, end in char_spans:
    if char_pos < start:
        curr_text = text[char_pos:start]
        tokenized_text = self._tokenizer.tokenize(curr_text, add_special_tokens=beginning_of_text)
Collaborator:

This works only for tokenizers that have a BOS but not an EOS token.
For those that come with both, can we control independently whether tokenize adds the BOS and EOS tokens? I'm worried that we are adding the EOS token at the end of the first segment and the BOS token at the beginning of the last segment.

Member Author:

Good catch! I'll make it explicitly add BOS only for the first segment.
Btw, most tokenizers (Llama-3.1, Mistral-Nemo-Base-2407, OLMoE-1B-7B-0924) do not add the EOS token with add_special_tokens=True. Does this mean we've been training the models without the EOS token?

In the future I think we should make this config-driven. The default behaviour would be to add both BOS and EOS tokens. That's important for pretraining with an attention mask, and especially for SFT.

Collaborator:

> Does this mean we've been training the models without the EOS token?

Indeed, we decided that adding both BOS and EOS tokens in pretraining was unnecessary, because they are redundant. Here, though, I think we need to add them: we need to teach the model to terminate a response with the EOS token so that generation can stop at the right moment. Btw, I think HF is not adding the EOS token by default because otherwise prompts would end with it.
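For reference, a minimal sketch of the behaviour being discussed, assuming a Hugging Face tokenizer and manual handling of the special tokens (a sketch of the idea, not the PR's implementation):

from transformers import AutoTokenizer

def tokenize_segments(segments: list[str], tokenizer) -> list[list[int]]:
    # Add BOS only before the first segment and EOS only after the last one,
    # so concatenating the segments yields a single well-formed document.
    token_segments = []
    for i, segment in enumerate(segments):
        ids = tokenizer.encode(segment, add_special_tokens=False)
        if i == 0 and tokenizer.bos_token_id is not None:
            ids = [tokenizer.bos_token_id] + ids
        if i == len(segments) - 1 and tokenizer.eos_token_id is not None:
            ids = ids + [tokenizer.eos_token_id]
        token_segments.append(ids)
    return token_segments

# Segments alternate between masked (prompt) and unmasked (response) text.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
print(tokenize_segments(["User: Hi\n", "Assistant: Hello!"], tokenizer))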

exp_logits1 = exp_logits.scatter(
    1, target, exp_logits.gather(1, target) - target_mask * sum_exp_logits.unsqueeze(dim=-1)
)
exp_logits2 = exp_logits1.mul((grad_output / logits.size(0)) / sum_exp_logits.unsqueeze(dim=-1))
if logits_scale_factor != 1.0:
    exp_logits2 *= logits_scale_factor

grad = exp_logits2.to(logits.dtype)
grad.index_put_((mask,), exp_logits2.to(logits.dtype))

predicted_logits = (target_mask * logits_norm.gather(1, target)).squeeze(1)
all_reduce(predicted_logits, op=ReduceOp.SUM, group=group)
@tscholak (Collaborator) commented on Jan 24, 2025:

Does the Triton implementation support masking?

Collaborator:

I think it doesn't: https://github.com/ServiceNow/Fast-LLM/blob/soham/loss-masking-spans/fast_llm/functional/triton/cross_entropy.py
We need to add it. Since this is the same for all loss functions, it would make sense to implement it before dispatching to specialized cross-entropy implementations:

def cross_entropy_forward_backward(
    logits,
    target,
    grad_output: float | None,
    group: ProcessGroup | None,
    implementation: CrossEntropyImpl = CrossEntropyImpl.fused,
    logits_scale_factor: float = 1.0,
    ignore_index: int = -100,
) -> tuple[torch.Tensor, torch.Tensor | None]:
    ...
    mask = target != ignore_index
    target = target[mask]
    logits = logits[mask]
    ...

@sohamparikh sohamparikh marked this pull request as ready for review January 28, 2025 08:19
@sohamparikh sohamparikh marked this pull request as draft January 28, 2025 08:29

assert sample.shape[0] == sample_len
assert sample.ids.shape[0] == sample_len
return sample
Collaborator:

Since this code changes the order of tokens in the sequence, we would need to change the masks accordingly to allow FIM together with loss masking.
At this point, I think we should not do that, and instead fail if FIM is used with loss masking.

Member Author:

Raising an error in this function now (see the sketch below).
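Something along these lines, presumably; the function name and signature are placeholders, not the PR's actual code:

import numpy as np

def apply_fim(token_ids: np.ndarray, spans: np.ndarray | None) -> np.ndarray:
    # FIM reorders tokens, so spans computed on the original order would no
    # longer line up with the data; fail loudly instead of corrupting the masks.
    if spans is not None and len(spans) > 0:
        raise NotImplementedError("FIM is not supported together with loss-masking spans.")
    # ... the actual FIM transformation of token_ids would go here ...
    return token_ids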

end = np.random.RandomState(np_seed).randint(start, len(ids))
spans.append([start, end])
prev_end = end
return GPTSample(ids=ids, spans=np.array(spans, dtype=np.int32).reshape(-1, 2))
Collaborator:

That's a nice addition, though I'm not sure the random dataset actually needs spans... We use it only for benchmarking purposes, to measure training performance without being IO-bound.

Member Author:

I added it for the tests. I can make the spans empty if you think it'll mess with the benchmark numbers.

Collaborator:

Spans really need to be None here, though we could add a config parameter to generate random spans (see the sketch below).
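A possible shape for that suggestion, with an assumed opt-in flag for generating random spans (names here are illustrative, not actual config fields):

import dataclasses
import numpy as np

@dataclasses.dataclass
class GPTSample:  # minimal stand-in for the PR's dataclass
    ids: np.ndarray
    spans: np.ndarray | None = None

def build_random_sample(ids: np.ndarray, np_seed: int, use_loss_masking_spans: bool = False) -> GPTSample:
    # Spans stay None by default so benchmarks are unaffected; random spans
    # are only generated when explicitly requested (e.g. by tests).
    spans = None
    if use_loss_masking_spans:
        rng = np.random.RandomState(np_seed)
        start = int(rng.randint(0, len(ids)))
        end = int(rng.randint(start, len(ids)))
        spans = np.array([[start, end]], dtype=np.int32)
    return GPTSample(ids=ids, spans=spans)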

@tscholak (Collaborator) left a comment:

Hi @sohamparikh, nice progress!
Is anything functional missing from this PR? Do the spans make it all the way to the loss function, and does packing work as expected with spans? Can we test this?

@tscholak tscholak requested a review from jlamypoirier January 28, 2025 20:34
@sohamparikh (Member Author) commented Jan 28, 2025:

Functionally, it's good to go now.
I've tested prepare and train with SmolLM2-135M on a single GPU using a dummy dataset with spans. It seems to be working fine, including packing and the loss functions.

How do we want to test this? I can test it on a bigger model with multiple nodes if that makes sense.

@sohamparikh sohamparikh marked this pull request as ready for review January 28, 2025 22:33
@jlamypoirier (Collaborator) left a comment:

I reviewed the overall structure. It looks good for the intended purpose but will need polishing. Main areas to focus on (see individual comments):

  • Spans need to be opt-in, so that there is negligible impact when they are not used (which is most of the time).
  • Turning samples and friends into dataclasses is a good idea but goes a bit outside the present scope. Ideally it would go in a separate PR, but it's ok to include here if done well (see comments).
  • Names need to follow our style guide (https://servicenow.github.io/Fast-LLM/contributing/style-guide/) a bit better. Please use self-descriptive names as much as possible; e.g., spans is a bit cryptic and could mean lots of things. (Not sure what to replace it with, but it should hint at loss masking in some way.)

@@ -34,3 +41,8 @@ class TokenizerConfig(Config):
        desc="Path to the tokenizer file.",
        hint=FieldHint.core,
    )
    special_tokens_mode: SpecialTokensMode = Field(
@jlamypoirier (Collaborator) commented on Jan 28, 2025:

This doesn't look self-descriptive enough. Could we think of a better name, and improve the description? (also not sure what this does?)

@@ -21,12 +22,17 @@
logger = logging.getLogger(__name__)


@dataclasses.dataclass
class GPTSample:
Collaborator:

This seems like a nice addition, but it will have an impact far beyond the scope of the current PR, so we need to do it carefully (same for other similar dataclasses):

  • Use meaningful field names (ids -> token_ids, spans -> ?).
  • If we're going the dataclass way we should go all the way, i.e. inherit from a Sample base class and adjust type hints everywhere (see the sketch below). Same for the batch thing. No need to do it in this PR, but if not we'll need an issue to refer to.
  • The custom model also needs to be adjusted since it inherits from GPT.
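A rough sketch of the suggested direction, with renamed fields and an assumed Sample base class (these names are suggestions, not decisions made in this PR):

import dataclasses
import numpy as np

@dataclasses.dataclass
class Sample:
    """Assumed common base class for model-specific samples."""

@dataclasses.dataclass
class GPTSample(Sample):
    token_ids: np.ndarray
    # Opt-in: None when loss masking is not used, so the common path stays cheap.
    loss_masking_spans: np.ndarray | None = None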

spans: np.ndarray


@dataclasses.dataclass
Collaborator:

Redundant with GPTSample (and breaks type hint)

@@ -10,6 +11,18 @@
from fast_llm.utils import Assert, div


@dataclasses.dataclass
class GPTMemmapDocument:
Collaborator:

GPTDocument (nothing to do with memmap); also not sure it belongs in this file.
Anyway, isn't this also the same as GPTSample?

@@ -10,6 +11,18 @@
from fast_llm.utils import Assert, div


@dataclasses.dataclass
class GPTMemmapDocument:
text: np.ndarray
Collaborator:

Isn't this ids?


logits, target, grad_output, logits_scale_factor=logits_scale_factor
)
if grad_logits is not None:
Collaborator:

This needs to go inside each implementation, because each of them can be optimized in its own way. The torch implementation has ignore_index already, the compiled version can include this inside the compile block, and the Triton kernels can include masking. For the Triton part you can keep this if you don't know Triton (it's a really easy one, though).

Also, torch.where would do a better job here.

(And as usual, masking needs to be opt-in.)

Member Author:

> (And as usual, masking needs to be opt-in.)

Do you mean an additional flag indicating whether loss masking should take place (using the config option for reading spans)?
I'm not clear on why ignore_index isn't sufficient, since it wouldn't be set without the spans config flag anyway.

Collaborator:

Yes. ignore_index isn't sufficient because it would slow things down when not in use (see the sketch below).
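For illustration, an opt-in version of the torch path could look roughly like this; the function and parameter names are placeholders, and a None mask keeps the existing fast path untouched (a sketch, not the PR's implementation):

import torch

def cross_entropy_with_optional_mask(
    logits: torch.Tensor,                   # (num_tokens, vocab_size)
    target: torch.Tensor,                   # (num_tokens,)
    loss_mask: torch.Tensor | None = None,  # bool, True where loss should be computed
    logits_scale_factor: float = 1.0,
) -> torch.Tensor:
    if logits_scale_factor != 1.0:
        logits = logits * logits_scale_factor
    if loss_mask is None:
        # Fast path: no masking overhead when spans are not in use.
        return torch.nn.functional.cross_entropy(logits, target)
    # torch.where maps masked positions to ignore_index, so they contribute
    # neither to the loss nor to the gradient.
    masked_target = torch.where(loss_mask, target, torch.full_like(target, -100))
    return torch.nn.functional.cross_entropy(logits, masked_target, ignore_index=-100)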

@@ -23,6 +26,19 @@
logger = logging.getLogger(__name__)


@dataclasses.dataclass
class GPTDataBatch:
Collaborator:

GPTBatch is enough

@@ -82,8 +83,8 @@ def get_test_data_and_samples(
batch_config.setup(distributed_config)
batch_config.validate()
samples = {
phase: [batch[0] for batch in data.get_iterator(batch_config, phase, consumed_samples=0, num_workers=0)]
for phase, samples in samples_per_phase.items()
phase: list(data.get_iterator(batch_config, phase, consumed_samples=consumed_samples, num_workers=0))
Collaborator:

This makes the existing tests too complicated. Instead, please test spans with a small number of separate tests specifically targeting them. (I'm not sure we need full coverage for all cases; you could make one complicated test case that indirectly tests many classes.)

@@ -36,6 +36,9 @@ def get_document_sizes(self) -> np.ndarray:
        # TODO: This can be really big.
        return self._dataset.get_document_sizes()[self._begin : self._end]

    def get_span_sizes(self) -> np.ndarray:
Collaborator:

Is this only used in the tests? If so, I'm not sure it's worth making it a public method at this stage.
(It would also need to be added to GPTIndexedDataset.)


Successfully merging this pull request may close these issues.

[feat] Implement Loss Masking to Exclude Predefined Token Spans from LM Loss