Inconsistent Token Alignment Between Fast and Slow Tokenizers in spacy-transformers #13730

Open · Rassibassi opened this issue Jan 15, 2025 · 4 comments

Description

When using spacy-transformers with various HuggingFace models, I've discovered inconsistencies in the token alignment data (doc._.trf_data.align) between fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations

  1. DeBERTa Models: The alignment IDs are being duplicated when using the fast tokenizer:

    • Fast tokenizer produces pairs of duplicate IDs: (1, 2, 2, 3, 3, 4, 4, ...)
    • Slow tokenizer produces sequential IDs: (1, 2, 3, 4, ...)
  2. RoBERTa Models: The alignment shows minor differences:

    • Both the align data and the align lengths differ between the fast/slow implementations; for example, the align lengths:
    • Fast tokenizer: (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
    • Slow tokenizer: (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)

Verification

  • The issue appears to be specific to spacy-transformers, as direct usage of HuggingFace transformers shows no such discrepancies
  • The differences affect both alignment data and lengths

Reproduction Steps

  1. Run the attached script which tests multiple models with both fast and slow tokenizer implementations
  2. Compare the alignment data and lengths between fast/slow tokenizer variants
  3. Note the systematic duplication in DeBERTa models and the alignment shifts in RoBERTa models

How to reproduce the behaviour

Run the following script:

import warnings

warnings.simplefilter("ignore")

import spacy
from rich import print
from transformers import AutoTokenizer

MODELS = [
    "distilroberta-base",
    # "roberta-base",
    # "intfloat/e5-small-v2",
    "BAAI/bge-small-en-v1.5",
    "microsoft/deberta-v3-xsmall",
    # "microsoft/deberta-v3-small",
    "microsoft/Multilingual-MiniLM-L12-H384",
    # "microsoft/deberta-v3-large",
]

model_max_length = 1024

text = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

for model_name in MODELS:
    wordpieces_strings = []
    wordpieces_input_ids = []
    wordpieces_attention_mask = []

    model_output_last_hidden_state = []
    align_data = []
    align_lengths = []

    print(f"[bold blue]Model: {model_name}[/bold blue]")
    # Build two otherwise identical pipelines, one with the fast and one with the slow tokenizer
    for use_fast in [True, False]:
        nlp = spacy.blank("en")
        config = {
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": model_name,
                "tokenizer_config": {
                    "use_fast": use_fast,
                    "model_max_length": model_max_length,
                },
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        }
        nlp.add_pipe("transformer", config=config)
        nlp.initialize()

        # tokenizer = nlp.get_pipe("transformer").model.tokenizer
        # print(f"[bold blue]Tokenizer: {type(tokenizer)}[/bold blue]")

        doc = nlp(text)

        wordpieces_strings.append(doc._.trf_data.wordpieces.strings[0])
        wordpieces_input_ids.append(
            tuple(doc._.trf_data.wordpieces.input_ids[0].tolist())
        )
        wordpieces_attention_mask.append(
            tuple(doc._.trf_data.wordpieces.attention_mask[0].tolist())
        )

        model_output_last_hidden_state.append(
            doc._.trf_data.model_output["last_hidden_state"].squeeze(0).shape
        )
        align_data.append(tuple(doc._.trf_data.align.data.flatten().tolist()))
        align_lengths.append(tuple(doc._.trf_data.align.lengths.tolist()))

    # Compare the fast (index 0) and slow (index 1) outputs
    if wordpieces_strings[0] != wordpieces_strings[1]:
        print("[red]Different wordpieces_strings[/red]")

    if wordpieces_input_ids[0] != wordpieces_input_ids[1]:
        print("[red]Different wordpieces_input_ids[/red]")

    if wordpieces_attention_mask[0] != wordpieces_attention_mask[1]:
        print("[red]Different wordpieces_attention_mask[/red]")

    if model_output_last_hidden_state[0] != model_output_last_hidden_state[1]:
        print("[red]Different model_output_last_hidden_state[/red]")

    if align_data[0] != align_data[1]:
        print(align_data[0])
        print(align_data[1])
        print("[red]Different align_data[/red]")

    if align_lengths[0] != align_lengths[1]:
        print(align_lengths[0])
        print(align_lengths[1])
        print("[red]Different align_lengths[/red]")

    print()

print("[bold purple]Pure huggingface transformers:[/bold purple]")
print()
for model_name in MODELS:
    print(f"[bold blue]Model: {model_name}[/bold blue]")
    inp = []
    att = []
    for use_fast in [True, False]:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            model_max_length=model_max_length,
            use_fast=use_fast,
        )

        inputs = tokenizer(text)

        input_ids = tuple(inputs["input_ids"])
        attention_mask = tuple(inputs["attention_mask"])

        inp.append(input_ids)
        att.append(attention_mask)

    if inp[0] != inp[1]:
        print("[red]Different input_ids[/red]")

    if att[0] != att[1]:
        print("[red]Different attention masks[/red]")

Output:

Model: distilroberta-base
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
Different align_data
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: BAAI/bge-small-en-v1.5

Model: microsoft/deberta-v3-xsmall
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
Different align_data
(2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: microsoft/Multilingual-MiniLM-L12-H384
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
Different align_data
(2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Different align_lengths

Pure huggingface transformers:

Model: distilroberta-base
Model: BAAI/bge-small-en-v1.5
Model: microsoft/deberta-v3-xsmall
Model: microsoft/Multilingual-MiniLM-L12-H384

Your Environment

  • Operating System: ubuntu 24
  • Python Version Used: 3.12.3
  • spaCy Version Used: 3.8.3
  • Environment Information:
uv pip list | grep "spacy"                  
spacy                                 3.8.3
spacy-alignments                      0.9.1
spacy-curated-transformers            0.3.0
spacy-legacy                          3.0.12
spacy-loggers                         1.0.5
spacy-lookups-data                    1.0.5
spacy-transformers                    1.3.5
spacy-utils                           0.1.0

uv pip list | grep "transformers"           
curated-transformers                  0.1.1
spacy-curated-transformers            0.3.0
spacy-transformers                    1.3.5
transformers                          4.36.2

Info about spaCy

  • spaCy version: 3.8.3
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Pipelines: en_core_web_trf (3.8.0), en_core_web_sm (3.8.0), en_core_web_lg (3.8.0), en_core_web_md (3.8.0)
DuyguA (Contributor) commented Jan 29, 2025

Actually, that's most probably related to behavior differences between the fast and slow tokenizers in the HuggingFace library itself, like these issues:

huggingface/transformers#29159
huggingface/transformers#32564

In my experience, fast tokenizers are better implemented and behave more correctly in most cases; the slow tokenizer implementations try to match the behavior of the fast ones.

So, in conclusion: if you observe such differences, I suggest raising the issue in the HF library 😉

Rassibassi (Author) commented Jan 29, 2025

Thank you for your response. However, I believe I may not have made my key finding clear enough in the original issue. While I agree that there have been historical differences between the fast and slow versions of HuggingFace tokenizers, I believe this issue is specifically related to spacy-transformers.

My reproduction script demonstrates this by comparing two scenarios:

  1. Using the tokenizers through spacy-transformers, where discrepancies appear in the alignment data
  2. Using the tokenizers directly through HuggingFace, where both versions produce identical outputs

Importantly, when using spacy-transformers, all tokenizer outputs (input_ids, attention_masks, etc.) are consistent between fast and slow versions. The discrepancy only appears in the align_data attribute, which is generated by spacy-transformers itself.

Given these observations, I suspect the issue lies in how spacy-transformers processes the tokenizer output to generate token alignments. I'll investigate this further by examining the alignment generation logic in spacy-transformers and report back.
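
To make this concrete, here is a condensed sketch of that comparison, using the same pipeline config and text as the reproduction script above and microsoft/deberta-v3-xsmall as one of the affected models:

import spacy

MODEL = "microsoft/deberta-v3-xsmall"
TEXT = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

def build_pipeline(use_fast: bool):
    # Same transformer config as in the reproduction script above, differing only in use_fast
    nlp = spacy.blank("en")
    nlp.add_pipe(
        "transformer",
        config={
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": MODEL,
                "tokenizer_config": {"use_fast": use_fast, "model_max_length": 1024},
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        },
    )
    nlp.initialize()
    return nlp

docs = [build_pipeline(use_fast)(TEXT) for use_fast in (True, False)]
input_ids = [tuple(d._.trf_data.wordpieces.input_ids[0].tolist()) for d in docs]
align_data = [tuple(d._.trf_data.align.data.flatten().tolist()) for d in docs]

print(input_ids[0] == input_ids[1])    # True: the tokenizer output itself matches
print(align_data[0] == align_data[1])  # False: only the alignment built by spacy-transformers differs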

DuyguA (Contributor) commented Jan 29, 2025

Oops, then yes, this is a real issue. I can examine it a bit too when I have time.

Rassibassi (Author) commented Jan 29, 2025

I've narrowed the issue down to how spacy-transformers handles different tokenizer implementations: the key difference lies in which alignment function gets called. The HuggingFace tokenizer is certainly not completely innocent here ;), but in any case our downstream trained classification head suddenly stopped working when switching between the slow and fast tokenizers.

It even seems that the fast tokenizer's alignment is the incorrect one (a direct side-by-side call of the two helpers is sketched at the end of this comment):

  1. With use_fast=True: the tokenizer provides offset_mapping, triggering get_alignment_via_offset_mapping()
  2. With use_fast=False: no offset mapping is provided, falling back to get_alignment()

Here's a minimal test script that, together with two breakpoints set inside spacy-transformers (shown further below), demonstrates the issue:

test-script.py

import spacy

model_name = "microsoft/deberta-v3-xsmall"
text = """Copenhagen is the capital and most populous city of Denmark, with a population of 1.4 million in the urban area."""

for use_fast in [True, False]:
    nlp = spacy.blank("en")
    config = {
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": model_name,
            "tokenizer_config": {
                "use_fast": use_fast,
                "model_max_length": 1024,
            },
            "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
        },
    }
    nlp.add_pipe("transformer", config=config)
    nlp.initialize()

    # Processing the text triggers the breakpoints set inside spacy-transformers (see below)
    doc = nlp(text)

Set two breakpoints (import pdb; pdb.set_trace()) in spacy-transformers, in spacy_transformers/layers/transformer_model.py [1], as shown below:

[1] https://github.com/explosion/spacy-transformers/blob/da1f682653285368189cd01cb982eb67e3310256/spacy_transformers/layers/transformer_model.py#L187

# spacy_transformers/layers/transformer_model.py
# [...]
    if "logger" in model.attrs:
        log_batch_size(model.attrs["logger"], wordpieces, is_train)
    
    if "offset_mapping" in batch_encoding:
        align = get_alignment_via_offset_mapping(
            flat_spans,
            batch_encoding["offset_mapping"],
        )
        import pdb; pdb.set_trace()
    else:
        align = get_alignment(
            flat_spans, wordpieces.strings, tokenizer.all_special_tokens
        )
        import pdb; pdb.set_trace()
    wordpieces, align = truncate_oversize_splits(
        wordpieces, align, tokenizer.model_max_length
    )
    model_output, bp_tensors = transformer(wordpieces, is_train)
# [...]

Output via debugger:

-> wordpieces, align = truncate_oversize_splits(
(Pdb) align
Ragged(
       data=array([[ 1], [ 2], [ 2], [ 3], [ 3], [ 4], [ 4], [ 5], [ 5], [ 6], [ 6], [ 7], [ 7], [ 8], [ 8], [ 9], [ 9],
       [10], [10], [11], [12], [12], [13], [13], [14], [14], [15], [15], [16], [16], [17], [18], [19], [19], [20], [20],
       [21], [21], [22], [22], [23], [23], [24]], dtype=int32),
       lengths=array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1], dtype=int32),
       data_shape=(-1,),
       starts_ends=None
)
(Pdb) c
-> wordpieces, align = truncate_oversize_splits(
(Pdb) align
Ragged(
       data=array([[ 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7], [ 8], [ 9], [10],
       [11], [12], [13], [14], [15], [16], [17], [18], [19], [20],
       [21], [22], [23], [24]], dtype=int32),
       lengths=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1], dtype=int32),
       data_shape=(-1,),
       starts_ends=None
)

Setup for reproduction

git clone https://github.com/explosion/spacy-transformers.git
cd spacy-transformers
git checkout tags/v1.3.6 -b investigate-alignment-1.3.6
uv venv
uv pip install -e ".[all]"
source .venv/bin/activate
which python
# [... write test script to file]
# [... set breakpoints in file as shown above ]
python test-script.py
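
For completeness, the two code paths can also be compared directly, without breakpoints. The following is only a sketch: it assumes that get_alignment and get_alignment_via_offset_mapping can be imported from spacy_transformers.align (they are the helpers called in transformer_model.py above), that get_alignment takes per-span lists of wordpiece strings plus the special tokens, and that get_alignment_via_offset_mapping accepts the offset_mapping as returned by the fast tokenizer, mirroring the call sites shown earlier:

import spacy
from transformers import AutoTokenizer

# Assumed import path for the two helpers called in transformer_model.py above
from spacy_transformers.align import get_alignment, get_alignment_via_offset_mapping

model_name = "microsoft/deberta-v3-xsmall"
text = "Copenhagen is the capital and most populous city of Denmark, with a population of 1.4 million in the urban area."

# One whole-doc span per doc, roughly what doc_spans.v1 produces
nlp = spacy.blank("en")
doc = nlp(text)
spans = [doc[:]]

fast_tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
slow_tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Fast path: alignment derived from the fast tokenizer's offset mapping
fast_enc = fast_tok([text], return_offsets_mapping=True)
align_fast = get_alignment_via_offset_mapping(spans, fast_enc["offset_mapping"])

# Slow path: alignment derived from the wordpiece strings and special tokens
slow_ids = slow_tok([text])["input_ids"]
slow_strings = [slow_tok.convert_ids_to_tokens(ids) for ids in slow_ids]
align_slow = get_alignment(spans, slow_strings, slow_tok.all_special_tokens)

# With deberta-v3, the fast path is where the duplicated alignment IDs show up
print(align_fast.data.flatten().tolist())
print(align_slow.data.flatten().tolist())
print(align_fast.lengths.tolist())
print(align_slow.lengths.tolist())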
