Inconsistent Token Alignment Between Fast and Slow Tokenizers in spacy-transformers #13730

Open · Rassibassi opened this issue Jan 15, 2025 · 4 comments

Description

When using spacy-transformers with various HuggingFace models, I've discovered inconsistencies in the token alignment data (doc._.trf_data.align) between fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations

  1. DeBERTa Models: The alignment IDs are being duplicated when using the fast tokenizer:

    • Fast tokenizer produces pairs of duplicate IDs: (1, 2, 2, 3, 3, 4, 4, ...)
    • Slow tokenizer produces sequential IDs: (1, 2, 3, 4, ...)
  2. RoBERTa Models: The alignment shows minor differences:

    • Both the align data and the align lengths differ between the fast/slow implementations; for example, the align lengths:
    • Fast tokenizer: (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
    • Slow tokenizer: (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)

Verification

  • The issue appears to be specific to spacy-transformers, as direct usage of HuggingFace transformers shows no such discrepancies
  • The differences affect both alignment data and lengths

Reproduction Steps

  1. Run the attached script which tests multiple models with both fast and slow tokenizer implementations
  2. Compare the alignment data and lengths between fast/slow tokenizer variants
  3. Note the systematic duplication in DeBERTa models and the alignment shifts in RoBERTa models

How to reproduce the behaviour

Run the following script:

import warnings

warnings.simplefilter("ignore")

import spacy
from rich import print
from transformers import AutoTokenizer

MODELS = [
    "distilroberta-base",
    # "roberta-base",
    # "intfloat/e5-small-v2",
    "BAAI/bge-small-en-v1.5",
    "microsoft/deberta-v3-xsmall",
    # "microsoft/deberta-v3-small",
    "microsoft/Multilingual-MiniLM-L12-H384",
    # "microsoft/deberta-v3-large",
]

model_max_length = 1024

text = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

for model_name in MODELS:
    wordpieces_strings = []
    wordpieces_input_ids = []
    wordpieces_attention_mask = []

    model_output_last_hidden_state = []
    align_data = []
    align_lengths = []

    print(f"[bold blue]Model: {model_name}[/bold blue]")
    # Build two otherwise identical pipelines, one with the fast and one with the slow tokenizer
    for use_fast in [True, False]:
        nlp = spacy.blank("en")
        config = {
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": model_name,
                "tokenizer_config": {
                    "use_fast": use_fast,
                    "model_max_length": model_max_length,
                },
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        }
        nlp.add_pipe("transformer", config=config)
        nlp.initialize()

        # tokenizer = nlp.get_pipe("transformer").model.tokenizer
        # print(f"[bold blue]Tokenizer: {type(tokenizer)}[/bold blue]")

        doc = nlp(text)

        wordpieces_strings.append(doc._.trf_data.wordpieces.strings[0])
        wordpieces_input_ids.append(
            tuple(doc._.trf_data.wordpieces.input_ids[0].tolist())
        )
        wordpieces_attention_mask.append(
            tuple(doc._.trf_data.wordpieces.attention_mask[0].tolist())
        )

        model_output_last_hidden_state.append(
            doc._.trf_data.model_output["last_hidden_state"].squeeze(0).shape
        )
        align_data.append(tuple(doc._.trf_data.align.data.flatten().tolist()))
        align_lengths.append(tuple(doc._.trf_data.align.lengths.tolist()))

    # Compare the fast (index 0) and slow (index 1) outputs
    if wordpieces_strings[0] != wordpieces_strings[1]:
        print("[red]Different wordpieces_strings[/red]")

    if wordpieces_input_ids[0] != wordpieces_input_ids[1]:
        print("[red]Different wordpieces_input_ids[/red]")

    if wordpieces_attention_mask[0] != wordpieces_attention_mask[1]:
        print("[red]Different wordpieces_attention_mask[/red]")

    if model_output_last_hidden_state[0] != model_output_last_hidden_state[1]:
        print("[red]Different model_output_last_hidden_state[/red]")

    if align_data[0] != align_data[1]:
        print(align_data[0])
        print(align_data[1])
        print("[red]Different align_data[/red]")

    if align_lengths[0] != align_lengths[1]:
        print(align_lengths[0])
        print(align_lengths[1])
        print("[red]Different align_lengths[/red]")

    print()

print("[bold purple]Pure huggingface transformers:[/bold purple]")
print()
for model_name in MODELS:
    print(f"[bold blue]Model: {model_name}[/bold blue]")
    inp = []
    att = []
    for use_fast in [True, False]:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            model_max_length=model_max_length,
            use_fast=use_fast,
        )

        inputs = tokenizer(text)

        input_ids = tuple(inputs["input_ids"])
        attention_mask = tuple(inputs["attention_mask"])

        inp.append(input_ids)
        att.append(attention_mask)

    if inp[0] != inp[1]:
        print("[red]Different input_ids[/red]")

    if att[0] != att[1]:
        print("[red]Different attention masks[/red]")

Output:

Model: distilroberta-base
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28)
Different align_data
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
(4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: BAAI/bge-small-en-v1.5

Model: microsoft/deberta-v3-xsmall
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
Different align_data
(2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1)
Different align_lengths

Model: microsoft/Multilingual-MiniLM-L12-H384
(1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 9, 10, 10, 11, 11, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 20, 21, 21, 22, 22, 23)
(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
Different align_data
(2, 2, 2, 2, 2, 2, 3, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1)
(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
Different align_lengths

Pure huggingface transformers:

Model: distilroberta-base
Model: BAAI/bge-small-en-v1.5
Model: microsoft/deberta-v3-xsmall
Model: microsoft/Multilingual-MiniLM-L12-H384

Your Environment

  • Operating System: ubuntu 24
  • Python Version Used: 3.12.3
  • spaCy Version Used: 3.8.3
  • Environment Information:
uv pip list | grep "spacy"                  
spacy                                 3.8.3
spacy-alignments                      0.9.1
spacy-curated-transformers            0.3.0
spacy-legacy                          3.0.12
spacy-loggers                         1.0.5
spacy-lookups-data                    1.0.5
spacy-transformers                    1.3.5
spacy-utils                           0.1.0

uv pip list | grep "transformers"           
curated-transformers                  0.1.1
spacy-curated-transformers            0.3.0
spacy-transformers                    1.3.5
transformers                          4.36.2

Info about spaCy

  • spaCy version: 3.8.3
  • Platform: Linux-6.8.0-51-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Pipelines: en_core_web_trf (3.8.0), en_core_web_sm (3.8.0), en_core_web_lg (3.8.0), en_core_web_md (3.8.0)
DuyguA (Contributor) commented Jan 29, 2025

Actually, that's most probably related to behavior differences between the fast and slow tokenizers in the HuggingFace library itself, like these issues:

huggingface/transformers#29159
huggingface/transformers#32564

In my experience, fast tokenizers are better implemented and behave more correctly in most cases; the slow tokenizer implementations try to match the behavior of the fast ones.

So, in conclusion: if you observe such differences, I suggest raising the issue in the HF library 😉

Rassibassi (Author) commented Jan 29, 2025

Thank you for your response. However, I believe I may not have made my key finding clear enough in the original issue. While I agree that there have been historical differences between the fast and slow versions of HuggingFace tokenizers, I believe this issue is specifically related to spacy-transformers.

My reproduction script demonstrates this by comparing two scenarios:

  1. Using the tokenizers through spacy-transformers, where discrepancies appear in the alignment data
  2. Using the tokenizers directly through HuggingFace, where both versions produce identical outputs

Importantly, when using spacy-transformers, all tokenizer outputs (input_ids, attention_masks, etc.) are consistent between fast and slow versions. The discrepancy only appears in the align_data attribute, which is generated by spacy-transformers itself.

Given these observations, I suspect the issue lies in how spacy-transformers processes the tokenizer output to generate token alignments. I'll investigate this further by examining the alignment generation logic in spacy-transformers and report back.
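
To make this concrete, here is a condensed sketch of that comparison, using the same pipeline config and text as the reproduction script above and microsoft/deberta-v3-xsmall as one of the affected models:

import spacy

MODEL = "microsoft/deberta-v3-xsmall"
TEXT = """Copenhagen is the capital and most populous city of Denmark,
with a population of 1.4 million in the urban area."""

def build_pipeline(use_fast: bool):
    # Same transformer config as in the reproduction script above, differing only in use_fast
    nlp = spacy.blank("en")
    nlp.add_pipe(
        "transformer",
        config={
            "model": {
                "@architectures": "spacy-transformers.TransformerModel.v3",
                "name": MODEL,
                "tokenizer_config": {"use_fast": use_fast, "model_max_length": 1024},
                "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
            },
        },
    )
    nlp.initialize()
    return nlp

docs = [build_pipeline(use_fast)(TEXT) for use_fast in (True, False)]
input_ids = [tuple(d._.trf_data.wordpieces.input_ids[0].tolist()) for d in docs]
align_data = [tuple(d._.trf_data.align.data.flatten().tolist()) for d in docs]

print(input_ids[0] == input_ids[1])    # True: the tokenizer output itself matches
print(align_data[0] == align_data[1])  # False: only the alignment built by spacy-transformers differs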

DuyguA (Contributor) commented Jan 29, 2025

Oops, then yes, this is a real issue. I can examine it a bit too when I have time.

Rassibassi (Author) commented Jan 29, 2025

I've narrowed the issue down to how spacy-transformers handles different tokenizer implementations: the key difference lies in which alignment function gets called. The HuggingFace tokenizer is certainly not completely innocent here ;), but in any case our downstream trained classification head suddenly stopped working when switching between the slow and fast tokenizers.

It even seems that the fast tokenizer's alignment is the incorrect one (a direct side-by-side call of the two helpers is sketched at the end of this comment):

  1. With use_fast=True: the tokenizer provides offset_mapping, triggering get_alignment_via_offset_mapping()
  2. With use_fast=False: no offset mapping is provided, falling back to get_alignment()

Here's a minimal test script that, together with two breakpoints set inside spacy-transformers (shown further below), demonstrates the issue:

test-script.py

import spacy

model_name = "microsoft/deberta-v3-xsmall"
text = """Copenhagen is the capital and most populous city of Denmark, with a population of 1.4 million in the urban area."""

for use_fast in [True, False]:
    nlp = spacy.blank("en")
    config = {
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": model_name,
            "tokenizer_config": {
                "use_fast": use_fast,
                "model_max_length": 1024,
            },
            "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
        },
    }
    nlp.add_pipe("transformer", config=config)
    nlp.initialize()

    # Processing the text triggers the breakpoints set inside spacy-transformers (see below)
    doc = nlp(text)

Set two breakpoints (import pdb; pdb.set_trace()) in spacy-transformers, in spacy_transformers/layers/transformer_model.py [1], as shown below:

[1] https://github.com/explosion/spacy-transformers/blob/da1f682653285368189cd01cb982eb67e3310256/spacy_transformers/layers/transformer_model.py#L187

# spacy_transformers/layers/transformer_model.py
# [...]
    if "logger" in model.attrs:
        log_batch_size(model.attrs["logger"], wordpieces, is_train)
    
    if "offset_mapping" in batch_encoding:
        align = get_alignment_via_offset_mapping(
            flat_spans,
            batch_encoding["offset_mapping"],
        )
        import pdb; pdb.set_trace()
    else:
        align = get_alignment(
            flat_spans, wordpieces.strings, tokenizer.all_special_tokens
        )
        import pdb; pdb.set_trace()
    wordpieces, align = truncate_oversize_splits(
        wordpieces, align, tokenizer.model_max_length
    )
    model_output, bp_tensors = transformer(wordpieces, is_train)
# [...]

Output via debugger:

-> wordpieces, align = truncate_oversize_splits(
(Pdb) align
Ragged(
       data=array([[ 1], [ 2], [ 2], [ 3], [ 3], [ 4], [ 4], [ 5], [ 5], [ 6], [ 6], [ 7], [ 7], [ 8], [ 8], [ 9], [ 9],
       [10], [10], [11], [12], [12], [13], [13], [14], [14], [15], [15], [16], [16], [17], [18], [19], [19], [20], [20],
       [21], [21], [22], [22], [23], [23], [24]], dtype=int32),
       lengths=array([2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 1, 1], dtype=int32),
       data_shape=(-1,),
       starts_ends=None
)
(Pdb) c
-> wordpieces, align = truncate_oversize_splits(
(Pdb) align
Ragged(
       data=array([[ 1], [ 2], [ 3], [ 4], [ 5], [ 6], [ 7], [ 8], [ 9], [10],
       [11], [12], [13], [14], [15], [16], [17], [18], [19], [20],
       [21], [22], [23], [24]], dtype=int32),
       lengths=array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1], dtype=int32),
       data_shape=(-1,),
       starts_ends=None
)

Setup for reproduction

git clone https://github.com/explosion/spacy-transformers.git
cd spacy-transformers
git checkout tags/v1.3.6 -b investigate-alignment-1.3.6
uv venv
uv pip install -e ".[all]"
source .venv/bin/activate
which python
# [... write test script to file]
# [... set breakpoints in file as shown above ]
python test-script.py
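
For completeness, the two code paths can also be compared directly, without breakpoints. The following is only a sketch: it assumes that get_alignment and get_alignment_via_offset_mapping can be imported from spacy_transformers.align (they are the helpers called in transformer_model.py above), that get_alignment takes per-span lists of wordpiece strings plus the special tokens, and that get_alignment_via_offset_mapping accepts the offset_mapping as returned by the fast tokenizer, mirroring the call sites shown earlier:

import spacy
from transformers import AutoTokenizer

# Assumed import path for the two helpers called in transformer_model.py above
from spacy_transformers.align import get_alignment, get_alignment_via_offset_mapping

model_name = "microsoft/deberta-v3-xsmall"
text = "Copenhagen is the capital and most populous city of Denmark, with a population of 1.4 million in the urban area."

# One whole-doc span per doc, roughly what doc_spans.v1 produces
nlp = spacy.blank("en")
doc = nlp(text)
spans = [doc[:]]

fast_tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
slow_tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# Fast path: alignment derived from the fast tokenizer's offset mapping
fast_enc = fast_tok([text], return_offsets_mapping=True)
align_fast = get_alignment_via_offset_mapping(spans, fast_enc["offset_mapping"])

# Slow path: alignment derived from the wordpiece strings and special tokens
slow_ids = slow_tok([text])["input_ids"]
slow_strings = [slow_tok.convert_ids_to_tokens(ids) for ids in slow_ids]
align_slow = get_alignment(spans, slow_strings, slow_tok.all_special_tokens)

# With deberta-v3, the fast path is where the duplicated alignment IDs show up
print(align_fast.data.flatten().tolist())
print(align_slow.data.flatten().tolist())
print(align_fast.lengths.tolist())
print(align_slow.lengths.tolist())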
