Inconsistent Token Alignment Between Fast and Slow Tokenizers in spacy-transformers #13730
Comments
Actually, that's most probably related to behavior differences between the fast and slow tokenizers of the HuggingFace library itself, like this issue: huggingface/transformers#29159. In my experience, fast tokenizers are better implemented and behave more consistently in most cases; the slow tokenizer implementation tries to match the behavior of the fast one. So, final conclusion: if you observe such differences, I suggest raising the issue in the HF library 😉
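For reference, whether the divergence originates in HuggingFace itself can be checked by comparing the two tokenizers directly, outside of spacy-transformers. A minimal sketch (model name and text taken from the reproduction script below):

```python
from transformers import AutoTokenizer

model_name = "microsoft/deberta-v3-xsmall"
text = "Copenhagen is the capital and most populous city of Denmark."

# Load both implementations of the same tokenizer
fast = AutoTokenizer.from_pretrained(model_name, use_fast=True)
slow = AutoTokenizer.from_pretrained(model_name, use_fast=False)

# If the wordpieces and input IDs match, the HF tokenizers agree and any
# divergence must be introduced later, in the alignment computation.
print(fast.tokenize(text) == slow.tokenize(text))
print(fast(text)["input_ids"] == slow(text)["input_ids"])
```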
Thank you for your response. However, I believe I may not have made my key finding clear enough in the original issue. While I agree that there have been historical differences between the fast and slow tokenizers of the HuggingFace library itself, my reproduction script demonstrates the problem by comparing two scenarios: running the same pipeline with the fast tokenizer (`use_fast=True`) and with the slow tokenizer (`use_fast=False`).
Importantly, when using the slow tokenizer the alignment comes out as expected, whereas the fast tokenizer produces the duplicated alignment IDs shown in the observations below. Given these observations, I suspect the issue lies in how spacy-transformers computes the alignment from the fast tokenizer's offset mapping, rather than in the HuggingFace tokenizers themselves.
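Only fast tokenizers return character offsets, which is why spacy-transformers takes the offset-mapping alignment path for them (see the code excerpt below). Inspecting those offsets directly shows what the alignment code has to work with; a sketch, using the same model as in the script below:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/deberta-v3-xsmall", use_fast=True
)
enc = tokenizer("Copenhagen is the capital", return_offsets_mapping=True)

# Each tuple is the (start, end) character span of one wordpiece; spans that
# overlap neighbouring tokens would explain duplicated alignment IDs.
for token, span in zip(enc.tokens(), enc["offset_mapping"]):
    print(token, span)
```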
Oops, then yes, this is a real issue. I can examine it a bit too when I have time.
I've narrowed down the issue to how `get_alignment_via_offset_mapping` computes the alignment for fast tokenizers. It even seems that the fast tokenizer alignment is the incorrect one.
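To illustrate the suspected failure mode, here is a hypothetical, much-simplified version of an offset-mapping alignment (not the actual spacy-transformers code): every wordpiece whose character span overlaps a token's span gets assigned to that token, so offsets that bleed across a token boundary yield the same wordpiece index under two tokens, i.e. the duplicated IDs:

```python
from typing import List, Tuple

def toy_align(
    token_spans: List[Tuple[int, int]],  # character spans of spaCy tokens
    wp_offsets: List[Tuple[int, int]],   # character spans of wordpieces
) -> List[List[int]]:
    """Assign every overlapping wordpiece to each token."""
    return [
        [i for i, (ws, we) in enumerate(wp_offsets) if ws < te and we > ts]
        for ts, te in token_spans
    ]

# Synthetic spans: the wordpiece offsets straddle the token boundary at 5,
# so both tokens claim both wordpieces -> duplicated indices.
print(toy_align([(0, 5), (5, 8)], [(0, 6), (4, 8)]))  # [[0, 1], [0, 1]]
```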
Here's a minimal reproduction script that demonstrates the issue. Run the test script below with two breakpoints set in `spacy_transformers/layers/transformer_model.py` as shown:
```python
import spacy

model_name = "microsoft/deberta-v3-xsmall"
text = """Copenhagen is the capital and most populous city of Denmark, with a population of 1.4 million in the urban area."""

for use_fast in [True, False]:
    nlp = spacy.blank("en")
    config = {
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": model_name,
            "tokenizer_config": {
                "use_fast": use_fast,
                "model_max_length": 1024,
            },
            "get_spans": {"@span_getters": "spacy-transformers.doc_spans.v1"},
        },
    }
    nlp.add_pipe("transformer", config=config)
    nlp.initialize()
    # The breakpoints below fire while this call runs the transformer pipe
    doc = nlp(text)
```

The two breakpoints in `spacy_transformers/layers/transformer_model.py`:
```python
# [...]
if "logger" in model.attrs:
    log_batch_size(model.attrs["logger"], wordpieces, is_train)
if "offset_mapping" in batch_encoding:
    align = get_alignment_via_offset_mapping(
        flat_spans,
        batch_encoding["offset_mapping"],
    )
    import pdb; pdb.set_trace()  # breakpoint 1: fast tokenizer path
else:
    align = get_alignment(
        flat_spans, wordpieces.strings, tokenizer.all_special_tokens
    )
    import pdb; pdb.set_trace()  # breakpoint 2: slow tokenizer path
wordpieces, align = truncate_oversize_splits(
    wordpieces, align, tokenizer.model_max_length
)
model_output, bp_tensors = transformer(wordpieces, is_train)
# [...]
```

Output via debugger:
Description
When using `spacy-transformers` with various HuggingFace models, I've discovered inconsistencies in the token alignment data (`doc._.trf_data.align`) between fast and slow tokenizer implementations. This issue particularly affects DeBERTa models and, to a lesser extent, RoBERTa-based models.

Key Observations
DeBERTa Models: The alignment IDs are being duplicated when using the fast tokenizer:

- Fast: (1, 2, 2, 3, 3, 4, 4, ...)
- Slow: (1, 2, 3, 4, ...)
RoBERTa Models: Shows minor differences in `align_data` between the fast and slow implementations:

- (4, 1, 1, 1, ..., 1, 1, 1, 1, 3, ...)
- (4, 1, 1, 1, ..., 1, 0, 1, 1, 3, ...)
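The observation can also be checked programmatically rather than in the debugger. A sketch building on the reproduction script above, where `build_pipeline(use_fast)` is a hypothetical helper that builds the pipeline with the config shown earlier, and `doc._.trf_data.align` is the alignment `Ragged` that spacy-transformers stores on the `Doc`:

```python
import numpy

aligns = {}
for use_fast in (True, False):
    nlp = build_pipeline(use_fast)  # hypothetical helper, see config above
    doc = nlp(text)
    # Flatten the per-token wordpiece indices for comparison
    aligns[use_fast] = doc._.trf_data.align.data.ravel()

# Per the observations above, this prints False for DeBERTa: the fast
# variant contains duplicated wordpiece indices.
print(numpy.array_equal(aligns[True], aligns[False]))
```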