Handling preprocessing with token-classification pipeline #15785
Hi @tkon3,

Normally this should be taken care of directly by the pre_tokenizer. Doing something like:

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Punctuation, WhitespaceSplit, Metaspace

MODEL_NAME = "camembert-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer._tokenizer.pre_tokenizer = pre_tokenizers.Sequence([WhitespaceSplit(), Punctuation(), Metaspace()])

text = "Ceci est un exemple n'est-ce pas ?"
print(tokenizer.convert_ids_to_tokens(tokenizer(text).input_ids))
print(tokenizer(text).tokens())
# ['<s>', '▁Ceci', '▁est', '▁un', '▁exemple', '▁n', '▁', "'", '▁est', '▁-', '▁ce', '▁pas', '▁?', '</s>']
# ['<s>', '▁Ceci', '▁est', '▁un', '▁exemple', '▁n', '▁', "'", '▁est', '▁-', '▁ce', '▁pas', '▁?', '</s>']

tokenizer.save_pretrained("camembert-punctuation")
```

should be enough so that no one forgets how the input was preprocessed at training time, and it makes the pipelines work correctly with offsets. Does that work?

There's no real other way to handle custom pre-tokenization in pipelines; it needs to be included somehow directly in the tokenizer's methods.
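A small follow-up sketch (assumptions: the "camembert-punctuation" directory saved above exists, and the commented-out checkpoint name is made up): fast tokenizers serialize the pre-tokenizer in tokenizer.json, so reloading the directory should give a pipeline the same pre-tokenization that was used at training time.

```python
# Sketch: round-trip the saved tokenizer and check the pre-tokenization sticks.
from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained("camembert-punctuation")
text = "Ceci est un exemple n'est-ce pas ?"
print(reloaded(text).tokens())  # expected to match the output above

# Hypothetical pipeline usage with a fine-tuned checkpoint trained on the
# same pre-tokenization (the model name below is made up):
# from transformers import pipeline
# pipe = pipeline("token-classification", model="my-camembert-ner",
#                 tokenizer=reloaded, aggregation_strategy="simple")
```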
Hi @Narsil, thank you for the quick answer, your idea works.

To get the correct word, I have to manually extract it from the input sentence. Not a big deal in my case, but it could be a problem for someone else.
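A minimal sketch of that manual extraction (illustrative only; it assumes the start/end character offsets that the token-classification pipeline returns):

```python
# Illustrative helper: recover the untouched surface form of each entity
# by slicing the original input with the pipeline's character offsets.
def raw_words(text, entities):
    return [text[e["start"]:e["end"]] for e in entities]
```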
Well, "words" don't really exist for this particular brand of tokenizer (Unigram with WhitespaceSplit). If I am not mistaken. Extracting from the original string like you did is the only way to do it consistently. Do you mind creating a new issue for this, since it seems like a new problem ? I do think what you're implying is correct that we should have |
You are right, this is more general than what I expected. It is similar with uncased tokenizers:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

MODEL_NAME = "elastic/distilbert-base-uncased-finetuned-conll03-english"
text = "My name is Clara and I live in Berkeley, California."

model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
entities = pipe(text)
print([(text[entity["start"]:entity["end"]], entity["word"]) for entity in entities])
# [('Clara', 'clara'), ('Berkeley', 'berkeley'), ('California', 'california')]
```

Adding a new key is probably the easiest way:

```python
grouped_entities = self.aggregate(pre_entities, aggregation_strategy)
if some_postprocess_param:
    grouped_entities = [{"raw_word": sentence[e["start"]:e["end"]], **e} for e in grouped_entities]
```

Want me to do a feature request?
I think we should move to another issue/feature request for this, yes (the discussion has diverged since the first message). Can you ping me and Lysandre on it, and refer to this original issue for people wanting more context?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi,

Token classification tasks (e.g. NER) usually rely on pre-split inputs (a list of words). The tokenizer is then used with the is_split_into_words=True argument during training.
However, the token-classification pipeline does not handle this preprocessing and tokenizes the raw input.
This can lead to different predictions when custom preprocessing is used, because the tokens are different.
How can you make reliable predictions unless you know exactly how the input was preprocessed?
The pre-tokenizer and the tokenizer's offset mappings have to be merged somehow.
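For illustration, a hedged sketch of the mismatch using camembert-base; the words list below is an assumption about what a custom training-time split (whitespace plus punctuation) could look like:

```python
# Sketch: raw-text tokenization (what the pipeline does) vs. pre-split input
# with is_split_into_words=True (what training used).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")

text = "Ceci est un exemple n'est-ce pas ?"

# Pipeline-style: the raw string goes straight through the tokenizer's
# own pre-tokenizer.
print(tokenizer(text).tokens())

# Training-style: the input was already split by a custom rule before encoding.
words = ["Ceci", "est", "un", "exemple", "n", "'", "est", "-", "ce", "pas", "?"]
print(tokenizer(words, is_split_into_words=True).tokens())

# When the custom splitting differs from the tokenizer's own pre-tokenization,
# the two token sequences (and hence the predictions) can differ.
```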