Tokenizer WPM fixes for bert-bge and jina-v2-en #7500
Conversation
bert-bge is failing with a simple "Hello world".
It seems that WPM does not follow this assumption:
@ggerganov, Can't we just trust `vocab.id_to_token[id].type`? There are a few more modifications to make, but this is the main problem.
@jaime-m-p Can you try out applying the GPT-2 pre-tokenizer instead of the "default" value that's used on llama.cpp#L12360? I think that might be part of the issue. It's a symptom I'm working on at the moment.
The usual value for
Do you mean apply the GPT-2 regex split first, then the WPM processing? Actually, WPM is not using any explicit regex split. I'm taking a close look at your PR.
Jina uses the BERT Pre-tokenizer. The BERT Pre-tokenizer inherits from the ByteLevel Pre-tokenizer. The ByteLevel Pre-tokenizer defaults to the GPT-2 Pre-tokenizer. The Pre-tokenizer is responsible for partitioning its normalized substrings. If the substrings are split the right way, it should resolve the encoding issue. I haven't tested this myself yet; it's just a hypothesis based on my research so far.
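As a quick way to check this hypothesis, here is a minimal sketch that prints the substrings the fast tokenizer's pre-tokenizer produces, i.e. the partition a matching WPM implementation would have to reproduce. It assumes the Hugging Face `tokenizers` package and `jinaai/jina-embeddings-v2-base-en` as the jina-v2-en checkpoint; both are my assumptions, not something pinned down in this thread.

```python
# Minimal sketch: inspect the pre-tokenizer split of a HF fast tokenizer.
# Assumption: jina-v2-en corresponds to this HF repo id.
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("jinaai/jina-embeddings-v2-base-en")

text = "Hello world"
# pre_tokenize_str() returns (substring, offsets) pairs produced before the
# WordPiece model runs; this is the split WPM has to reproduce.
print(tok.pre_tokenizer.pre_tokenize_str(text))
# e.g. [('Hello', (0, 5)), ('world', (6, 11))]
```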
This sounds good. Whatever you think is best.
Yes, I am. I'm still working on this part, though.
Awesome! I really appreciate that. I'm not as familiar with C++ as I am with C and Python.
@teleprint-me I tested the default regex vs the GPT-2 regex. The jina-* models are already using the GPT-2 regex. I forced the default regex for comparison.
Hm. Interesting. 🤔 The original expression used for GPT-2 is the following: `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+` (https://github.com/openai/gpt-2/blob/master/src/encoder.py#L53). The others should be omitted completely from pre-tok for testing purposes. This way, we at least have a control group.
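For a quick control on the regex side, here is a minimal sketch that applies the GPT-2 pattern quoted above to a few inputs. It uses the third-party `regex` module, since the stdlib `re` does not support `\p{L}`/`\p{N}`; the sample strings are just illustrative.

```python
# Minimal sketch: split text with the original GPT-2 pre-tokenization pattern.
import regex

GPT2_PAT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

for text in ["Hello world", "Hello  world!", "año 2024\n"]:
    # findall() returns the matched pieces in order, i.e. the pre-token split.
    print(repr(text), "->", GPT2_PAT.findall(text))
# 'Hello world' -> ['Hello', ' world']
```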
Fix unicode edge case combinations. Split by whitespace in the same pass.
e92c3f8 to f3f6c0a
@ggerganov, Can't we just trust `vocab.id_to_token[id].type`?
I think so
Modifications to make WPM tokenizer match AutoTokenizer.
Tested with vocabs from models bert-bge and jina-v2-en.
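For reference, a minimal sketch of how the target outputs could be generated, assuming the `transformers` package; the repo ids below are my guess at which checkpoints "bert-bge" and "jina-v2-en" refer to.

```python
# Minimal sketch: produce AutoTokenizer reference ids that the llama.cpp WPM
# tokenizer is expected to match. The HF repo ids are assumptions.
from transformers import AutoTokenizer

models = {
    "bert-bge": "BAAI/bge-small-en-v1.5",
    "jina-v2-en": "jinaai/jina-embeddings-v2-base-en",
}

tests = ["Hello world", " Hello  World!!", "\t\n hola señor 3.14"]

for name, repo in models.items():
    tok = AutoTokenizer.from_pretrained(repo)
    for text in tests:
        print(name, repr(text), tok.encode(text, add_special_tokens=False))
```

The llama.cpp tokenizer output for the same strings and converted vocabs would then be compared against these ids.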