-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work on the BPE tokenizer #3252
Conversation
Tokenizer tests work for Falcon-7B
FYI, the MSVC build failed. |
That cuda CI is still slow because it is still fetching full cuda installer from GitHub cache, Green-Sky said he's gonna try clearing the cache sometime in the future :) You can bypass that, if you have access to a windows machine, from your GitHub account settings/actions/runners get a self hosted runner script, and after you run it, give it a "windows-latest" tag, then, if you run the CI manually from your own repo ( fork ) it will run on your machine, without cache, and that pipeline takes ~4min. |
@staviq i went ahead and manually purged the caches anyway, if they still are created, i will have to take a look at the ci script again. |
Looks like correct ones just got cached. |
I tried testing with https://huggingface.co/beomi/llama-2-ko-7b Unfortunately, we immediately die at Lines 1875 to 1880 in 8781013
If I hard code line 1879 to just set it to 13 (which seems to be correct for the model) it runs into another byte not found. There's a very good chance the model vocab isn't getting converted correctly. That model also requires manually generating the
I also tried https://huggingface.co/BAAI/Aquila-7B Basically the same issues. Also, as far as I can see, this one actually doesn't even have a newline in its vocabulary. It has a less LLaMA-like vocab. The byte stuff at the beginning doesn't cover the whole ASCII range like Falcon though. I remember screwing around with this and a hacked version of the Falcon UTF generator stuff in its conversion script. I compared the with I hate to say it, but at least in terms of how close it visibly seems to working, this pull seems further away than |
@KerfuffleV2 : thanks for taking the time!
Did you use an old conversion or a conversion built by
OK, I have this laying around from earlier tests. I will check this out next. |
I converted using edit:
I'm not completely sure what you mean here. I converted from HuggingFace model so that would have the original vocabulary, right? Or did you mean something different? |
OK, but aren't they using 'sentencepiece' tokenization then? Or are we testing the |
From "type": "BPE", So they're at least claiming to be BPE? Aquila also has By the way, when I converted using
I don't think so. (I'm far from an expert on this stuff though so what I think might not be useful.) |
Wow, LLaMa based models with a different tokenization algorithm? Very interesting, I have no idea how this works.
Yes, but
It is very helpful, thanks! Edit: Ah, I think I understand: they are only using the LlaMa transformer architecture. |
Join the club. :) Aquila vocab looks closer to stuff like Falcon than the KR model which seemed to mostly be the same as normal LLaMA except for extra tokens. Maybe it's not actually supposed to be BPE but I think I tried to converting with SPM and it didn't really work either.
I wasn't really happy with #2938 but I figured it was at least better than nothing. At that point, I wasn't sure how to improve it and from GG said it seemed like the C++ side wasn't there yet so it wouldn't matter even if the conversion was perfect. If you looked at the vocab for Aquila, does it seem the Falcon vocab fix would solve the issue? One thing I noticed is that byte vocab at the beginning didn't cover the range |
Regarding #2938: scores are used by
Edit: I was mistaken : computing the ranks seems to be OK in |
@klosax : do these fixes to GPT2 tokenization help with other open issues? |
Is there anything you want me to test again at this point? |
... because everyone is using it.
I just run inference with an Edit: Falcon looks fine too me. |
This PR explicitly writes every token type as 'normal', my PR introduced a default token type which is 'normal'. There is no difference in behavior. |
Added tokens get type CONTROL. |
Point taken. I can't check it currently, but we also tested Aquila. Your comment indicates a possible other problem. |
Currently this is handled by gguf.SpecialVocab in the simple scripts, and it's left to llama.cpp to handle tokenizer.ggml.bos_token_id/tokenizer.ggml.eos_token_id/tokenizer.ggml.unknown_token_id correctly. |
Here are the
which all look like CONTROL tokens that should be filtered out. Edit: added the |
You're still talking about convert.py, I'm still not... I guess I don't understand the point you're trying to make. What should I be paying attention to here? |
|
||
// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n' | ||
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) { | ||
vocab.linefeed_id = llama_byte_to_token(vocab, '\n'); | ||
} else { | ||
vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0]; | ||
vocab.linefeed_id = llama_tokenize_internal(vocab, "\u010A", false)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed it, unicode linefeed is \u000A
Is that intentional ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
…example * 'master' of github.com:ggerganov/llama.cpp: (24 commits) convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299) readme : add project status link ggml : fix build after ggerganov#3329 llm : add Refact model (ggerganov#3329) sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468) finetune : readme fix typo (ggerganov#3465) ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453) main : consistent prefix/suffix coloring (ggerganov#3425) llama : fix session saving/loading (ggerganov#3400) llama : expose model's rope_freq_scale in the API (ggerganov#3418) metal : alibi for arbitrary number of heads (ggerganov#3426) cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273) Work on the BPE tokenizer (ggerganov#3252) convert : fix vocab size when not defined in hparams (ggerganov#3421) cmake : increase minimum version for add_link_options (ggerganov#3444) CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402) gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408) gguf : general usability improvements (ggerganov#3409) cmake : make CUDA flags more similar to the Makefile (ggerganov#3420) finetune : fix ggerganov#3404 (ggerganov#3437) ...
This reverts commit ff5a3f0.
|
No, but I have never used Aquila or looked into its conversion code so I am not familiar with how its vocabulary is processed. If you think there is something the simple scripts are missing that would ensure correct functionality of the tokenizer, you'll have to clearly explain it. |
* Work on the BPE tokenizer Tokenizer tests work for Falcon-7B * Try to fix build problem * Fix debug assertion failure * Fix MSVC Unicode BOM problem * Cleanup and an improvement * Fix compiler warning * Cleanup * Test doesn't work over the full range of Unicodes * Update .gitignore and Makefile * Another Makefile rule * Testing Aquila * Moving byte decoding back to `token_to_piece` ... ... because everyone is using it. * Guarding some unusable code pathes * Streamlining code and adding some more assertions Important change: I'm classifying added tokens as control tokens now for BPE. * Adding a comment * Adding another assertion * Fixed vocabulary guarding assertions * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fix PR for recent change * Fix PR for recent change * Fix PR for recent change * Fix for compiler warning * Fixes for more compiler warnings * Remove unused code * Fix initialization of static maps * Add scores and token types back, adapt gptneox * Update llama.cpp Co-authored-by: Georgi Gerganov <[email protected]> * Update unicode.h Co-authored-by: Georgi Gerganov <[email protected]> * Update unicode.h Co-authored-by: Georgi Gerganov <[email protected]> * Ported Starcoder and added some assertions * Fix coding style * Apply @jploski 's fix for missing tokens --------- Co-authored-by: Georgi Gerganov <[email protected]>
There is a whole collection of .gguf falcon models derived from the "original" falcon 7b and 40b models at https://huggingface.co/maddes8cht |
I am adding it back in #3680. Although it really wouldn't be hard for the model authors to update their config.json files, as the model weights haven't changed. |
Ah, thanks, that is good news! |
A large bunch of falcon Models
|
It seems like upstream isn't interested in this change for the time being [1], and we are going to break compatiblity with Nomic's previous conversion of MPT because of changes to the BPE tokenizer [2], so let's remove this change to minimize the diff. This reverts commit 69c505e. [1] ggerganov#3626 [2] ggerganov#3252
It seems like upstream isn't interested in this change for the time being [1], and we are going to break compatiblity with Nomic's previous conversion of MPT because of changes to the BPE tokenizer [2], so let's remove this change to minimize the diff. This reverts commit 69c505e. [1] ggerganov#3626 [2] ggerganov#3252
examples/gptneox-wip
migratedTokenizer tests seem to work.
Fixes #3233.