-
Notifications
You must be signed in to change notification settings - Fork 10.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Differences with the llama tokenizer #167
Comments
Tokenizer recent fixes are here: Maybe its possible that it does some other preprocess like lowercase / uppercaseisms? |
Same result using the current master and reconverting the model. More interestingly, the llama tokenizer seems to produce different results for single tokens than from groups of tokens. For example:
Note that when decoding token by token the space is absent. This is the code I am using to test the llama tokenizer if you want to try it: https://gist.github.com/slaren/9f26fc4cb24685d42601b1d91d70a13a It seems that re-implementing the SentencePiece tokenizer is not going to be entirely trivial and may be a good idea to use their library after all. |
https://guillaume-be.github.io/2020-05-30/sentence_piece seems to document the SentencePiece algorithm fairly well |
I have sentencepiece integrated in c++, the branch was a pr, it's closed. We want to avoid using additional libraries. Try building that branch. Edit link https://github.com/beiller/llama.cpp/tree/feature/tokenization |
What about writing tests that compare the python implementation of tokenizer from original llama code with the current tokenizer implementation in llama.cpu and then fixing the llama.cpu tokenizer? This way we wouldn't have to add another dependency to libsentencepiece. |
My observations: Notice whitespace. They're different. Don't know why python library doesn't show it but that's how it is when talking directly to the c++ library. Sentencepiece always encodes first token with whitespace even if you ask to prepend
That's the difference with current tokenizer. |
It looks like SentencePiece has an option |
Extracted these options from the tokenizer model protobuf:
|
I implemented something very similar here: I haven't tested it yet, I'm on a train and can't download the datasets to run llama, but I'll do it this evening/tomorrow. |
The recently merged #242 still isn't accurate, for example:
Note that in the PR max_len (and with it MAX_TOKEN_LEN) is never used, so I imagine that there are some oversights there. |
ya, it's still not right. for starters you need the actual scores from the tokenizer model. Also, the article above is discussing the SentencePiece unigram model, not its BPE model, which is what LLaMA uses. |
#242 doesn't improve the responses in any way I was able to measure |
I don't expect it will. I read sentencepiece frontpage documentation and it says it uses regularization at training time. Basically, it randomly creates suboptimal tokenized strings to improve robustness. It is likely that model is already robust to variances between different tokenizers.
|
The problem is in python it's taking raw tokens and placing them in the file. It may be missing real tokens and proper whitespace. You can use my branch to build and compare tokenizers. It's important to fix but it's probably very subtle how it affects the output. |
I think we'd have to do a backwards-incompatible file format change to support all the tokenizer's features; it also gives us a chance to do some things needed by #91, like proper data alignment ahead of time. |
The commit where it was fixed to support utf8 is here but I fear it might be responsible for whitespace breaks |
On this note, I think we'd be best served parsing the tokenizer model directly (it's just a protobuf) and converting it that way. I can work on doing that, if there aren't any objections. |
That was accomplished in this pr (using python protobuf to parse the tokenizer directly) But the protobuf methods are directly exposed in sentence piece which was determined to be a cleaner solution. |
Fixed in #252 |
In this case the llama.cpp and the llama tokenizers produce different output:
Meanwhile the llama tokenizer produces:
So in one case "This" is encoded as 4013 and other as 910. I have verified that both ids decode to the same text:
I am not sure if this causes any significant differences in the generation but it may be a good idea to check it.
The text was updated successfully, but these errors were encountered: