Investigate tokenizer behaviour to understand why it is different to Huggingface Tokenizer #4
Comments
Thanks for the ticket, I think this could be a bit of a tricky one to debug: the GGML GPT-J tokenizer is implemented from scratch, whereas the Huggingface Codegen tokenizer has a bunch of token-merging logic which I don't think GGML's tokenizer has (I will try to confirm). I can't comment on whether this is likely to significantly impact the performance of the model; that would need testing empirically. Was there a specific use case you have in mind that this is blocking?
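For intuition, here's a toy sketch (invented vocab and merge rules, not code from either implementation) of how a merge-based tokenizer and a plain greedy vocab matcher can split the same string differently; once the splits diverge, token counts diverge too on longer inputs:

```python
# Toy illustration only: the vocab and merge table are made up.
vocab = ["a", "b", "c", "ab", "bc"]
merges = [("b", "c")]  # HF-style merge rules; the greedy matcher ignores them

def bpe(word):
    # Start from characters and apply merge rules in priority order.
    parts = list(word)
    for a, b in merges:
        i = 0
        while i < len(parts) - 1:
            if (parts[i], parts[i + 1]) == (a, b):
                parts[i:i + 2] = [a + b]
            else:
                i += 1
    return parts

def greedy(word):
    # Longest-prefix match against the vocab, with no merge table.
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                out.append(word[i:j])
                i = j
                break
        else:  # unknown character: emit it as-is
            out.append(word[i])
            i += 1
    return out

print(bpe("abc"))     # ['a', 'bc']  -> different tokens...
print(greedy("abc"))  # ['ab', 'c']  -> ...for the same input
```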
Hey, yeah, I was planning to use this for benchmarking 4-bit performance of Codegen models. Most of the prompts I have are over 1500 tokens, and these overflow the 2048-token limit when tokenized incorrectly. I guess one way to get around this is to accept pretokenized inputs.
Ah OK, that makes sense, thanks for clarifying. I will look into the tokenizer behaviour properly, probably over the weekend, but in the meantime I will see if I can add a REST endpoint to the codegen server that accepts an array of tokens as a JSON list. Then you can pretokenize your input using the Huggingface tokenizer. I'll keep you posted!
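Something like this is what I have in mind; the route, port, and JSON field names below are placeholders, not a finalised API:

```python
# Hypothetical usage sketch: tokenize locally with the Huggingface
# tokenizer, then POST raw token IDs so the server's own tokenizer is
# bypassed. Endpoint path and payload shape are assumptions.
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
token_ids = tokenizer.encode(open("prompt.txt").read())

response = requests.post(
    "http://localhost:8080/completions",            # placeholder route
    json={"tokens": token_ids, "max_tokens": 128},  # placeholder fields
)
print(response.json())
```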
Thanks! I just created a PR here to allow pretokenized inputs: ravenscroftj/ggml#2. It seems to work fine for me.
That's really cool, thank you for your contribution. I have accepted the MR. I will leave this ticket open as a reminder to look into the tokenizer behaviour anyway. Sidenote: I'd be really interested in your evaluation of the 4-bit model if you're willing to share it!
If we consider the following prompt, Huggingface's tokenizer says there are 1144 tokens, whereas the 2B model's logs show 1473 tokens and the 6B model's logs show 1222 tokens. I downloaded the models from the Google Drive and have not quantized them myself. I'm not sure of the cause of this discrepancy.
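For reference, this is roughly how I'm getting the Huggingface-side count; the checkpoint name and prompt file are placeholders, since the variant isn't pinned down above:

```python
# Minimal sketch of reproducing the Huggingface token count.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-2B-mono")
prompt = open("prompt.txt").read()  # the prompt in question
print(len(tokenizer.encode(prompt)))  # 1144 here vs 1473 in the 2B GGML log
```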