Breaking changes in v0.19.1 for tiktoken/llama3 #1512
Comments
I am waiting for the #1513 breaking changes to land before starting continual pretraining of LLaMA-3 with an extended vocabulary. Not sure when this merge will happen (v0.19.2, I guess), as it is critical for running LLaMA-3 on non-English corpora. Cheers,
@thusinh1969 What are you finding wrong with 0.19.1?
The decoder was buggy for added tokens when extending the vocabulary for non-English text. It is being fixed, I think. Steve
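For context, here is a minimal sketch of the kind of round trip this refers to, assuming the Llama-3 tokenizer is loaded through transformers; the model name and the added tokens below are illustrative placeholders, not the reporter's actual setup:

```python
# Hedged sketch: extend a Llama-3 tokenizer with new tokens and check that
# decoding the added tokens round-trips. Model name and tokens are assumptions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

new_tokens = ["xin_chào", "tiếng_Việt"]   # hypothetical non-English tokens
tok.add_tokens(new_tokens)

for t in new_tokens:
    ids = tok.encode(t, add_special_tokens=False)
    roundtrip = tok.decode(ids)
    # With the decoder bug described above, `roundtrip` could differ from `t`
    # for added tokens; once fixed, the two should match.
    print(t, ids, repr(roundtrip))
```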
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Yep, the breaking change will be reverted, but we will still ship the new support for adding tokens to BPE. Just gimme a week!
v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]
v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]
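The comparison above can be reproduced with a short round-trip check; this is a sketch assuming the tokenizer is loaded through transformers, and the model name is an assumption rather than something stated in the issue:

```python
# Hedged repro sketch for the output above: decode id 112328, re-encode the
# string, and inspect the resulting (id, piece) pairs. Model name is assumed.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = tok.decode([112328])                       # ' Arthropoda'
ids = tok.encode(text, add_special_tokens=False)  # re-encode the decoded string

# Under tokenizers 0.19.0 this printed three (id, piece) pairs;
# under 0.19.1 it prints a single pair for the same string.
print([(i, tok.decode([i])) for i in ids])
```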
I have good evidence that the new behaviour is how the model was trained, but the announcement of the patch release should perhaps be a little louder in advising users to, e.g., retokenize all training data for the affected model families.