Breaking changes in v0.19.1 for tiktoken/llama3 #1512
Comments
I am waiting for the #1513 breaking changes to land before starting continual pretraining of LLaMA-3 with an extended vocabulary. Not sure when this merge will happen (v0.19.2, I guess), as it is critical for running LLaMA-3 on non-English corpora. Cheers,
@thusinh1969 What are you finding wrong with 0.19.1?
The decoder was buggy for added tokens when extending the vocabulary for non-English text. It is being fixed, I think. Steve
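For context, here is a minimal sketch of the kind of round trip this refers to, assuming the Llama-3 tokenizer is loaded through transformers; the model name and the added tokens below are illustrative placeholders, not the reporter's actual setup:

```python
# Hedged sketch: extend a Llama-3 tokenizer with new tokens and check that
# decoding the added tokens round-trips. Model name and tokens are assumptions.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

new_tokens = ["xin_chào", "tiếng_Việt"]   # hypothetical non-English tokens
tok.add_tokens(new_tokens)

for t in new_tokens:
    ids = tok.encode(t, add_special_tokens=False)
    roundtrip = tok.decode(ids)
    # With the decoder bug described above, `roundtrip` could differ from `t`
    # for added tokens; once fixed, the two should match.
    print(t, ids, repr(roundtrip))
```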
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Yep, the breaking change will be reverted, but we will still ship the new support for adding tokens to BPE. Just gimme a week!
v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]
v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]
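The comparison above can be reproduced with a short round-trip check; this is a sketch assuming the tokenizer is loaded through transformers, and the model name is an assumption rather than something stated in the issue:

```python
# Hedged repro sketch for the output above: decode id 112328, re-encode the
# string, and inspect the resulting (id, piece) pairs. Model name is assumed.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = tok.decode([112328])                       # ' Arthropoda'
ids = tok.encode(text, add_special_tokens=False)  # re-encode the decoded string

# Under tokenizers 0.19.0 this printed three (id, piece) pairs;
# under 0.19.1 it prints a single pair for the same string.
print([(i, tok.decode([i])) for i in ids])
```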
I have good evidence that the new behaviour is how the model was trained, but the announcement of the patch release should perhaps be a little louder in advising users to, e.g., retokenize all training data for the affected model families.