fix: tokenization of special characters: #850

Merged

Conversation

@antoine-lizee antoine-lizee commented Oct 30, 2023

It should behave like llama.cpp, where most out-of-the-box usages treat special characters accordingly. See #838 (comment) for more details.

I checked that with this fix, the vanilla call to llm.create_completion(temperature=0) leads to exactly the same results for a simple chat prompt as using ./main --temp 0 from llama.cpp - which it didn't before.

I also changed the behaviour for the embeddings and the LlamaTokenizer. I'm missing context so I might be wrong on those, but I figured it would be good to be consistent.
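For anyone skimming, a minimal sketch of what "treating special characters accordingly" means at the tokenizer level. The model path is hypothetical, and this assumes the `tokenize()` binding exposes a `special` flag, as in recent llama-cpp-python versions:

```python
from llama_cpp import Llama

# Hypothetical local model path, for illustration only.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

prompt = b"<s>[INST] Hello [/INST]"

# With special-token parsing enabled, "<s>" should become the single BOS
# token id instead of being split into literal "<", "s", ">" pieces.
with_special = llm.tokenize(prompt, add_bos=False, special=True)
without_special = llm.tokenize(prompt, add_bos=False, special=False)

print(with_special[:3])     # expected to start with the BOS id (1 for llama2)
print(without_special[:3])  # expected to start with plain-text token ids
```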

antoine-lizee commented Oct 30, 2023

This should also make Chat Templates work properly (#711), provided that we update a few of them with the eos in the right place (e.g. </s> for llama2) - see the sketch below. It should solve #801, and may address #800?
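For reference, a rough sketch of a llama2-style prompt with the eos where this change expects it; the exact template strings here are my assumption, not copied from the chat format handlers:

```python
# Sketch of a llama2-style chat prompt; with special-token tokenization,
# "<s>" and "</s>" are mapped to the real BOS/EOS token ids.
system = "You are a helpful assistant."
history = [("Hi there", "Hello! How can I help?")]
user_msg = "Tell me a joke"

prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
for user, assistant in history:
    prompt += f"{user} [/INST] {assistant} </s><s>[INST] "
prompt += f"{user_msg} [/INST]"
```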

@antoine-lizee

@abetlen In case you missed this.

fourdim commented Nov 1, 2023

What about removing the empty test.py file?

abetlen commented Nov 1, 2023

@antoine-lizee looks good. I'm slightly hesitant to change the default behaviour of the completion function; would it be sufficient to only do this for chat_completion?

fourdim commented Nov 1, 2023

Nope, that would not be sufficient. In my case, I'm infilling code using bigcode/starcoder.
It has the special tokens <fim_prefix>, <fim_suffix>, and <fim_middle>, which guide starcoder to infill code in the middle rather than do normal completion.
If special is set to False, the model only outputs something random.
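For illustration, a rough sketch of that infilling setup (the model path is hypothetical, assuming a local GGUF conversion of bigcode/starcoder):

```python
from llama_cpp import Llama

# Hypothetical local GGUF conversion of bigcode/starcoder.
llm = Llama(model_path="./models/starcoder.Q4_K_M.gguf")

prefix = "def add(a, b):\n"
suffix = "\n    return result\n"

# Fill-in-the-middle prompt: the <fim_*> markers only steer the model if
# they are tokenized as single special tokens, not as literal text.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

out = llm.create_completion(prompt, max_tokens=32, temperature=0)
print(out["choices"][0]["text"])
```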

@abetlen abetlen merged commit 47ca05a into abetlen:main Nov 2, 2023
abetlen commented Nov 2, 2023

I'll go ahead and merge this in as-is for now; I should have time in the next week to address any issues if this causes breaking changes.

@antoine-lizee thank you for the contribution!

abetlen pushed a commit that referenced this pull request Nov 2, 2023:
It should behave like llama.cpp, where most out-of-the-box usages treat special characters accordingly.
abetlen added a commit that referenced this pull request Nov 3, 2023
* Add low-level batching notebook

* fix: tokenization of special characters: (#850)

It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly

* Update CHANGELOG

* Cleanup

* Fix runner label

* Update notebook

* Use llama_decode and batch api

* Support logits_all parameter

---------

Co-authored-by: Antoine Lizee <[email protected]>