fix: tokenization of special characters: #850

Merged

Conversation

@antoine-lizee antoine-lizee commented Oct 30, 2023

It should behave like llama.cpp, where most out-of-the-box usages treat special characters accordingly. See #838 (comment) for more details.

I checked that with this fix, the vanilla call to llm.create_completion(temperature=0) leads to exactly the same results for a simple chat prompt as using ./main --temp 0 from llama.cpp - which it didn't before.

I also changed the behaviour for the embeddings and the LlamaTokenizer. I'm missing context so I might be wrong on those, but I figured it would be good to be consistent.
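For anyone skimming, a minimal sketch of what "treating special characters accordingly" means at the tokenizer level. The model path is hypothetical, and this assumes the `tokenize()` binding exposes a `special` flag, as in recent llama-cpp-python versions:

```python
from llama_cpp import Llama

# Hypothetical local model path, for illustration only.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf")

prompt = b"<s>[INST] Hello [/INST]"

# With special-token parsing enabled, "<s>" should become the single BOS
# token id instead of being split into literal "<", "s", ">" pieces.
with_special = llm.tokenize(prompt, add_bos=False, special=True)
without_special = llm.tokenize(prompt, add_bos=False, special=False)

print(with_special[:3])     # expected to start with the BOS id (1 for llama2)
print(without_special[:3])  # expected to start with plain-text token ids
```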

antoine-lizee commented Oct 30, 2023

This should also make Chat Templates work properly (#711), provided that we update a few of them with the eos in the right place (e.g. </s> for llama2) - see the sketch below. It should solve #801, and may address #800?
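For reference, a rough sketch of a llama2-style prompt with the eos where this change expects it; the exact template strings here are my assumption, not copied from the chat format handlers:

```python
# Sketch of a llama2-style chat prompt; with special-token tokenization,
# "<s>" and "</s>" are mapped to the real BOS/EOS token ids.
system = "You are a helpful assistant."
history = [("Hi there", "Hello! How can I help?")]
user_msg = "Tell me a joke"

prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
for user, assistant in history:
    prompt += f"{user} [/INST] {assistant} </s><s>[INST] "
prompt += f"{user_msg} [/INST]"
```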

@antoine-lizee

@abetlen In case you missed this.

fourdim commented Nov 1, 2023

What about removing the empty test.py file?

abetlen commented Nov 1, 2023

@antoine-lizee looks good. I'm slightly hesitant to change the default behaviour of the completion function; would it be sufficient to only do this for chat_completion?

fourdim commented Nov 1, 2023

Nope, that would not be sufficient. In my case, I'm infilling code using bigcode/starcoder.
It has the special tokens <fim_prefix>, <fim_suffix>, and <fim_middle>, which guide starcoder to infill code in the middle rather than do normal completion.
If special is set to False, the model only outputs something random.
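For illustration, a rough sketch of that infilling setup (the model path is hypothetical, assuming a local GGUF conversion of bigcode/starcoder):

```python
from llama_cpp import Llama

# Hypothetical local GGUF conversion of bigcode/starcoder.
llm = Llama(model_path="./models/starcoder.Q4_K_M.gguf")

prefix = "def add(a, b):\n"
suffix = "\n    return result\n"

# Fill-in-the-middle prompt: the <fim_*> markers only steer the model if
# they are tokenized as single special tokens, not as literal text.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

out = llm.create_completion(prompt, max_tokens=32, temperature=0)
print(out["choices"][0]["text"])
```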

@abetlen abetlen merged commit 47ca05a into abetlen:main Nov 2, 2023
abetlen commented Nov 2, 2023

I'll go ahead and merge this in as-is for now; I should have time in the next week to address any issues if this causes breaking changes.

@antoine-lizee thank you for the contribution!

abetlen pushed a commit that referenced this pull request Nov 2, 2023:
It should behave like llama.cpp, where most out-of-the-box usages treat special characters accordingly.
abetlen added a commit that referenced this pull request Nov 3, 2023
* Add low-level batching notebook

* fix: tokenization of special characters: (#850)

It should behave like llama.cpp, where most out of the box usages
treat special characters accordingly

* Update CHANGELOG

* Cleanup

* Fix runner label

* Update notebook

* Use llama_decode and batch api

* Support logits_all parameter

---------

Co-authored-by: Antoine Lizee <[email protected]>