Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement non-greedy tokenizer that tries to maximize token lengths #242

Merged
merged 2 commits into from
Mar 17, 2023

Conversation

thement
Copy link
Contributor

@thement thement commented Mar 17, 2023

No description provided.

@Piezoid
Copy link
Contributor

Piezoid commented Mar 17, 2023

Although I haven't examined the code, I've tested it on several prompts and can already conclude that this patch allows Llama to write in French.
Previously it spouted nonsense before falling back to english.

@thement thement force-pushed the thement/fix-tokenizer branch from e40d4e0 to 3838e51 Compare March 17, 2023 17:14
@thement thement marked this pull request as ready for review March 17, 2023 17:14
utils.cpp Outdated Show resolved Hide resolved
@@ -846,6 +846,7 @@ int main(int argc, char ** argv) {
std::vector<float> logits;

// tokenize the prompt
params.prompt.insert(0, 1, ' ');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the space meant to be a separate token? I noticed that it often get fused with the first user provided token.

Copy link
Contributor Author

@thement thement Mar 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fused to the first token! This is how original python llama code parses it.
I can dig out more details if you want.

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge it if results look ok.
I won't be able to have detailed look in the next few days

@thement thement force-pushed the thement/fix-tokenizer branch from 3838e51 to 7566d1a Compare March 17, 2023 17:45
@thement thement merged commit c9f670a into ggerganov:master Mar 17, 2023
mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 17, 2023
…gerganov#242)

* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <[email protected]>
antimatter15 pushed a commit to antimatter15/alpaca.cpp that referenced this pull request Mar 18, 2023
…gerganov#242)

* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <[email protected]>
gyohng pushed a commit to gyohng/alpaca.cpp that referenced this pull request Mar 26, 2023
…gerganov#242)

* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt

- this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants