Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work on the BPE tokenizer #3252

Merged
merged 35 commits into from
Oct 3, 2023
Merged

Work on the BPE tokenizer #3252

merged 35 commits into from
Oct 3, 2023

Conversation

goerch
Copy link
Collaborator

@goerch goerch commented Sep 18, 2023

  • vocabulary conversion changed
  • necessary Unicode support added
  • neccessary code from examples/gptneox-wip migrated
  • new tokenizer character test added for Falcon-7B

Tokenizer tests seem to work.

Fixes #3233.

Tokenizer tests work for Falcon-7B
@cebtenzzre
Copy link
Collaborator

FYI, the MSVC build failed.

@staviq
Copy link
Contributor

staviq commented Sep 18, 2023

That cuda CI is still slow because it is still fetching full cuda installer from GitHub cache, Green-Sky said he's gonna try clearing the cache sometime in the future :)

You can bypass that, if you have access to a windows machine, from your GitHub account settings/actions/runners get a self hosted runner script, and after you run it, give it a "windows-latest" tag, then, if you run the CI manually from your own repo ( fork ) it will run on your machine, without cache, and that pipeline takes ~4min.

This was referenced Sep 18, 2023
@Green-Sky
Copy link
Collaborator

@staviq i went ahead and manually purged the caches anyway, if they still are created, i will have to take a look at the ci script again.

@staviq
Copy link
Contributor

staviq commented Sep 18, 2023

@staviq i went ahead and manually purged the caches anyway, if they still are created, i will have to take a look at the ci script again.

Looks like correct ones just got cached.

@KerfuffleV2
Copy link
Collaborator

I tried testing with https://huggingface.co/beomi/llama-2-ko-7b

Unfortunately, we immediately die at

llama.cpp/llama.cpp

Lines 1875 to 1880 in 8781013

// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
}

If I hard code line 1879 to just set it to 13 (which seems to be correct for the model) it runs into another byte not found.

There's a very good chance the model vocab isn't getting converted correctly. That model also requires manually generating the vocab.json from tokenizer.json (trivial Python script can do this). Aside from the extra Korean stuff it seems like the vocab is the same as a normal LLaMA model. I.E. the start of vocab.json:

{"<unk>": 0, "<s>": 1, "</s>": 2, "<0x00>": 3, "<0x01>": 4, "<0x02>": 5, "<0x03>": 6, "<0x04>": 7

I also tried https://huggingface.co/BAAI/Aquila-7B

Basically the same issues. Also, as far as I can see, this one actually doesn't even have a newline in its vocabulary. It has a less LLaMA-like vocab. The byte stuff at the beginning doesn't cover the whole ASCII range like Falcon though. I remember screwing around with this and a hacked version of the Falcon UTF generator stuff in its conversion script.

I compared the with master (just with the Aquila model) and I can get some kind of output with the newline hardcode fix which wasn't possible in this pull. It can't find space ( ) in the vocab so there are no spaces between words. It also can't seem to tokenize Chinese characters.

I hate to say it, but at least in terms of how close it visibly seems to working, this pull seems further away than master since I can't get either model to run at all. One difference seems to be that ERROR: byte not found in vocab just seems to be a warning in master while it's a hard error in this pull.

@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

@KerfuffleV2 : thanks for taking the time!

I tried testing with https://huggingface.co/beomi/llama-2-ko-7b

Did you use an old conversion or a conversion built by convert.py? We need the original vocabulary for this PR to work. For my Falcon tests I still used and adapted convert-falcon-hf-to-gguf.py (I seem to remember problems when using convert.py).

I also tried https://huggingface.co/BAAI/Aquila-7B

OK, I have this laying around from earlier tests. I will check this out next.

@KerfuffleV2
Copy link
Collaborator

KerfuffleV2 commented Sep 19, 2023

Did you use an old conversion or a conversion built by convert.py?

I converted using convert.py (since these are LLaMA models) but did a fresh conversion today.

edit:

We need the original vocabulary for this PR to work.

I'm not completely sure what you mean here. I converted from HuggingFace model so that would have the original vocabulary, right? Or did you mean something different?

@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

I converted using convert.py (since these are LLaMA models) but did a fresh conversion today.

OK, but aren't they using 'sentencepiece' tokenization then? BPE is for GPT2 compatible tokenization, AFAIU?

Or are we testing the special tokens problem here?

@KerfuffleV2
Copy link
Collaborator

KerfuffleV2 commented Sep 19, 2023

OK, but aren't they using 'sentencepiece' tokenization then?

From tokenizer.cfg in both of those models:

"type": "BPE",

So they're at least claiming to be BPE? Aquila also has "tokenizer_class": "GPT2Tokenizer", in tokenizer_config.json

By the way, when I converted using convert.py I specified --vocabtype bpe. This sets the tokenizer model in the GGUF file to gpt2 in addition to converting the vocab in the BPE way.

Or are we testing the special tokens problem here?

I don't think so. (I'm far from an expert on this stuff though so what I think might not be useful.)

@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

OK, but aren't they using 'sentencepiece' tokenization then?

From tokenizer.cfg in both of those models:

"type": "BPE",

So they're at least claiming to be BPE? Aquila also has "tokenizer_class": "GPT2Tokenizer", in tokenizer_config.json

Wow, LLaMa based models with a different tokenization algorithm? Very interesting, I have no idea how this works.

By the way, when I converted using convert.py I specified --vocabtype bpe. This sets the tokenizer model in the GGUF file to gpt2 in addition to converting the vocab in the BPE way.

Yes, but convert.py doesn't contain the vocabulary fix of convert-falcon-hf-to-gguf.py (yet).

Or are we testing the special tokens problem here?

I don't think so. (I'm far from an expert on this stuff though so what I think might not be useful.)

It is very helpful, thanks!

Edit: Ah, I think I understand: they are only using the LlaMa transformer architecture.

@KerfuffleV2
Copy link
Collaborator

I have no idea how this works.

Join the club. :) Aquila vocab looks closer to stuff like Falcon than the KR model which seemed to mostly be the same as normal LLaMA except for extra tokens. Maybe it's not actually supposed to be BPE but I think I tried to converting with SPM and it didn't really work either.

Yes, but convert.py doesn't contain the vocabulary fix

I wasn't really happy with #2938 but I figured it was at least better than nothing. At that point, I wasn't sure how to improve it and from GG said it seemed like the C++ side wasn't there yet so it wouldn't matter even if the conversion was perfect.

If you looked at the vocab for Aquila, does it seem the Falcon vocab fix would solve the issue? One thing I noticed is that byte vocab at the beginning didn't cover the range bytes_to_unicode in the Falcon converter uses, so it can't be used without modifications. (Not sure if that function was part of what you're talking about.)

@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

Regarding #2938: scores are used by find_bpe_rank , the original tokenizer uses the merges to construct them (which we currently don't).

  • Adapted convert.py
  • Added a test for Aquila

Edit: I was mistaken : computing the ranks seems to be OK in SpecialVocab.

@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

@klosax : do these fixes to GPT2 tokenization help with other open issues?

@KerfuffleV2
Copy link
Collaborator

Is there anything you want me to test again at this point?

... because everyone is using it.
@goerch
Copy link
Collaborator Author

goerch commented Sep 19, 2023

Is there anything you want me to test again at this point?

I just run inference with an convert.py-ed Aquila and moved byte decoding to token_to_piece, because only the tests use the detokenizers. Output looks halfway reasonable too me. Would be nice to have feed back from your side regarding Aquila and llama-2-ko-7b. I'll check inference with Falcon next.

Edit: Falcon looks fine too me.

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Oct 3, 2023

But the token type we need in token_to_piece to suppress tokens like ´BOSandEOS`, I believe.

This PR explicitly writes every token type as 'normal', my PR introduced a default token type which is 'normal'. There is no difference in behavior.

@goerch
Copy link
Collaborator Author

goerch commented Oct 3, 2023

This PR explicitly writes every token type as 'normal', my PR introduced a default token type which is 'normal'. There is no difference in behavior.

Added tokens get type CONTROL.

@cebtenzzre
Copy link
Collaborator

Added tokens get type CONTROL.

That code is part of convert.py, which is only for use with LLaMA, and is not referenced by the other scripts. I did not touch convert.py in my PR.

ARCH = gguf.MODEL_ARCH.LLAMA

@goerch
Copy link
Collaborator Author

goerch commented Oct 3, 2023

That code is part of convert.py, which is only for use with LLaMA, and is not referenced by the other scripts. I did not touch convert.py in my PR.

Point taken. I can't check it currently, but we also tested Aquila. Your comment indicates a possible other problem.

@cebtenzzre
Copy link
Collaborator

Point taken. I can't check it currently, but we also tested Aquila. Your comment indicates a possible other problem.

Currently this is handled by gguf.SpecialVocab in the simple scripts, and it's left to llama.cpp to handle tokenizer.ggml.bos_token_id/tokenizer.ggml.eos_token_id/tokenizer.ggml.unknown_token_id correctly.

@goerch
Copy link
Collaborator Author

goerch commented Oct 3, 2023

Currently this is handled by gguf.SpecialVocab in the simple scripts, and it's left to llama.cpp to handle tokenizer.ggml.bos_token_id/tokenizer.ggml.eos_token_id/tokenizer.ggml.unknown_token_id correctly.

Here are the Aquila added tokens

"added_tokens": [
    {
      "id": 0,
      "content": "<|endoftext|>",
      "special": true
      [...]
    },
    {
      "id": 100000,
      "content": "<|startofpiece|>",
      "special": false
      [...]
    },
    {
      "id": 100001,
      "content": "<|endofpiece|>",
      "special": false
      [...]
    },
    {
      "id": 100002,
      "content": "<|LDWANG|>",
      "special": false
      [...]
    },
    {
      "id": 100003,
      "content": "[MASK]",
      "special": false
      [...]
    },
    {
      "id": 100004,
      "content": "[gMASK]",
      "special": false
      [...]
    },
    {
      "id": 100005,
      "content": "[sMASK]",
      "special": false
      [...]
    },
    {
      "id": 100006,
      "content": "[CLS]",
      "special": false
      [...]
    },
    {
      "id": 100007,
      "content": "</s>",
      "special": false
      [...]
    }
  ],

which all look like CONTROL tokens that should be filtered out.

Edit: added the special attribute.

@cebtenzzre
Copy link
Collaborator

Here are the Aquila added tokens

You're still talking about convert.py, I'm still not... I guess I don't understand the point you're trying to make. What should I be paying attention to here?


// determine the newline token: LLaMA "<0x0A>" == 10 == '\n', Falcon 193 == '\n'
if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {
vocab.linefeed_id = llama_byte_to_token(vocab, '\n');
} else {
vocab.linefeed_id = llama_tokenize_internal(vocab, "\n", false)[0];
vocab.linefeed_id = llama_tokenize_internal(vocab, "\u010A", false)[0];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed it, unicode linefeed is \u000A

Is that intentional ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 5, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Oct 5, 2023
@goerch
Copy link
Collaborator Author

goerch commented Oct 6, 2023

You're still talking about convert.py, I'm still not... I guess I don't understand the point you're trying to make. What should I be paying attention to here?

Aquila is a LlaMa-like model with a BPE tokenizer, therefore we use convert.py. Do you propose to handle BPE tokenizer differently dependent on context?

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Oct 6, 2023

Do you propose to handle BPE tokenizer differently dependent on context?

No, but I have never used Aquila or looked into its conversion code so I am not familiar with how its vocabulary is processed. If you think there is something the simple scripts are missing that would ensure correct functionality of the tokenizer, you'll have to clearly explain it.

yusiwen pushed a commit to yusiwen/llama.cpp that referenced this pull request Oct 7, 2023
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicodes

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code pathes

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update unicode.h

Co-authored-by: Georgi Gerganov <[email protected]>

* Update unicode.h

Co-authored-by: Georgi Gerganov <[email protected]>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski 's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@maddes8cht
Copy link
Contributor

There is a whole collection of .gguf falcon models derived from the "original" falcon 7b and 40b models at https://huggingface.co/maddes8cht
These models now no longer run in the current llama.cpp versions, but they also cannot be converted with the new convert-falcon-hf-to-gguf.py.
I don't see the model manufacturers converting their models to the new Falcon 7b / 40b variant.
Are all these models therefore lost for the new llama.cpp versions?

@cebtenzzre
Copy link
Collaborator

Are all these models therefore lost for the new llama.cpp versions?

I am adding it back in #3680. Although it really wouldn't be hard for the model authors to update their config.json files, as the model weights haven't changed.

@maddes8cht
Copy link
Contributor

maddes8cht commented Oct 20, 2023

Ah, thanks, that is good news!
So the other derived models are now available as well.
There is now a separate Falcon based collection for them, while the quantized Versions of the original tiiuae falcon models reside under their own collection.

@maddes8cht
Copy link
Contributor

A large bunch of falcon Models

  • with the new tokenizer
  • with the restored compatibility for the older .json files
  • with the new (pseudo?) K-Quants also for 7b-Models
    is available at https://huggingface.co/maddes8cht

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Dec 1, 2023
It seems like upstream isn't interested in this change for the time
being [1], and we are going to break compatiblity with Nomic's previous
conversion of MPT because of changes to the BPE tokenizer [2], so let's
remove this change to minimize the diff.

This reverts commit 69c505e.

[1] ggerganov#3626
[2] ggerganov#3252
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Dec 1, 2023
It seems like upstream isn't interested in this change for the time
being [1], and we are going to break compatiblity with Nomic's previous
conversion of MPT because of changes to the BPE tokenizer [2], so let's
remove this change to minimize the diff.

This reverts commit 69c505e.

[1] ggerganov#3626
[2] ggerganov#3252
cebtenzzre added a commit that referenced this pull request Mar 27, 2024
cebtenzzre added a commit that referenced this pull request Mar 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high priority Very important issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

(Re)activate test-tokenizer-0-falcon
9 participants