Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

byte_decoder -> byte_encoder #3021

Closed
wants to merge 1 commit into from
Closed

byte_decoder -> byte_encoder #3021

wants to merge 1 commit into from

Conversation

simonJJJ
Copy link
Contributor

@simonJJJ simonJJJ commented Sep 5, 2023

byte_encoder is mapping 0-255 -> unicode chr
byte_decoder is mapping unicode chr -> 0-255
ord returns the int of unicode chr, so it should use byte_encoder

@ggerganov take a look~

@ggerganov ggerganov requested a review from klosax September 5, 2023 06:50
@ggerganov
Copy link
Owner

Thanks for looking into this. Would like some more eyes on this to confirm it's OK to change

Btw, I don't see any difference in the output of test-tokenizer-0 with this change.
What would be expected effects of this?

@ggerganov ggerganov added the need feedback Testing and feedback with results are needed label Sep 5, 2023
@simonJJJ
Copy link
Contributor Author

simonJJJ commented Sep 5, 2023

@ggerganov To give an uncommon example, when converting to ggml format, if the vocab contains Chinese and English in a subword, this will cause a bug.

@ggerganov
Copy link
Owner

Thanks for clarifying!
Let's wait for @klosax or @goerch to see if they have any feedback and we can merge.
I just lack the understanding of how unicode works and this is a blind change for me.

@goerch
Copy link
Collaborator

goerch commented Sep 5, 2023

Thanks for clarifying! Let's wait for @klosax or @goerch to see if they have any feedback and we can merge. I just lack the understanding of how unicode works and this is a blind change for me.

I'm not as fast as you guys, but I fully agree that changes without accompanying tests showing the effects are rarely acceptable.

@simonJJJ
Copy link
Contributor Author

simonJJJ commented Sep 6, 2023

@goerch, should we add a testcase for the convert scripts?

@cebtenzzre
Copy link
Collaborator

should we add a testcase for the convert scripts?

All we really need is an example of a model (and a prompt, if conversion succeeds either way) that can reproduce the incorrect behavior with the old code and correct behavior with the new code. Realistic unit tests for model conversion are not straightforward.

@goerch
Copy link
Collaborator

goerch commented Sep 18, 2023

@goerch, should we add a testcase for the convert scripts?

I added a test and some fixes for Falcon-7B, i.e. the GPT2 tokenizer, which seems to work for me. Would be great if you could test if this helps.

@cebtenzzre
Copy link
Collaborator

cebtenzzre commented Sep 27, 2023

The tokenizer conversion on master does not work with e.g. mpt-7b, which I am trying to write a GGUF conversion script for. byte_decoder is obviously wrong, as it maps unicode to integers, but it is trying to index it with an integer:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 121, in <module>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "convert_mpt_hf_to_gguf.py", line 121, in <listcomp>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                      ~~~~~~~~~~~~^^^
KeyError: ' '

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 126, in <module>
    text.append(byte_decoder[ord(c)])
                ~~~~~~~~~~~~^^^^^^^^
KeyError: 32

But this solution does not work either:

Traceback (most recent call last):
  File "convert_mpt_hf_to_gguf.py", line 126, in <module>
    text.append(byte_encoder[ord(c)])
TypeError: 'str' object cannot be interpreted as an integer

Because text is a bytearray and byte_encoder returns a unicode string.

edit: I think the correct fix is to replace ord(c) with just c, to match what it does if an exception wasn't thrown.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need feedback Testing and feedback with results are needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants