FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

j-f1 · 2023-03-13T02:26:47Z

Everything seems to be working fine after regenerating and requantizing the 7B model! ~~There may still be issues with printing the tokens, my quantization step hasn’t finished yet so I haven’t tested the updated models.~~

I decided to vendor the protobuf file (and the .py file generated via protoc --python_out=. sentencepiece_model.proto) since they are very very unlikely to change and so that the install process can remain simple.

beiller · 2023-03-13T02:55:27Z

Nice! You will still have to update the tokenizer in C++ code quite a bit. I think this is a test prompt to verify it is working:

关于爱因斯坦的生平。他出生于

If not you can try just this character as prompt:

篇篇篇篇篇篇

beiller · 2023-03-13T03:04:34Z

You will also need to replace spaces in input text to the unicode underscore you used in the python script now in order for it to find any token with a space.

j-f1 · 2023-03-13T03:28:39Z

Seems fine to me? (Using Apple Terminal). The token readout at the start is messed up as expected (since some of the tokens aren’t valid UTF-8 strings) but that’s fine IMO.

关于爱因斯坦的生平。他出生于1946年，但在不同地点时间都有频次改变了名字（）；正确为：谢克尔斯特·安丁，英国父母姨为自由民主的家族。所以他总是被称作“不同”或者在其他地方 “外” (abroad)、也就说：（隐名）；现在最后正常人生余句牧?

.

篇篇篇篇篇篇 资料：
1、本发明（权利要求一项）阶段的某些实现，可以通过例如所有元件和连接图表来对该目标设计。 For example, the embodiment of (claim one) is achieved at a certain stage invention, can be performed by all elements and connected graphic table to design target. 这些实现可以通过于权利要求一项所规定的表格，将其降?

.

The answer to the question of life, the universe, and everything is 42.
It's also the name given to a small device that can be used by any adventurer as protection from an assortment of dangers encountered while exploring strange new worlds or even just your own backyard. You don’t need much: some rope for climbing and swimming, maybe light-weight armor (maybe?), maybe a sword if you want to be fancy (grin).
This is the only game in town! [end of text]

Compare the 13B model (without my patch):

关于��确��获��名的事实
中国人民大会常

You will also need to replace spaces in input text to the unicode underscore you used in the python script now in order for it to find any token with a space.

Those underscore things are in the token file, so I’m replacing them with a regular space when constructing the ggml bin file. I don’t think the C++ code needs to be updated to handle that?

beiller · 2023-03-13T03:46:34Z

Oh interesting, right you are that is good. But beware of the input. What does the output from the program report about your input prompt? Cause your input may be garbled I would assume from this code unable to find tokens, and the input garbled can still result in a correct output prompt.

beiller · 2023-03-13T03:48:21Z

See here:

std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text)

The tokenizer in this code can only return 1 token per string. You need multiple tokens for a string.

Oh edit maybe Im wrong, wrong function!

std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos)

Maybe it just works!??

beiller · 2023-03-13T03:53:01Z

Please check the sequence of tokens. Using the tokenizer I get this and yours should match (I also have garbled at the start its consequence of the other code there):

 29871 -> ' '
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

So

爱 = [234, 139, 180]
坦 = [232, 160, 169]

j-f1 · 2023-03-13T04:08:38Z

Looks right:

     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '?'
   139 -> '?'
   180 -> '?'
 31570 -> '因'
 31824 -> '斯'
   232 -> '?'
   160 -> '?'
   169 -> '?'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

beiller · 2023-03-13T04:09:53Z

That's beautiful ship it! But now I have to regenerate my models :(

kharvd · 2023-03-13T04:23:24Z

I think this might work to avoid using protobuf?

    for i in range(32000):
        if tokenizer.is_unknown(i):
            # "<unk>" token (translated as ??)
            text = " \u2047 ".encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)
        elif tokenizer.is_control(i):
            # "<s>"/"</s>" tokens
            fout.write(struct.pack("i", 0))
        elif tokenizer.is_byte(i):
            # "<U+XX>" tokens (which may be invalid UTF-8)
            piece = tokenizer.id_to_piece(i)
            if len(piece) != 6:
                print("Invalid token: " + piece)
                sys.exit(1)
            byte_value = int(piece[3:-1], 16)
            fout.write(struct.pack("i", 1))
            fout.write(struct.pack("B", byte_value))
        else:
            # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)

I can see that it writes the correct bytes, but my terminal has a hard time handling them for some reason.

beiller · 2023-03-13T04:31:06Z

@kharvd yes that is true. I'm somewhat confused because sentencepiece uses protobuf. Maybe the c++ version compiled into python wheel has it built in which made it not possible to use as a sub package from sentencepiece? Either way I think that approach also works.

But its irony we don't want to use sentencepiece C++ library, so instead we will use require sentencepiece python library 😅

Here is the PR to include sentencepiece C++ library

#66

Happy to close it if we merge this masterpiece. But some questions remain such as this model is not portable between the webui etc now. If we used C++ version we could have portable model files floating around between the two projects I think.

kharvd · 2023-03-13T04:38:16Z

Oh yeah, I figured out why my terminal still made weird characters: it's the --color option that was interfering with it. @j-f1 I'm pretty sure my approach above works, no need to ship a protobuf file

kharvd · 2023-03-13T04:41:34Z

Sample output:

main: prompt: '关于爱因斯坦的生平。他出生于'
main: number of tokens in prompt = 19
     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

sampling parameters: temp = 1.000000, top_k = 100, top_p = 0.900000, repeat_last_n = 64, repeat_penalty = 1.300000


关于爱因斯坦的生平。他出生于一个小让之家在北京， 烧马拉龙逾八十多时间从未有晚上亲公夜光不能看成眼所会，还发现父母身高也只超过五

beiller · 2023-03-13T04:54:25Z

@kharvd what model are you using there, The google translate of your output appears to be gibberish. I think we need a translator :)

Heres some examples from me at 16B model
关于爱因斯坦的生平。他出生于1856年，但是在1930年才获得了诺贝尔奖。
雷恩·拉德森（英语：Raymond Pearl）(1879–1942) 是一位美国科学家和医学教育专家。
Google Translate:
About Einstein's life. He was born in 1856, but won the Nobel Prize in 1930.
Raymond Pearl (1879–1942) was an American scientist and medical educator.

关于爱因斯坦的生平。他出生于1856年，是一位欧洲科学家和教育家。他在1902年获得了诺基丛大学院士学位。
高中英语作文: 美国人物——Albert Einstein
下面为大家提供一篇关于Albert Einstein的作文，希望对同学们能够进行更多的研究。
Google Translate:
About Einstein's life. Born in 1856, he was a European scientist and educator. In 1902 he received a bachelor's degree from Norwich College.
High School English Composition: American Characters - Albert Einstein
The following is a composition about Albert Einstein, hoping to conduct more research on students.

I dont have 7B ready at the moment but it shouldn't be that bad I didn't think?

Here is the Google Translate of your output:
About Einstein's life. He was born in a small family in Beijing. He has burned maralong for more than 80 years and has never had a night when his father-in-law's night light cannot be seen. He also found that his parents are only over five

kharvd · 2023-03-13T04:57:14Z

This is 7B

beiller · 2023-03-13T05:05:58Z

Ah no worries with your settings I also get gibberish at 16B

sampling parameters: temp = 1.000000, top_k = 100, top_p = 0.900000, repeat_last_n = 64, repeat_penalty = 1.300000

Try

--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647

kharvd · 2023-03-13T05:07:00Z

Here's 13B with default parameters:

main: prompt: '关于爱因斯坦的生平。他出生于'
main: number of tokens in prompt = 19
     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


关于爱因斯坦的生平。他出生于1856年，就是一位德国化学家、天文学家和温谐器研究者。20世紀最初时期在高飞航母中被发现，爱因斯坦对此使用

"About Einstein's life. Born in 1856, he was a German chemist, astronomer and thermostat researcher. Found in high-flying aircraft carriers in the early 20th century, Einstein used"

kharvd · 2023-03-13T05:07:36Z

#79 - Fixes both #11 and color handling

j-f1 · 2023-03-13T16:01:22Z

Closing, #79 is better.

(processing with grep, less, etc.)

j-f1 mentioned this pull request Mar 13, 2023

Add sentencepiece tokenizer and modify build (Support UTF-8 / Emoijs) #66

Closed

j-f1 added 2 commits March 12, 2023 23:27

FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf

34af8a9

Add __pycache__ and *.bin to gitignore

b8f2071

j-f1 mentioned this pull request Mar 13, 2023

Unicode support #11

Closed

j-f1 force-pushed the fix-tokens branch from c4e7de8 to b8f2071 Compare March 13, 2023 03:41

kharvd mentioned this pull request Mar 13, 2023

Fix UTF-8 handling (including colors) #79

Merged

j-f1 closed this Mar 13, 2023

j-f1 deleted the fix-tokens branch March 13, 2023 16:01

ggerganov mentioned this pull request Mar 13, 2023

Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87

Closed

beiller mentioned this pull request Mar 18, 2023

Differences with the llama tokenizer #167

Closed

SagsMug mentioned this pull request Apr 26, 2023

Fix UnicodeDecodeError permanently abetlen/llama-cpp-python#118

Merged

taiyou2000 mentioned this pull request Jun 28, 2023

mpt-7b-ggml generating garbled characters LostRuins/koboldcpp#272

Open

mqy mentioned this pull request Jul 3, 2023

[llama] Add resegment post processing of tokenizer #2072

Draft

44670 pushed a commit to 44670/llama.cpp that referenced this pull request Aug 2, 2023

Enable pipe-friendly help output (ggerganov#73)

66aa59e

(processing with grep, less, etc.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

j-f1 commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023

beiller commented Mar 13, 2023

j-f1 commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023

beiller commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023 •

edited

Loading

j-f1 commented Mar 13, 2023

beiller commented Mar 13, 2023

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023 •

edited

Loading

kharvd commented Mar 13, 2023

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023 •

edited

Loading

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023

kharvd commented Mar 13, 2023

kharvd commented Mar 13, 2023

j-f1 commented Mar 13, 2023

FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

FIx parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

Conversation

j-f1 commented Mar 13, 2023 • edited Loading

beiller commented Mar 13, 2023

beiller commented Mar 13, 2023

j-f1 commented Mar 13, 2023 • edited Loading

beiller commented Mar 13, 2023

beiller commented Mar 13, 2023 • edited Loading

beiller commented Mar 13, 2023 • edited Loading

j-f1 commented Mar 13, 2023

beiller commented Mar 13, 2023

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023 • edited Loading

kharvd commented Mar 13, 2023

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023 • edited Loading

kharvd commented Mar 13, 2023

beiller commented Mar 13, 2023

kharvd commented Mar 13, 2023

kharvd commented Mar 13, 2023

j-f1 commented Mar 13, 2023

j-f1 commented Mar 13, 2023 •

edited

Loading

j-f1 commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023 •

edited

Loading

beiller commented Mar 13, 2023 •

edited

Loading