
Fix parsing single-byte UTF-8 tokens by manually parsing the protobuf #73

Closed
wants to merge 2 commits

Conversation

@j-f1 (Collaborator) commented Mar 13, 2023

Everything seems to be working fine after regenerating and requantizing the 7B model! There may still be issues with printing the tokens; my quantization step hasn’t finished yet, so I haven’t tested the updated models.

I decided to vendor the protobuf file (and the .py file generated via protoc --python_out=. sentencepiece_model.proto) since they are very unlikely to change, and so that the install process can remain simple.
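For reference, a minimal sketch of reading the tokenizer model back through the vendored generated module (protoc names the output sentencepiece_model_pb2.py by convention; the tokenizer.model path is an assumption, not taken from this PR):

import sentencepiece_model_pb2 as model_pb2

m = model_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:  # path is an assumption
    m.ParseFromString(f.read())
# each SentencePiece entry carries the raw piece text, its score, and its
# type, which is what makes it possible to detect the single-byte pieces
for sp in m.pieces[:5]:
    print(sp.piece, sp.score, sp.type)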

@beiller (Contributor) commented Mar 13, 2023

Nice! You will still have to update the tokenizer in the C++ code quite a bit. I think this is a test prompt to verify it is working:

关于爱因斯坦的生平。他出生于

("About Einstein's life. He was born in")

If not, you can try just this character, repeated, as the prompt:

篇篇篇篇篇篇

(the character 篇, "article", repeated six times)
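Each of these CJK characters encodes to three bytes in UTF-8, so any character that isn't in the vocab as a whole piece has to come out as three single-byte tokens; that is what makes these prompts a good test. A quick illustration (not from the PR's code):

# every character in the test prompts is a 3-byte UTF-8 sequence
for ch in "关于爱因斯坦篇":
    assert len(ch.encode("utf-8")) == 3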

@beiller (Contributor) commented Mar 13, 2023

You will also need to replace spaces in the input text with the Unicode underscore you used in the Python script, so that the tokenizer can find any token containing a space.
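A minimal sketch of the mapping being described, assuming the marker is U+2581 (LOWER ONE EIGHTH BLOCK), the character SentencePiece uses in place of spaces:

# input side: spaces become the SentencePiece marker, so lookups can
# match vocab pieces that contain a "space"
def to_sp(text: str) -> str:
    return text.replace(" ", "\u2581")

# output side: the marker becomes a plain space again
def from_sp(piece: str) -> str:
    return piece.replace("\u2581", " ")

assert to_sp("he llo") == "he\u2581llo"
assert from_sp("\u2581he") == " he"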

@j-f1 (Collaborator, Author) commented Mar 13, 2023

Seems fine to me? (Using Apple Terminal.) The token readout at the start is messed up as expected (since some of the tokens aren’t valid UTF-8 strings), but that’s fine IMO.

关于爱因斯坦的生平。他出生于1946年,但在不同地点时间都有频次改变 了名字( );正确为: 谢克尔斯特·安丁,英国父母姨为自由民主的家族。所以他总是被称作“不同”或者 在其他地方 “外” (abroad)、 也就说:(隐名) ;现在最后正常人生余句牧?

.

篇篇篇篇篇篇 资料:
1、本发明(权利要求一项)阶段的某些实现,可以通过例如所有元件和连接图表来对该目标设计。 For example, the embodiment of (claim one) is achieved at a certain stage invention, can be performed by all elements and connected graphic table to design target. 这些实现可以通过于权利要求一项所规定的表格,将其降?

.

The answer to the question of life, the universe, and everything is 42.
It's also the name given to a small device that can be used by any adventurer as protection from an assortment of dangers encountered while exploring strange new worlds or even just your own backyard. You don’t need much: some rope for climbing and swimming, maybe light-weight armor (maybe?), maybe a sword if you want to be fancy (grin).
This is the only game in town! [end of text]

Compare the 13B model (without my patch):

关于���确���获���������名的事实
中国人民大会常


> You will also need to replace spaces in the input text with the Unicode underscore you used in the Python script, so that the tokenizer can find any token containing a space.

Those underscore things are in the token file, so I’m replacing them with a regular space when constructing the ggml bin file. I don’t think the C++ code needs to be updated to handle that?

@j-f1 mentioned this pull request Mar 13, 2023
@beiller (Contributor) commented Mar 13, 2023

Oh interesting, right you are, that is good. But beware of the input. What does the program report about your input prompt? I would assume your input may be garbled, because this code is unable to find the tokens, and a garbled input can still produce a correct-looking output.

@beiller (Contributor) commented Mar 13, 2023

See here:

std::vector<gpt_vocab::id> gpt_tokenize(const gpt_vocab & vocab, const std::string & text)

The tokenizer in this code can only return one token per string, but you need multiple tokens for a string.

Oh, edit: maybe I'm wrong. Wrong function!

std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::string & text, bool bos)

Maybe it just works!??
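For intuition, here is a naive greedy longest-match tokenizer with a single-byte fallback. It is illustrative only (byte_token_id is a hypothetical helper, and this is not the actual llama_tokenize logic), but it shows why one input string has to map to a sequence of tokens:

def greedy_tokenize(vocab, byte_token_id, text):
    """Longest-match tokenization with single-byte fallback (sketch)."""
    data = text.encode("utf-8")
    ids = []
    i = 0
    while i < len(data):
        match = None
        for j in range(len(data), i, -1):  # longest candidate first
            try:
                piece = data[i:j].decode("utf-8")
            except UnicodeDecodeError:
                continue
            if piece in vocab:
                match = (vocab[piece], j)
                break
        if match:
            ids.append(match[0])
            i = match[1]
        else:
            # no piece matched: fall back to a single-byte <0xXX> token
            ids.append(byte_token_id(data[i]))
            i += 1
    return ids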

@beiller (Contributor) commented Mar 13, 2023

Please check the sequence of tokens. Using the tokenizer I get this, and yours should match (I also get garbled characters at the start; that's a consequence of the other code there):

 29871 -> ' '
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

So

爱 = [234, 139, 180]
坦 = [232, 160, 169]
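Those IDs are just the raw UTF-8 bytes plus a fixed offset of 3, presumably because the byte pieces sit after the <unk>/<s>/</s> entries in the vocab (the offset is inferred from the numbers above, not from documentation):

# 爱 is E7 88 B1 in UTF-8, 坦 is E5 9D A6; adding 3 to each byte value
# reproduces the token IDs shown above
assert [b + 3 for b in "爱".encode("utf-8")] == [234, 139, 180]
assert [b + 3 for b in "坦".encode("utf-8")] == [232, 160, 169]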

@j-f1 (Collaborator, Author) commented Mar 13, 2023

Looks right:

     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '?'
   139 -> '?'
   180 -> '?'
 31570 -> '因'
 31824 -> '斯'
   232 -> '?'
   160 -> '?'
   169 -> '?'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

@beiller (Contributor) commented Mar 13, 2023

That's beautiful, ship it! But now I have to regenerate my models :(

@kharvd (Contributor) commented Mar 13, 2023

I think this might work to avoid using protobuf?

import struct
import sys

# assumes `tokenizer` is a loaded sentencepiece.SentencePieceProcessor
# and `fout` is the ggml model file opened for binary writing
for i in range(32000):
    if tokenizer.is_unknown(i):
        # "<unk>" token (written as " \u2047 ", i.e. a padded ⁇)
        text = " \u2047 ".encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)
    elif tokenizer.is_control(i):
        # "<s>"/"</s>" tokens: write an empty string
        fout.write(struct.pack("i", 0))
    elif tokenizer.is_byte(i):
        # "<0xXX>" byte-fallback tokens (which may not be valid UTF-8 on their own)
        piece = tokenizer.id_to_piece(i)
        if len(piece) != 6:
            print("Invalid token: " + piece)
            sys.exit(1)
        byte_value = int(piece[3:-1], 16)
        fout.write(struct.pack("i", 1))
        fout.write(struct.pack("B", byte_value))
    else:
        # normal token; SentencePiece uses U+2581 (LOWER ONE EIGHTH BLOCK)
        # to represent spaces
        text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
        fout.write(struct.pack("i", len(text)))
        fout.write(text)

I can see that it writes the correct bytes, but my terminal has a hard time handling them for some reason.

@beiller (Contributor) commented Mar 13, 2023

@kharvd yes, that is true. I'm somewhat confused, because sentencepiece itself uses protobuf. Maybe the C++ version compiled into the Python wheel has it built in, which makes it impossible to reach as a subpackage of sentencepiece? Either way, I think that approach also works.

But it's ironic: we don't want to use the sentencepiece C++ library, so instead we will require the sentencepiece Python library 😅

Here is the PR to include sentencepiece C++ library

#66

Happy to close it if we merge this masterpiece. But some questions remain, such as: this model file is now not portable to the webui and other projects. If we used the C++ version, we could have portable model files floating around between the two projects, I think.

@kharvd (Contributor) commented Mar 13, 2023

Oh yeah, I figured out why my terminal was still producing weird characters: the --color option was interfering with it. @j-f1 I'm pretty sure my approach above works, so there's no need to ship a protobuf file.

@kharvd (Contributor) commented Mar 13, 2023

Sample output:

main: prompt: '关于爱因斯坦的生平。他出生于'
main: number of tokens in prompt = 19
     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

sampling parameters: temp = 1.000000, top_k = 100, top_p = 0.900000, repeat_last_n = 64, repeat_penalty = 1.300000


关于爱因斯坦的生平。他出生于一个小让之家在北京, 烧马拉龙逾八十多时间从未有晚上亲公夜光不能看成眼所会,还发现父母身高也只超过五

@beiller (Contributor) commented Mar 13, 2023

@kharvd what model are you using there? The Google Translate of your output appears to be gibberish. I think we need a translator :)

Here are some examples from me with the 16B model:
关于爱因斯坦的生平。他出生于1856年,但是在1930年才获得了诺贝尔奖。
雷恩·拉德森(英语:Raymond Pearl)(1879–1942) 是一位美国科学家和医学教育专家。
Google Translate:
About Einstein's life. He was born in 1856, but won the Nobel Prize in 1930.
Raymond Pearl (1879–1942) was an American scientist and medical educator.

关于爱因斯坦的生平。他出生于1856年,是一位欧洲科学家和教育家。他在1902年获得了诺基丛大学院士学位。
高中英语作文: 美国人物——Albert Einstein
下面为大家提供一篇关于Albert Einstein的作文,希望对同学们能够进行更多的研究。
Google Translate:
About Einstein's life. Born in 1856, he was a European scientist and educator. In 1902 he received a bachelor's degree from Norwich College.
High School English Composition: American Characters - Albert Einstein
The following is a composition about Albert Einstein, hoping to conduct more research on students.

I don't have 7B ready at the moment, but I didn't think it would be that bad?

Here is the Google Translate of your output:
About Einstein's life. He was born in a small family in Beijing. He has burned maralong for more than 80 years and has never had a night when his father-in-law's night light cannot be seen. He also found that his parents are only over five

@kharvd (Contributor) commented Mar 13, 2023

This is 7B

@beiller (Contributor) commented Mar 13, 2023

Ah, no worries. With your settings I also get gibberish at 16B:

sampling parameters: temp = 1.000000, top_k = 100, top_p = 0.900000, repeat_last_n = 64, repeat_penalty = 1.300000

Try

--temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --repeat_penalty 1.17647

@kharvd (Contributor) commented Mar 13, 2023

Here's 13B with default parameters:

main: prompt: '关于爱因斯坦的生平。他出生于'
main: number of tokens in prompt = 19
     1 -> ''
 31057 -> '关'
 30909 -> '于'
   234 -> '�'
   139 -> '�'
   180 -> '�'
 31570 -> '因'
 31824 -> '斯'
   232 -> '�'
   160 -> '�'
   169 -> '�'
 30210 -> '的'
 30486 -> '生'
 30606 -> '平'
 30267 -> '。'
 31221 -> '他'
 30544 -> '出'
 30486 -> '生'
 30909 -> '于'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


关于爱因斯坦的生平。他出生于1856年,就是一位德国化学家、天文学家和温谐器研究者。20世紀最初时期在高飞航母中被发现,爱因斯坦对此使用

"About Einstein's life. Born in 1856, he was a German chemist, astronomer and thermostat researcher. Found in high-flying aircraft carriers in the early 20th century, Einstein used"

@kharvd (Contributor) commented Mar 13, 2023

#79 - Fixes both #11 and color handling

@j-f1 (Collaborator, Author) commented Mar 13, 2023

Closing, #79 is better.

44670 pushed a commit to 44670/llama.cpp that referenced this pull request Aug 2, 2023