"ï¸ı" is causing chktok to mismatch when using chkhsh #7024
Same problem: #7022
@arch-btw What are the sources? Need the links to investigate.
Thank you @teleprint-me, the sources are: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat and
I have to go to work—have a double today—but I'll check it out between shifts.
I don't understand - which comma is this related to?
@ggerganov The arrays in the output are not equal. There are missing tokens.
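As a side note, a quick way to locate where two chktok arrays diverge (a hypothetical helper, not part of the repo):

```python
# Hypothetical helper: report the first index where two token-ID lists diverge,
# or None when they are identical.
def first_mismatch(a: list[int], b: list[int]) -> int | None:
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Example: the second list is missing a token, so they diverge at index 2.
print(first_mismatch([1212, 4824, 1001], [1212, 4824]))  # -> 2
```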
@ggerganov Thanks, yes, I'm using the same tokenizers. It's possible that it's not the comma (see below), but something else is causing it. I meant this symbol here; it's some unusual type of comma:
Ok, I think I'm getting a bit closer: it's actually the character after the comma. https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing?lang=en
Update: the ı symbol is not part of Latin-1 as used here: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py#L1805C71-L1805C78
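That is easy to confirm in Python (a minimal sketch, independent of the conversion script): ı is U+0131, which falls outside the 0x00-0xFF range that Latin-1 can represent, while UTF-8 encodes it without trouble.

```python
ch = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
print(f"U+{ord(ch):04X}")  # U+0131 -- above 0xFF, so outside Latin-1

try:
    ch.encode("latin-1")
except UnicodeEncodeError as err:
    print("latin-1 cannot encode it:", err)

print(ch.encode("utf-8"))  # b'\xc4\xb1' -- round-trips fine in UTF-8
```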
@arch-btw What OS are you using when you attempt to convert?
@teleprint-me, attempting to convert using Arch Linux. Do you think it might be OS-related?
@arch-btw I didn't want to assume you were using Arch (I am a fellow Arch user, btw ;). I know encoding issues happen a lot on Windows, occasionally on Mac OS X, and on Linux it's distribution-dependent. Whenever I've had issues with encodings in Arch, it's because I was missing a dependency. I don't know what's going on here, though. I think I'm going to need to download the model and try it out for myself.
I can't reproduce it.

```
23:14:27 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ python convert-hf-to-gguf.py /mnt/valerie/models/tiiuae/falcon-7b
Loading model: falcon-7b
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
chktok: [1212, 4824, 1001, 1212, 192, 204, 663, 49453, 2069, 742, 561, 1501, 193, 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 3346, 111, 231, 2571, 111, 231, 204, 30, 204, 3138, 204, 22287, 204, 22287, 30, 204, 22287, 3138, 204, 22287, 22287, 204, 22287, 22287, 30, 204, 22287, 22287, 3138, 204, 30, 25, 30, 204, 30, 513, 30, 204, 30, 951, 30, 27171, 236, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 20589, 207, 204, 42, 50087, 123, 2727, 20300, 32022, 133, 234, 17419, 30137, 28, 7858, 181, 133, 236, 204, 37057, 2228, 10666, 5052, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795, 204, 7544, 7544, 7544, 8543, 8543, 17593, 3513, 3513, 12844, 51520, 17664, 4247, 295, 18, 298, 650, 204, 18, 95, 693, 332, 18, 94, 629, 23, 204, 18, 1553, 299, 1310, 42, 204, 18, 56, 416, 1310, 295, 18, 567, 717, 334, 23, 204, 18, 47, 299, 606, 596, 6696, 42, 703, 18, 16139, 241, 18, 87, 55]
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
tokenizer.ggml.pre: falcon
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
gguf: Adding 64784 merge(s).
gguf: Setting special token type eos to 11
gguf: Setting special token type bos to 11
Exporting model to '/mnt/valerie/models/tiiuae/falcon-7b/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model-00001-of-00002.bin'
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32
```

I get the expected hash. I am using my PR, though; not sure if that has anything to do with it. Something you might be able to try is the following:
```
rm -rf models/tokenizers  # 1. be careful here
python convert-hf-to-gguf-update.py 'read-api-token'  # 2.
python3 convert-hf-to-gguf.py models/tokenizers/falcon/ --outfile models/ggml-vocab-falcon.gguf --vocab-only  # 3.
python convert-hf-to-gguf.py /path/to/models/tiiuae/falcon-7b  # 4.
```

I only needed to do steps 1 and 4; I had to do steps 2 and 3 to make sure my PR was working. I suggest this because I ran into issues the first time I tried it, and I only got it to work after sanitizing the environment. Arch recently upgraded to Python 3.12, which caught me off guard because I've been so swamped, so I had to clear the venv and start fresh.
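For context on what the mismatch means: the chkhsh fingerprint is derived from the token IDs produced for a fixed test string, so a single lost or shifted token changes the entire digest. A minimal sketch of that logic (simplified; the real code lives in convert-hf-to-gguf.py):

```python
from hashlib import sha256

# Sketch: encode a fixed checksum text with the model's tokenizer, then hash
# the resulting token-ID list (the chktok shown above) to identify the
# pre-tokenizer.
def fingerprint(tokenizer, chktxt: str) -> str:
    chktok = tokenizer.encode(chktxt)  # list of token IDs
    return sha256(str(chktok).encode()).hexdigest()
```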
Thank you @teleprint-me, I carefully followed all your steps and now it's working.
I think the comma is not being escaped.

- convert-hf-to-gguf.py: loses the token
- convert-hf-to-gguf-update.py: adds the token correctly

Related: #7018
Tested with Falcon and Qwen2, both fail on the same token.
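A quick check against a specific tokenizer (a hypothetical repro sketch; assumes the transformers package is installed and the Hugging Face model is reachable):

```python
from transformers import AutoTokenizer

# Hypothetical repro: does the suspect character survive an encode/decode
# round trip? A lossy round trip would point at the token being dropped.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
text = "ı"  # U+0131, the character discussed above
ids = tok.encode(text, add_special_tokens=False)
print(ids, repr(tok.decode(ids)))
```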