"ï¸ı" is causing chktok to mismatch when using chkhsh #7024
Same problem: #7022
@arch-btw What are the sources? Need the links to investigate.
Thank you @teleprint-me, the sources are: https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat and
I have to go to work—have a double today—but I'll check it out between shifts.
I don't understand - which comma is this related to?
@ggerganov The arrays in the output are not equal. There are missing tokens.
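As a side note, a quick way to locate where two chktok arrays diverge (a hypothetical helper, not part of the repo):

```python
# Hypothetical helper: report the first index where two token-ID lists diverge,
# or None when they are identical.
def first_mismatch(a: list[int], b: list[int]) -> int | None:
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))

# Example: the second list is missing a token, so they diverge at index 2.
print(first_mismatch([1212, 4824, 1001], [1212, 4824]))  # -> 2
```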
@ggerganov Thanks, yes, I'm using the same tokenizers. It's possible that it's not the comma (see below), but something else is causing it. I meant this symbol here; it's some unusual type of comma:
Ok, I think I'm getting a bit closer: it's actually the character after the comma. https://en.wikipedia.org/wiki/Dotted_and_dotless_I_in_computing?lang=en
Update: the ı symbol is not part of Latin-1 as used here: https://github.com/ggerganov/llama.cpp/blob/master/convert-hf-to-gguf.py#L1805C71-L1805C78
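That is easy to confirm in Python (a minimal sketch, independent of the conversion script): ı is U+0131, which falls outside the 0x00-0xFF range that Latin-1 can represent, while UTF-8 encodes it without trouble.

```python
ch = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I
print(f"U+{ord(ch):04X}")  # U+0131 -- above 0xFF, so outside Latin-1

try:
    ch.encode("latin-1")
except UnicodeEncodeError as err:
    print("latin-1 cannot encode it:", err)

print(ch.encode("utf-8"))  # b'\xc4\xb1' -- round-trips fine in UTF-8
```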
@arch-btw What OS are you using when you attempt to convert?
@teleprint-me, attempting to convert using Arch Linux. Do you think it might be OS-related?
@arch-btw I didn't want to assume you were using Arch (I am a fellow Arch user, btw ;). I know encoding issues happen a lot on Windows, occasionally on Mac OS X, and on Linux it's distribution-dependent. Whenever I've had issues with encodings in Arch, it's because I was missing a dependency. I don't know what's going on here, though. I think I'm going to need to download the model and try it out for myself.
I can't reproduce it.

```
23:14:27 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(add-stablelm-hash | Δ) λ python convert-hf-to-gguf.py /mnt/valerie/models/tiiuae/falcon-7b
Loading model: falcon-7b
gguf: This GGUF file is for Little Endian only
Set model parameters
Set model tokenizer
chktok: [1212, 4824, 1001, 1212, 192, 204, 663, 49453, 2069, 742, 561, 1501, 193, 2571, 232, 206, 204, 19, 11003, 20, 8196, 126, 283, 219, 48778, 116, 13392, 204, 19, 51831, 732, 63209, 1741, 7955, 522, 20, 22438, 211, 3346, 111, 231, 2571, 111, 231, 204, 30, 204, 3138, 204, 22287, 204, 22287, 30, 204, 22287, 3138, 204, 22287, 22287, 204, 22287, 22287, 30, 204, 22287, 22287, 3138, 204, 30, 25, 30, 204, 30, 513, 30, 204, 30, 951, 30, 27171, 236, 206, 38154, 126, 38154, 225, 167, 237, 217, 38154, 221, 167, 237, 208, 38154, 228, 38154, 127, 38154, 237, 167, 237, 207, 38154, 237, 38154, 107, 38154, 126, 38154, 211, 20589, 207, 204, 42, 50087, 123, 2727, 20300, 32022, 133, 234, 17419, 30137, 28, 7858, 181, 133, 236, 204, 37057, 2228, 10666, 5052, 133, 6207, 151, 215, 150, 134, 5052, 133, 6279, 5052, 223, 151, 216, 49679, 123, 53110, 47043, 7795, 204, 7544, 7544, 7544, 8543, 8543, 17593, 3513, 3513, 12844, 51520, 17664, 4247, 295, 18, 298, 650, 204, 18, 95, 693, 332, 18, 94, 629, 23, 204, 18, 1553, 299, 1310, 42, 204, 18, 56, 416, 1310, 295, 18, 567, 717, 334, 23, 204, 18, 47, 299, 606, 596, 6696, 42, 703, 18, 16139, 241, 18, 87, 55]
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
tokenizer.ggml.pre: falcon
chkhsh: 8aeee3860c56296a157a1fe2fad249ec40aa59b1bb5709f4ade11c4e6fe652ed
gguf: Adding 64784 merge(s).
gguf: Setting special token type eos to 11
gguf: Setting special token type bos to 11
Exporting model to '/mnt/valerie/models/tiiuae/falcon-7b/ggml-model-f16.gguf'
gguf: loading model part 'pytorch_model-00001-of-00002.bin'
token_embd.weight, n_dims = 2, torch.bfloat16 --> float16
blk.0.attn_norm.weight, n_dims = 1, torch.bfloat16 --> float32
```

I get the expected hash. I am using my PR, though; not sure if that has anything to do with it. Something you might be able to try is the following:
```
rm -rf models/tokenizers  # 1. be careful here
python convert-hf-to-gguf-update.py 'read-api-token'  # 2.
python3 convert-hf-to-gguf.py models/tokenizers/falcon/ --outfile models/ggml-vocab-falcon.gguf --vocab-only  # 3.
python convert-hf-to-gguf.py /path/to/models/tiiuae/falcon-7b  # 4.
```

I only needed to do steps 1 and 4; I had to do steps 2 and 3 to make sure my PR was working. I suggest this because I ran into issues the first time I tried it, and I only got it to work after sanitizing the environment. Arch recently upgraded to Python 3.12, which caught me off guard because I've been so swamped, so I had to clear the venv and start fresh.
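For context on what the mismatch means: the chkhsh fingerprint is derived from the token IDs produced for a fixed test string, so a single lost or shifted token changes the entire digest. A minimal sketch of that logic (simplified; the real code lives in convert-hf-to-gguf.py):

```python
from hashlib import sha256

# Sketch: encode a fixed checksum text with the model's tokenizer, then hash
# the resulting token-ID list (the chktok shown above) to identify the
# pre-tokenizer.
def fingerprint(tokenizer, chktxt: str) -> str:
    chktok = tokenizer.encode(chktxt)  # list of token IDs
    return sha256(str(chktok).encode()).hexdigest()
```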
Thank you @teleprint-me, I carefully followed all your steps and now it's working.
I think the comma is not being escaped.

- convert-hf-to-gguf.py: loses the token
- convert-hf-to-gguf-update.py: adds the token correctly

Related: #7018
Tested with Falcon and Qwen2, both fail on the same token.
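A quick check against a specific tokenizer (a hypothetical repro sketch; assumes the transformers package is installed and the Hugging Face model is reachable):

```python
from transformers import AutoTokenizer

# Hypothetical repro: does the suspect character survive an encode/decode
# round trip? A lossy round trip would point at the token being dropped.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Chat")
text = "ı"  # U+0131, the character discussed above
ids = tok.encode(text, add_special_tokens=False)
print(ids, repr(tok.decode(ids)))
```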