tests : add test-tokenizer-0.sh #7036
Conversation
I think there is a bug in the way we handle added tokens. I'm experimenting with DeepSeek-Coder: https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base

{
  "id": 32009,
  "content": "ü",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": true,
  "special": false
},

If I remove this added token from the [...], the problem goes away. Any ideas how to fix this?
I believe the issue is this: added tokens are always looked up first.

AFAICT the only way to fix this is to add the added tokens to the GGUF separately, which will be especially complicated if the added tokens are merged into the middle of the existing vocab (otherwise just adding an index to the beginning of the added tokens would be enough).
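To make the lookup-order point concrete, here is a minimal sketch (my own names and structure, not llama.cpp's actual code) of the "added tokens first" strategy: the input is partitioned on added-token strings before the regular tokenizer ever runs, so every literal occurrence of the added token's content is claimed by it.

import re

# content -> id, taken from the DeepSeek-Coder example above
ADDED_TOKENS = {"ü": 32009}

def tokenize(text: str, bpe_tokenize) -> list[int]:
    # split the text on added-token strings first (the capturing group keeps them)
    pattern = "(" + "|".join(re.escape(t) for t in ADDED_TOKENS) + ")"
    tokens: list[int] = []
    for piece in re.split(pattern, text):
        if piece in ADDED_TOKENS:
            tokens.append(ADDED_TOKENS[piece])   # the added token always wins
        elif piece:
            tokens.extend(bpe_tokenize(piece))   # everything else goes to regular BPE
    return tokens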
From what I've found, the problem seems to be that as part of the pre-tokenization we perform a byte-to-unicode mapping here:

Lines 213 to 217 in 3275e60

Lines 151 to 173 in 3275e60

This converts the string "ü" into its byte-to-unicode form "ü", which no longer matches the added token.
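A quick standalone check of that conversion (a sketch of mine that uses the same visible-byte ranges as the simplified bytes_to_unicode() shown further down, not llama.cpp's code):

def byte_to_unicode_char(b: int) -> str:
    # GPT-2 style: printable bytes map to themselves, the rest are shifted past 255
    visible = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    return chr(b) if b in visible else chr(b + 256)

raw = "ü"                                                    # content of added token 32009
mapped = "".join(byte_to_unicode_char(b) for b in raw.encode("utf-8"))
print(repr(raw), "->", repr(mapped))                         # 'ü' -> 'Ã¼'

The UTF-8 bytes of "ü" are 0xC3 0xBC, both of which fall in the "visible" ranges, so they map to 'Ã' and '¼' and the exact string "ü" never appears in the pre-tokenized text.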
Why is the upper limit set to 256? Isn't that just the byte range? The range of valid Unicode code points goes from U+0000 to U+10FFFF, which covers more than a million possible code points.
This seems to be some strategy to reduce the vocab size: https://github.com/openai/gpt-2/blob/master/src/encoder.py#L8-L28
I find it fascinating how we have a tendency to over-complicate simple ideas. I'm all too guilty of this myself.

from functools import lru_cache


# simplified function definition
@lru_cache()
def bytes_to_unicode(size: int = 256) -> dict[int, str]:
    """
    This function generates a dictionary mapping each byte to its corresponding Unicode character.

    :param size: The total number of bytes in the encoding space (default is 256, the full byte range).
    :return: A dictionary containing mappings between bytes and their respective Unicode characters.
    """
    # list of visible characters:
    # (ord("!"), ord("~") + 1); (ord("¡"), ord("¬") + 1); (ord("®"), ord("ÿ") + 1)
    visible = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    mapping: dict[int, str] = {}
    for byte in range(size):
        # convert "visible" characters as-is
        if byte in visible:
            mapping[byte] = chr(byte)
        # translate and convert non-printable characters
        else:
            mapping[byte] = chr(byte + size)
    return mapping

where the upper limit can be defined as an argument.

Get the mapping and compare it against the GPT-2 implementation:

mapping = bytes_to_unicode()
gpt_mapping = gpt_bytes_to_unicode()  # the original implementation from the linked encoder.py
for key in mapping.keys():
    assert mapping[key] == gpt_mapping[key]

from pprint import pprint  # pretty print output
pprint(mapping)

Mapping output:

18:30:48 | ~
λ python -i /tmp/bytes_to_unicode.py
{0: 'Ā',
1: 'ā',
2: 'Ă',
3: 'ă',
4: 'Ą',
5: 'ą',
6: 'Ć',
7: 'ć',
8: 'Ĉ',
9: 'ĉ',
10: 'Ċ',
11: 'ċ',
12: 'Č',
13: 'č',
14: 'Ď',
15: 'ď',
16: 'Đ',
17: 'đ',
18: 'Ē',
19: 'ē',
20: 'Ĕ',
21: 'ĕ',
22: 'Ė',
23: 'ė',
24: 'Ę',
25: 'ę',
26: 'Ě',
27: 'ě',
28: 'Ĝ',
29: 'ĝ',
30: 'Ğ',
31: 'ğ',
32: 'Ġ',
33: '!',
34: '"',
35: '#',
36: '$',
37: '%',
38: '&',
39: "'",
40: '(',
41: ')',
42: '*',
43: '+',
44: ',',
45: '-',
46: '.',
47: '/',
48: '0',
49: '1',
50: '2',
51: '3',
52: '4',
53: '5',
54: '6',
55: '7',
56: '8',
57: '9',
58: ':',
59: ';',
60: '<',
61: '=',
62: '>',
63: '?',
64: '@',
65: 'A',
66: 'B',
67: 'C',
68: 'D',
69: 'E',
70: 'F',
71: 'G',
72: 'H',
73: 'I',
74: 'J',
75: 'K',
76: 'L',
77: 'M',
78: 'N',
79: 'O',
80: 'P',
81: 'Q',
82: 'R',
83: 'S',
84: 'T',
85: 'U',
86: 'V',
87: 'W',
88: 'X',
89: 'Y',
90: 'Z',
91: '[',
92: '\\',
93: ']',
94: '^',
95: '_',
96: '`',
97: 'a',
98: 'b',
99: 'c',
100: 'd',
101: 'e',
102: 'f',
103: 'g',
104: 'h',
105: 'i',
106: 'j',
107: 'k',
108: 'l',
109: 'm',
110: 'n',
111: 'o',
112: 'p',
113: 'q',
114: 'r',
115: 's',
116: 't',
117: 'u',
118: 'v',
119: 'w',
120: 'x',
121: 'y',
122: 'z',
123: '{',
124: '|',
125: '}',
126: '~',
127: 'ſ',
128: 'ƀ',
129: 'Ɓ',
130: 'Ƃ',
131: 'ƃ',
132: 'Ƅ',
133: 'ƅ',
134: 'Ɔ',
135: 'Ƈ',
136: 'ƈ',
137: 'Ɖ',
138: 'Ɗ',
139: 'Ƌ',
140: 'ƌ',
141: 'ƍ',
142: 'Ǝ',
143: 'Ə',
144: 'Ɛ',
145: 'Ƒ',
146: 'ƒ',
147: 'Ɠ',
148: 'Ɣ',
149: 'ƕ',
150: 'Ɩ',
151: 'Ɨ',
152: 'Ƙ',
153: 'ƙ',
154: 'ƚ',
155: 'ƛ',
156: 'Ɯ',
157: 'Ɲ',
158: 'ƞ',
159: 'Ɵ',
160: 'Ơ',
161: '¡',
162: '¢',
163: '£',
164: '¤',
165: '¥',
166: '¦',
167: '§',
168: '¨',
169: '©',
170: 'ª',
171: '«',
172: '¬',
173: 'ƭ',
174: '®',
175: '¯',
176: '°',
177: '±',
178: '²',
179: '³',
180: '´',
181: 'µ',
182: '¶',
183: '·',
184: '¸',
185: '¹',
186: 'º',
187: '»',
188: '¼',
189: '½',
190: '¾',
191: '¿',
192: 'À',
193: 'Á',
194: 'Â',
195: 'Ã',
196: 'Ä',
197: 'Å',
198: 'Æ',
199: 'Ç',
200: 'È',
201: 'É',
202: 'Ê',
203: 'Ë',
204: 'Ì',
205: 'Í',
206: 'Î',
207: 'Ï',
208: 'Ð',
209: 'Ñ',
210: 'Ò',
211: 'Ó',
212: 'Ô',
213: 'Õ',
214: 'Ö',
215: '×',
216: 'Ø',
217: 'Ù',
218: 'Ú',
219: 'Û',
220: 'Ü',
221: 'Ý',
222: 'Þ',
223: 'ß',
224: 'à',
225: 'á',
226: 'â',
227: 'ã',
228: 'ä',
229: 'å',
230: 'æ',
231: 'ç',
232: 'è',
233: 'é',
234: 'ê',
235: 'ë',
236: 'ì',
237: 'í',
238: 'î',
239: 'ï',
240: 'ð',
241: 'ñ',
242: 'ò',
243: 'ó',
244: 'ô',
245: 'õ',
246: 'ö',
247: '÷',
248: 'ø',
249: 'ù',
250: 'ú',
251: 'û',
252: 'ü',
253: 'ý',
254: 'þ',
255: 'ÿ'}
>>>

I'm looking into it though.
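For completeness, the mapping is trivially invertible, which is what makes this byte-level scheme lossless. A small sketch (again mine, not llama.cpp code) that reuses the bytes_to_unicode() above:

def unicode_to_bytes(size: int = 256) -> dict[str, int]:
    # flip the byte -> character table into character -> byte
    return {char: byte for byte, char in bytes_to_unicode(size).items()}

inverse = unicode_to_bytes()
mapped = "Ã¼"                                     # byte-to-unicode form of "ü"
restored = bytes(inverse[ch] for ch in mapped).decode("utf-8")
print(repr(restored))                             # 'ü'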
* tests : add test-tokenizer-0.sh
* unicode : add all unicode number ranges
* starcoder : fix pre-tokenizer
* tests : add test that fails with DeepSeek tokenizers
* falcon : fix regex
* unicode : regenerate unicode tables
* refact : add tokenizer model
* lint : fix
* tests : disable failing tests ggml-ci
* refact : add tests files ggml-ci
* convert : print -> logging ggml-ci
* lint : fix
* unicode : digit -> number
* phi-3 : update
Hi, I found that the current llama.cpp cannot pass the unit tests for the DeepSeek models. The problem you mentioned looks like the issue that huggingface/tokenizers solved in huggingface/tokenizers#1392. For the newly published DeepSeek-V2 and DeepSeek-Coder V1.5, these added tokens are removed.
Add a more extensive tokenizer test that takes a text file, tokenizes it using transformers and llama.cpp, and compares the results; a rough sketch of that comparison flow follows below.

Need to find the reason why the tokenization differs in the Fail cases, for example with the DeepSeek models.
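Here is a sketch of what that kind of comparison can look like (this is not the actual test script from the PR; it assumes the llama.cpp token ids have already been dumped to a file, one id per line, by whatever harness runs the C++ side):

from transformers import AutoTokenizer

def compare_tokenizations(model_id: str, text_path: str, llama_ids_path: str) -> bool:
    text = open(text_path, encoding="utf-8").read()
    hf_ids = AutoTokenizer.from_pretrained(model_id).encode(text, add_special_tokens=False)
    llama_ids = [int(line) for line in open(llama_ids_path) if line.strip()]
    if hf_ids == llama_ids:
        print("OK")
        return True
    # report the first mismatch to make the failing case easier to inspect
    for i, (a, b) in enumerate(zip(hf_ids, llama_ids)):
        if a != b:
            print(f"Fail: first mismatch at position {i}: {a} vs {b}")
            break
    else:
        print(f"Fail: same prefix, but lengths differ ({len(hf_ids)} vs {len(llama_ids)})")
    return False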
Added a script for generating the Unicode ranges in unicode-data.cpp.
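As a rough illustration of the idea behind such a generator (my own sketch, not the actual script; it relies on Python's unicodedata module, so the resulting ranges depend on the Unicode version shipped with the interpreter), collapsing all code points of a given category into contiguous ranges could look like this:

import unicodedata

def codepoint_ranges(predicate) -> list[tuple[int, int]]:
    # collapse all code points matching `predicate` into inclusive (start, end) ranges
    ranges: list[tuple[int, int]] = []
    start = None
    for cp in range(0x110000):
        if predicate(chr(cp)):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, 0x10FFFF))
    return ranges

# all "number" code points (general categories Nd, Nl, No), from which a C++ table
# like the one in unicode-data.cpp could then be generated
number_ranges = codepoint_ranges(lambda ch: unicodedata.category(ch).startswith("N"))
print(len(number_ranges), number_ranges[:3])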