[Bug]? how does the tokenizer encode the special tokens? #23851
Comments
@jiangwy99 However, tokenizer(['who are you', '你是谁'])
outputs:
[
[1, 1058, 526, 366],
[1, 29871, 30919, 30392, 235, 179, 132]
]
Note the space token (29871) prepended to the second sentence.
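(For reference, a minimal sketch that reproduces the batch output above; the checkpoint path is hypothetical, and the ids are the ones quoted in this comment.)

```python
from transformers import AutoTokenizer

# Hypothetical local path to an HF-converted LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained("./llama-7b-hf")

batch = tokenizer(["who are you", "你是谁"])
print(batch["input_ids"])
# As quoted above:
# [[1, 1058, 526, 366],
#  [1, 29871, 30919, 30392, 235, 179, 132]]
# 1 is the bos token; 29871 (a space piece) is prepended only to the
# Chinese sentence.
```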
That's quite a problem. Your analysis of the tokenizer's problems is more comprehensive than mine, and I look forward to seeing these issues resolved.
Hey, I basically answered in #23818; this is pretty much the same.
System Info
transformers version: 4.28.1
Who can help?
@ArthurZucker
Hi, maybe the following issue should be asked here as well:
[Bug]? how does the tokenizer encode the special tokens? #1263
Reproduction
Hi all, I used the tokenizer to process data for the llama model (already converted to HF format). When I tokenize a piece of text with the eos_token appended directly, the eos_token </s> is encoded to 829, 29879, 29958, which means </s> is treated as </, s, and >. When a space is added before it, </s> is encoded correctly (token id is 2).

Given the above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects, like Alpaca, concatenate text with the eos_token without a space. I previously thought the tokenizer encodes text in a greedy style, so the eos_token would be encoded correctly with or without a space, but the experiments above do not seem to support that. Could anyone tell me if I am misunderstanding something? Thanks.
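A minimal sketch of the reproduction described above (the checkpoint path is hypothetical; the token ids are the ones reported in this issue):

```python
from transformers import AutoTokenizer

# Hypothetical local path to an HF-converted LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained("./llama-7b-hf")

# eos_token concatenated directly: "</s>" is split into plain pieces.
print(tokenizer.encode("who are you</s>"))
# ends with ... 829, 29879, 29958  ("</", "s", ">")

# eos_token preceded by a space: "</s>" is encoded as token id 2.
print(tokenizer.encode("who are you </s>"))
# ends with ... 2
```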
After some other experiments, I found something weird. 1 is the bos_token_id, and 29871 is the token id of the space, which appears when a space is added between 谁 and </s>. But when I decode [1, 29871, 30672, 30392, 235, 179, 132, 2], the space is ignored! When I manually add token id 29871 back, this time there is a space between 谁 and </s> in the decoded text. Do these experiments mean that the encode and decode methods are not completely reversible (i.e., not exact inverses of each other)?
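A minimal sketch of the decode experiments above (same hypothetical checkpoint; the ids are the ones reported in this issue):

```python
from transformers import AutoTokenizer

# Hypothetical local path to an HF-converted LLaMA checkpoint.
tokenizer = AutoTokenizer.from_pretrained("./llama-7b-hf")

ids = [1, 29871, 30672, 30392, 235, 179, 132, 2]

# As reported above, the space contributed by 29871 is dropped here:
text = tokenizer.decode(ids, skip_special_tokens=True)
print(repr(text))

# Re-encoding the decoded text does not reproduce `ids`, i.e. encode
# and decode are not exact inverses in this case.
print(tokenizer.encode(text))
```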
Expected behavior
Do the experiments above show bugs? If not, how should I understand this behavior? Thanks.