[Bug]? how does the tokenizer encode the special tokens? #23851

Closed
2 of 4 tasks
vpegasus opened this issue May 30, 2023 · 4 comments · Fixed by #23909
Comments

@vpegasus

System Info

transformers version 4.28.1

Who can help?

@ArthurZucker
Hi, maybe the following issue should be asked here?
[Bug]? how does the tokenizer encode the special tokens? #1263

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hi all, I used the tokenizer to process data for a LLaMA model (already converted to HF format) and set:

from transformers import AutoTokenizer

# llama_model_id points to a LLaMA checkpoint already converted to the HF format
tokenizer = AutoTokenizer.from_pretrained(llama_model_id, model_max_length=1024, padding_side='right',
                                          trust_remote_code=True)
tokenizer.add_special_tokens(
    {
        "eos_token": "</s>",
        "bos_token": "</s>",
        "unk_token": "</s>",
    })
tokenizer.pad_token = tokenizer.eos_token

When tokenizing a piece of text with an eos_token appended:

tokenizer(['ASSISTANT: Hello!</s>']) # there is no space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 829, 29879, 29958]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

The eos_token </s> is encoded as 829, 29879, 29958, which means </s> is treated as the three pieces </, s, and >.

tokenizer(['ASSISTANT: Hello! </s>'])  # there is a space between ! and </s>.
output:
{'input_ids': [[1, 319, 1799, 9047, 13566, 29901, 15043, 29991, 2]],
  'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0]],
  'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1]]}

This time, </s> is encoded correctly (token id 2).

As described above, does this mean we should add a space between the text and the eos_token? However, I find that many popular projects such as Alpaca concatenate the text with the eos_token without a space.

I previously thought the tokenizer encoded text in a greedy style, so the eos_token would be encoded correctly with or without a space. However, the experiments above do not seem to support that.

Could anyone help me if there is something I have misunderstood? Thanks.
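
For reference, the workaround I am experimenting with (just a sketch of my own, not something from the docs) is to append tokenizer.eos_token_id to the already-encoded ids instead of concatenating the literal string </s>:

# Sketch: skip the string-level "</s>" and append the eos id directly.
# Assumes `tokenizer` is the LLaMA tokenizer configured above.
text = 'ASSISTANT: Hello!'
ids = tokenizer(text)['input_ids']     # per the output above: [1, 319, 1799, 9047, 13566, 29901, 15043, 29991]
ids = ids + [tokenizer.eos_token_id]   # append </s> as id 2, no space needed
print(ids)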


After some other experiments, I found something weird:

tokenizer('我是谁')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132] 

1 is the bos_token_id, and 29871 is the token id of the prefix space (the SentencePiece '▁' piece).

tokenizer('我是谁</s>')
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 829, 29879, 29958]

tokenizer('who are you</s>')
output:
'input_ids': [1, 1058, 526, 366, 829, 29879, 29958] # there is no 29871.

When adding a space between the text and </s>:

tokenizer('我是谁 </s>') 
output:
'input_ids': [1, 29871, 30672, 30392, 235, 179, 132, 2] # the `</s>` is encoded correctly

When decoding [1, 29871, 30672, 30392, 235, 179, 132, 2]:

tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 2])
output:
'<s> 我是谁</s>' 

the space is ignored!

When manually adding token id 29871:

tokenizer.decode([1, 29871, 30672, 30392, 235, 179, 132, 29871, 2])
output:
'<s> 我是谁 </s>' 

This time, there is a space between 我是谁 and </s>.

Do the experiments above mean that the encode and decode methods are not exactly inverse (reversible) operations?
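
To make the question concrete, here is a minimal round-trip sketch (assuming the same tokenizer as above):

# Encode then decode and compare with the original string.
text = '我是谁 </s>'
ids = tokenizer(text)['input_ids']   # per the output above: [1, 29871, 30672, 30392, 235, 179, 132, 2]
round_trip = tokenizer.decode(ids)   # per the output above: '<s> 我是谁</s>'
print(round_trip == '<s> ' + text)   # False: the space before </s> is lost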

Expected behavior

Do the experiments above show bugs? If not, how should I understand them? Thanks.

@jiangwangyi
Contributor

#23818

@vpegasus
Author

#23818

@jiangwy99
Thanks very much. When setting use_fast=False, it indeed encodes correctly, whether or not the space exists.

However,

tokenizer(['who are you', '你是谁'])
output:
[
[1, 1058, 526, 366], 
[1, 29871, 30919, 30392, 235, 179, 132]
]

The space in front of the Chinese characters still exists.
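
To see where that extra id comes from, a small inspection sketch (my own check; I assume 29871 is the SentencePiece prefix-space piece '▁'):

# Map the ids back to pieces to check the leading token.
ids = tokenizer('你是谁')['input_ids']       # [1, 29871, 30919, 30392, 235, 179, 132]
print(tokenizer.convert_ids_to_tokens(ids))  # expecting '▁' (the prefix space) as the second piece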

@jiangwangyi
Contributor


That's quite a problem. Your analysis of the tokenizer problems is more comprehensive than mine, and I look forward to these issues being resolved.

@ArthurZucker
Collaborator

Hey, I basically answered in #23818; this is pretty much the same issue.
