
Decoding with BertWordPieceTokenizer doesn't combine subword tokens #145

Closed
tomhosking opened this issue Feb 13, 2020 · 5 comments · Fixed by #147

tomhosking commented Feb 13, 2020

The BertWordPieceTokenizer doesn't seem to correctly combine subword tokens when decoding. For example:

from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer('./data/bert-vocabs/bert-base-cased-vocab.txt', lowercase=False)
output = tokenizer.encode('The word juggler is not in the vocab')
print(tokenizer.decode(output.ids))

The word j ##ug ##gler is not in the v ##oc ##ab

I would have expected this operation to be reversible, returning the original input string with the subword tokens combined back into whole words.

Using Python 3.6.9, tokenizers==0.4.2
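
For reference, this is roughly the merge I was expecting (just an illustrative sketch; merge_wordpieces is my own helper, not part of the tokenizers API):

def merge_wordpieces(tokens):
    # Glue pieces carrying the standard "##" WordPiece continuation prefix
    # back onto the previous word.
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(merge_wordpieces(["The", "word", "j", "##ug", "##gler", "is", "not", "in", "the", "v", "##oc", "##ab"]))
# The word juggler is not in the vocab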


tomhosking commented Feb 13, 2020

In fact, decoding also behaves differently from transformers==2.3.0 around punctuation, even though the token ids are the same:

from tokenizers import BertWordPieceTokenizer
tok = BertWordPieceTokenizer('./data/bert-vocabs/bert-base-cased-vocab.txt', lowercase=False)
output = tok.encode("Hello, y'all! How are you?")
print(output.ids)
print(tok.decode(output.ids))

[101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 136, 102]
Hello , y ' all ! How are you ?

from transformers import BertTokenizer
old_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
output = old_tokenizer.encode("Hello, y'all! How are you?")
print(output)
print(old_tokenizer.decode(output, skip_special_tokens=True))

[101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 136, 102]
Hello, y'all! How are you?


n1t0 commented Feb 14, 2020

Hi @tomhosking! Thank you for this bug report! The first part was indeed a bug caused by a typo that was fixed here.

Regarding the second part, there is indeed a final step in the decoding that we do in transformers but not yet in tokenizers. We are in the process of adding it with #147.

Both of these fixes should be included in the next release, which you can expect probably tomorrow or early next week.
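
In the meantime, the idea behind that last step is simply to undo the spaces that the plain whitespace join leaves around punctuation and contractions. A rough sketch (illustrative only, not the actual code in transformers or in #147):

import re

def cleanup(text):
    # Remove spaces before common punctuation: "you ?" -> "you?"
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Remove spaces around apostrophes: "y ' all" -> "y'all"
    text = re.sub(r"\s*'\s*", "'", text)
    return text

print(cleanup("Hello , y ' all ! How are you ?"))
# Hello, y'all! How are you?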

@tomhosking (Author)

Awesome, thanks for the speedy response! And great work on the library as a whole, the offsets in particular are super useful!

@stefan-it

Hi @n1t0, does this bug also affect newly created BERT vocabs (using the training script)? E.g. I created a BERT-compatible vocab for the upcoming Turkish BERT model 🤔


n1t0 commented Feb 14, 2020

Hi @stefan-it. These issues shouldn't have any effect on you. Only the .decode() part was affected: subword prefixes were not being removed, so words were not reconstructed into their original form. It does not affect vocabularies at all.
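
To make that concrete, only the string reconstruction in .decode() changes with the fix; encoding with your trained vocab gives the same ids and tokens either way. A toy check (the vocab path here is hypothetical):

from tokenizers import BertWordPieceTokenizer

tok = BertWordPieceTokenizer('./my-turkish-vocab.txt', lowercase=False)
enc = tok.encode("Merhaba dünya")
print(enc.ids)              # unchanged by the fix
print(enc.tokens)           # unchanged by the fix
print(tok.decode(enc.ids))  # only this output changes with the fix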

n1t0 closed this as completed in #147 on Feb 14, 2020