
Decoding with BertWordPieceTokenizer doesn't combine subword tokens #145

Closed
tomhosking opened this issue Feb 13, 2020 · 5 comments · Fixed by #147

tomhosking commented Feb 13, 2020

The BertWordPieceTokenizer doesn't seem to correctly combine subword tokens when decoding. For example:

from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer('./data/bert-vocabs/bert-base-cased-vocab.txt', lowercase=False)
output = tokenizer.encode('The word juggler is not in the vocab')
print(tokenizer.decode(output.ids))

The word j ##ug ##gler is not in the v ##oc ##ab

I would have expected this operation to be reversible, returning the original input string with the subword tokens combined back into whole words.

Using Python 3.6.9, tokenizers==0.4.2
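
For reference, this is roughly the merge I was expecting (just an illustrative sketch; merge_wordpieces is my own helper, not part of the tokenizers API):

def merge_wordpieces(tokens):
    # Glue pieces carrying the standard "##" WordPiece continuation prefix
    # back onto the previous word.
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)

print(merge_wordpieces(["The", "word", "j", "##ug", "##gler", "is", "not", "in", "the", "v", "##oc", "##ab"]))
# The word juggler is not in the vocab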


tomhosking commented Feb 13, 2020

In fact, decoding also behaves differently from transformers==2.3.0 around punctuation, even though the token ids are the same:

from tokenizers import BertWordPieceTokenizer
tok = BertWordPieceTokenizer('./data/bert-vocabs/bert-base-cased-vocab.txt', lowercase=False)
output = tok.encode("Hello, y'all! How are you?")
print(output.ids)
print(tok.decode(output.ids))

[101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 136, 102]
Hello , y ' all ! How are you ?

from transformers import BertTokenizer
old_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
output = old_tokenizer.encode("Hello, y'all! How are you?")
print(output)
print(old_tokenizer.decode(output, skip_special_tokens=True))

[101, 8667, 117, 194, 112, 1155, 106, 1731, 1132, 1128, 136, 102]
Hello, y'all! How are you?


n1t0 commented Feb 14, 2020

Hi @tomhosking! Thank you for this bug report! The first part was indeed a bug caused by a typo that was fixed here.

Regarding the second part, there is indeed a final step in the decoding that we do in transformers but not yet in tokenizers. We are in the process of adding it with #147.

Both of these fixes should be included in the next release, which you can expect probably tomorrow or early next week.
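
In the meantime, the idea behind that last step is simply to undo the spaces that the plain whitespace join leaves around punctuation and contractions. A rough sketch (illustrative only, not the actual code in transformers or in #147):

import re

def cleanup(text):
    # Remove spaces before common punctuation: "you ?" -> "you?"
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Remove spaces around apostrophes: "y ' all" -> "y'all"
    text = re.sub(r"\s*'\s*", "'", text)
    return text

print(cleanup("Hello , y ' all ! How are you ?"))
# Hello, y'all! How are you?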

@tomhosking (Author)

Awesome, thanks for the speedy response! And great work on the library as a whole, the offsets in particular are super useful!

@stefan-it

Hi @n1t0, does this bug also affect newly created BERT vocabs (using the training script)? E.g. I created a BERT-compatible vocab for the upcoming Turkish BERT model 🤔


n1t0 commented Feb 14, 2020

Hi @stefan-it. These issues shouldn't have any effect on you. Only the .decode() part was affected: subword prefixes were not being removed, so words were not reconstructed into their original form. It does not affect vocabularies at all.
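
To make that concrete, only the string reconstruction in .decode() changes with the fix; encoding with your trained vocab gives the same ids and tokens either way. A toy check (the vocab path here is hypothetical):

from tokenizers import BertWordPieceTokenizer

tok = BertWordPieceTokenizer('./my-turkish-vocab.txt', lowercase=False)
enc = tok.encode("Merhaba dünya")
print(enc.ids)              # unchanged by the fix
print(enc.tokens)           # unchanged by the fix
print(tok.decode(enc.ids))  # only this output changes with the fix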

n1t0 closed this as completed in #147 on Feb 14, 2020