Decoding with BertWordPieceTokenizer doesn't combine subword tokens #145
Comments
In fact, decoding behaves differently to […]
Hi @tomhosking! Thank you for this bug report! The first part was indeed a bug caused by a typo that was fixed here. Considering the second part, there is indeed a last step to the decoding that we do in […]. Both of these should be included in the next release, which you can expect probably tomorrow or early next week.
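For reference, "combining subword tokens" here means gluing WordPiece continuation pieces (the `##`-prefixed ones) back onto the piece before them. The helper below is only an illustrative sketch of that idea, not the library's actual implementation:

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the previous piece.

    Illustrative only: this mimics what a WordPiece-aware decoder is
    expected to do; it is not how the tokenizers library does it.
    """
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # drop the '##' marker and append
        else:
            words.append(tok)
    return " ".join(words)


print(merge_wordpieces(["my", "token", "##izer", "works"]))
# -> "my tokenizer works"
```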
Awesome, thanks for the speedy response! And great work on the library as a whole, the offsets in particular are super useful!
Hi @n1t0, does this bug also affect vocabs created with the training script? E.g. I created a BERT-compatible vocab for the upcoming Turkish BERT model 🤔
Hi @stefan-it. These issues shouldn't have any effect for you. Only the […]
The `BertWordPieceTokenizer` doesn't seem to correctly combine subword tokens when decoding (see the example sketched below). I would have expected this operation to be reversible, and to return the original input string with the subword tokens combined into a single word.
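The original snippet from the report is not preserved here; the following is a minimal sketch of that kind of round trip, assuming a local BERT WordPiece vocab file (the path below is hypothetical):

```python
from tokenizers import BertWordPieceTokenizer

# Assumed local vocab file; any BERT WordPiece vocab should do.
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")

encoding = tokenizer.encode("my tokenizer works")
print(encoding.tokens)  # pieces such as 'token', '##izer' (exact split depends on the vocab)

decoded = tokenizer.decode(encoding.ids)
print(decoded)
# Expected: the original sentence with '##' pieces merged back into whole words.
# Observed with tokenizers==0.4.2: the pieces are not recombined.
```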
Using Python 3.6.9, `tokenizers==0.4.2`.