
Remove WordSplitter #3345

Closed
matt-gardner opened this issue Oct 10, 2019 · 5 comments · Fixed by #3361

@matt-gardner
Contributor

Another candidate idea for simplifying API stuff for a 1.0 release: remove the whole notion of WordSplitter, and just call everything Tokenizers. Basically no one uses the extra stuff that the WordTokenizer has, and it just adds a level of indirection that's unnecessary. If you want the extra filtering and stemming, implement it in a standalone Tokenizer.
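
To make the flattening concrete, here is a rough before/after of the tokenizer block in a config, written as Python dicts rather than jsonnet. The key names mirror the existing word_splitter / word_filter / word_stemmer options; the "spacy" registration name for the flattened tokenizer is an assumption.

```python
# Today: a WordTokenizer wrapping a WordSplitter, plus rarely used extras.
old_tokenizer_config = {
    "type": "word",
    "word_splitter": {"type": "spacy"},
    "word_filter": {"type": "pass_through"},   # almost nobody changes these
    "word_stemmer": {"type": "pass_through"},
}

# Proposed: the splitter *is* the tokenizer, with no extra indirection.
new_tokenizer_config = {"type": "spacy"}
```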

matt-gardner added this to the 1.0.0 milestone on Oct 10, 2019
@sai-prasanna
Contributor

sai-prasanna commented Oct 13, 2019

Can I make the following changes?

  1. Change the existing word splitters into individual tokenizers: SpacyWordSplitter -> SpacyTokenizer, BertWordSplitter -> BertBasicTokenizer, etc. (see the sketch at the end of this comment).
  2. Remove the word_stemmer and word_filter code completely. Some users may rely on this, but a breaking change is permitted in 1.0.0?
  3. Make SpacyTokenizer the default tokenizer instead of the WordTokenizer.

On a side note, can we replace the default spacy tokenizer with something faster that does only tokenization (like https://github.com/microsoft/BlingFire)? We do still need spacy for POS tagging for the POS indexer, though.
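
For point 1, here is a minimal sketch of what one of the renamed tokenizers could look like as a standalone class. The constructor arguments, the registration mechanism, and the Token fields are illustrative assumptions, not the actual AllenNLP signatures.

```python
from typing import List, NamedTuple, Optional

import spacy


class Token(NamedTuple):
    # Illustrative stand-in for allennlp's Token class.
    text: str
    pos_: Optional[str] = None


class SpacyTokenizer:  # would carry a @Tokenizer.register(...) decorator in allennlp
    """What SpacyWordSplitter could become once splitters are just tokenizers."""

    def __init__(self, language: str = "en_core_web_sm", pos_tags: bool = False) -> None:
        # Disable pipeline components we don't need so plain tokenization stays fast;
        # the tagger is only kept when an indexer actually needs POS tags.
        disable = ["parser", "ner"] + ([] if pos_tags else ["tagger"])
        self._spacy = spacy.load(language, disable=disable)
        self._pos_tags = pos_tags

    def tokenize(self, text: str) -> List[Token]:
        doc = self._spacy(text)
        return [Token(t.text, t.pos_ if self._pos_tags else None) for t in doc]
```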

@matt-gardner
Contributor Author

Yes, you can do those. This will break sniff tests, though, as it will change config file requirements. A few thoughts:

  1. We probably don't need the OpenAI or BERT tokenizers as separate objects, because we now have the pretrained transformer tokenizer that should let you get the same functionality.
  2. We can probably put some special logic in a custom Tokenizer.from_params method that checks for "type": "word", pulls out the word splitter, and redirects (see the sketch after this list). Does this make sense? I can give more detail if you need it. This should handle most of the backwards-compatibility issues with config files, and it should be possible to make this pass our sniff tests without modifying any configs.
  3. It's fine to remove the stemmer and filter stuff. If someone wants that functionality, they can open a PR to put it back in a new Tokenizer. I really doubt many people actually want it.
  4. No strong opinion on the default tokenizer. I'd guess that most people these days are using pretrained transformers for most things, so they should be using a tokenizer that matches.
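
Here is a minimal, self-contained sketch of the from_params redirect from point 2, using a plain dict registry as a stand-in for the Registrable machinery; TOKENIZER_REGISTRY and tokenizer_from_params are hypothetical names, and the point is just the "type": "word" check and the redirect to the word splitter's config.

```python
from typing import Any, Callable, Dict


class SpacyTokenizer:
    """Trivial placeholder for the real tokenizer class (see the earlier sketch)."""
    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


# Hypothetical stand-in for AllenNLP's Registrable machinery.
TOKENIZER_REGISTRY: Dict[str, Callable[..., Any]] = {"spacy": SpacyTokenizer}


def tokenizer_from_params(params: Dict[str, Any]) -> Any:
    """Build a tokenizer from a config dict, redirecting old-style
    {"type": "word", "word_splitter": {...}} configs to the new tokenizers."""
    params = dict(params)  # don't mutate the caller's config
    if params.get("type") == "word":
        # Old WordTokenizer config: the word_splitter block becomes the whole
        # tokenizer config.  Spacy was the old default splitter, and any
        # word_filter / word_stemmer options are simply dropped here.
        params = dict(params.get("word_splitter", {"type": "spacy"}))
    tokenizer_type = params.pop("type", "spacy")
    return TOKENIZER_REGISTRY[tokenizer_type](**params)


# An old-style config entry still resolves to the new tokenizer:
tokenizer = tokenizer_from_params({"type": "word", "word_splitter": {"type": "spacy"}})
assert isinstance(tokenizer, SpacyTokenizer)
```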

@matt-gardner
Contributor Author

Well, spacy should probably stay the default, so we don't break a whole lot of existing saved models.

@sai-prasanna
Contributor

  1. We are using the "bert-basic" splitter in the "bert_for_classification" and "bert_pooler" examples.
    It is needed in cases where we use the "word_piece_indexer", etc. If we are only going to support subword tokenization in the Tokenizer itself, we don't have to convert the "bert-basic" or "openai" splitters into tokenizers. This would involve changing the "bert_for_classification" and "bert_pooler" fixture models.

  2. If we don't support "bert-basic" etc., we can remove the existing "wordpiece" indexer. Maybe in a separate PR (or it could be in this one) we could also remove pytorch-pretrained-bert.

  3. I can use a check in from_params to make it backward compatible, but if it isn't too much pain I can change the config files.

@matt-gardner
Contributor Author

  1. The bert_for_classification example should definitely use aligned tokenizers, not our mismatched ones. Going back and forth between tokenizations is a waste when all you're doing is classification.

  2. We do want to support mismatched indexing in our pretrained_transformer indexers, where you tokenize by words or whitespace or whatever (or have pre-tokenized input) but want to encode things with a wordpiece transformer (see the sketch after this list). So we'll want to move the logic in the current wordpiece indexer into either the existing pretrained_transformer indexer or a new one that's dedicated to the mismatched case. But, yeah, this could all happen in a separate PR. If there isn't an issue that's tracking this already, opening one for it would be good.

  3. Changing the config files is not something you can do by yourself. You could download all of the models that we have and change the configs, but then we'd have to upload them again. It'd be better to not break this if we can avoid it, and I'm pretty sure we should be able to maintain compatibility for most models.
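
For point 2, a rough sketch of the mismatched case, using a huggingface tokenizer as a stand-in for whatever the indexer ends up calling. The input is already split into words, each word is expanded into wordpieces, and the recorded offsets let the model pool the pieces back to one vector per original token. Variable names here are illustrative.

```python
from typing import List, Tuple

from transformers import AutoTokenizer  # stand-in for the indexer's wordpiece tokenizer

wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-tokenized input, e.g. from a whitespace or spacy tokenizer.
words = ["AllenNLP", "makes", "NLP", "easy"]

wordpieces: List[str] = []
offsets: List[Tuple[int, int]] = []
for word in words:
    pieces = wordpiece_tokenizer.tokenize(word)  # e.g. ["allen", "##nl", "##p"]
    start = len(wordpieces)
    wordpieces.extend(pieces)
    offsets.append((start, start + len(pieces) - 1))

# The model embeds `wordpieces`, then uses `offsets[i]` to pool the pieces
# belonging to `words[i]` (e.g. take the first piece, or average them),
# giving exactly one vector per original word-level token.
```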
