
Remove WordSplitter #3345

Closed
matt-gardner opened this issue Oct 10, 2019 · 5 comments · Fixed by #3361

@matt-gardner
Contributor

Another candidate idea for simplifying API stuff for a 1.0 release: remove the whole notion of WordSplitter, and just call everything Tokenizers. Basically no one uses the extra stuff that the WordTokenizer has, and it just adds a level of indirection that's unnecessary. If you want the extra filtering and stemming, implement it in a standalone Tokenizer.
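
To make the flattening concrete, here is a rough before/after of the tokenizer block in a config, written as Python dicts rather than jsonnet. The key names mirror the existing word_splitter / word_filter / word_stemmer options; the "spacy" registration name for the flattened tokenizer is an assumption.

```python
# Today: a WordTokenizer wrapping a WordSplitter, plus rarely used extras.
old_tokenizer_config = {
    "type": "word",
    "word_splitter": {"type": "spacy"},
    "word_filter": {"type": "pass_through"},   # almost nobody changes these
    "word_stemmer": {"type": "pass_through"},
}

# Proposed: the splitter *is* the tokenizer, with no extra indirection.
new_tokenizer_config = {"type": "spacy"}
```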

matt-gardner added this to the 1.0.0 milestone on Oct 10, 2019
@sai-prasanna
Contributor

sai-prasanna commented Oct 13, 2019

Can I make the following changes?

  1. Change the existing word splitters into individual tokenizers: SpacyWordSplitter -> SpacyTokenizer, BertWordSplitter -> BertBasicTokenizer, etc. (see the sketch at the end of this comment).
  2. Remove the word_stemmer and word_filter code completely. Some users may rely on this, but a breaking change is permitted in 1.0.0?
  3. Make SpacyTokenizer the default tokenizer instead of the WordTokenizer.

On a side note, can we replace the default spacy tokenizer with something faster that does only tokenization (like https://github.com/microsoft/BlingFire)? We do still need spacy for POS tagging for the POS indexer, though.
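
For point 1, here is a minimal sketch of what one of the renamed tokenizers could look like as a standalone class. The constructor arguments, the registration mechanism, and the Token fields are illustrative assumptions, not the actual AllenNLP signatures.

```python
from typing import List, NamedTuple, Optional

import spacy


class Token(NamedTuple):
    # Illustrative stand-in for allennlp's Token class.
    text: str
    pos_: Optional[str] = None


class SpacyTokenizer:  # would carry a @Tokenizer.register(...) decorator in allennlp
    """What SpacyWordSplitter could become once splitters are just tokenizers."""

    def __init__(self, language: str = "en_core_web_sm", pos_tags: bool = False) -> None:
        # Disable pipeline components we don't need so plain tokenization stays fast;
        # the tagger is only kept when an indexer actually needs POS tags.
        disable = ["parser", "ner"] + ([] if pos_tags else ["tagger"])
        self._spacy = spacy.load(language, disable=disable)
        self._pos_tags = pos_tags

    def tokenize(self, text: str) -> List[Token]:
        doc = self._spacy(text)
        return [Token(t.text, t.pos_ if self._pos_tags else None) for t in doc]
```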

@matt-gardner
Contributor Author

Yes, you can do those. This will break sniff tests, though, as it will change config file requirements. A few thoughts:

  1. We probably don't need the OpenAI or BERT tokenizers as separate objects, because we now have the pretrained transformer tokenizer that should let you get the same functionality.
  2. We can probably put some special logic in a custom Tokenizer.from_params method that checks for "type": "word", pulls out the word splitter, and redirects (see the sketch after this list). Does this make sense? I can give more detail if you need it. This should handle most of the backwards-compatibility issues with config files, and it should be possible to make this pass our sniff tests without modifying any configs.
  3. It's fine to remove the stemmer and filter stuff. If someone wants that functionality, they can open a PR to put it back in a new Tokenizer. I really doubt many people actually want it.
  4. No strong opinion on the default tokenizer. I'd guess that most people these days are using pretrained transformers for most things, so they should be using a tokenizer that matches.
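
Here is a minimal, self-contained sketch of the from_params redirect from point 2, using a plain dict registry as a stand-in for the Registrable machinery; TOKENIZER_REGISTRY and tokenizer_from_params are hypothetical names, and the point is just the "type": "word" check and the redirect to the word splitter's config.

```python
from typing import Any, Callable, Dict


class SpacyTokenizer:
    """Trivial placeholder for the real tokenizer class (see the earlier sketch)."""
    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


# Hypothetical stand-in for AllenNLP's Registrable machinery.
TOKENIZER_REGISTRY: Dict[str, Callable[..., Any]] = {"spacy": SpacyTokenizer}


def tokenizer_from_params(params: Dict[str, Any]) -> Any:
    """Build a tokenizer from a config dict, redirecting old-style
    {"type": "word", "word_splitter": {...}} configs to the new tokenizers."""
    params = dict(params)  # don't mutate the caller's config
    if params.get("type") == "word":
        # Old WordTokenizer config: the word_splitter block becomes the whole
        # tokenizer config.  Spacy was the old default splitter, and any
        # word_filter / word_stemmer options are simply dropped here.
        params = dict(params.get("word_splitter", {"type": "spacy"}))
    tokenizer_type = params.pop("type", "spacy")
    return TOKENIZER_REGISTRY[tokenizer_type](**params)


# An old-style config entry still resolves to the new tokenizer:
tokenizer = tokenizer_from_params({"type": "word", "word_splitter": {"type": "spacy"}})
assert isinstance(tokenizer, SpacyTokenizer)
```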

@matt-gardner
Contributor Author

Well, spacy should probably stay the default, so we don't break a whole lot of existing saved models.

@sai-prasanna
Contributor

  1. We are using the "bert-basic" splitter in the "bert_for_classification" and "bert_pooler" examples.
    It is needed in cases where we use the "word_piece_indexer", etc. If we are only going to support subword tokenization in the Tokenizer itself, we don't have to convert the "bert-basic" or "openai" splitters into tokenizers. This would involve changing the "bert_for_classification" and "bert_pooler" fixture models.

  2. If we don't support "bert-basic" etc., we can remove the existing "wordpiece" indexer. Maybe in a separate PR (or it could be in this one) we could also remove pytorch-pretrained-bert.

  3. I can use a check in from_params to make it backward compatible, but if it isn't too much pain I can change the config files.

@matt-gardner
Contributor Author

  1. The bert_for_classification example should definitely use aligned tokenizers, not our mismatched ones. Going back and forth between tokenizations is a waste when all you're doing is classification.

  2. We do want to support mismatched indexing in our pretrained_transformer indexers, where you tokenize by words or whitespace or whatever (or have pre-tokenized input) but want to encode things with a wordpiece transformer (see the sketch after this list). So we'll want to move the logic in the current wordpiece indexer into either the existing pretrained_transformer indexer or a new one that's dedicated to the mismatched case. But, yeah, this could all happen in a separate PR. If there isn't an issue that's tracking this already, opening one for it would be good.

  3. Changing the config files is not something you can do by yourself. You could download all of the models that we have and change the configs, but then we'd have to upload them again. It'd be better to not break this if we can avoid it, and I'm pretty sure we should be able to maintain compatibility for most models.
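
For point 2, a rough sketch of the mismatched case, using a huggingface tokenizer as a stand-in for whatever the indexer ends up calling. The input is already split into words, each word is expanded into wordpieces, and the recorded offsets let the model pool the pieces back to one vector per original token. Variable names here are illustrative.

```python
from typing import List, Tuple

from transformers import AutoTokenizer  # stand-in for the indexer's wordpiece tokenizer

wordpiece_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pre-tokenized input, e.g. from a whitespace or spacy tokenizer.
words = ["AllenNLP", "makes", "NLP", "easy"]

wordpieces: List[str] = []
offsets: List[Tuple[int, int]] = []
for word in words:
    pieces = wordpiece_tokenizer.tokenize(word)  # e.g. ["allen", "##nl", "##p"]
    start = len(wordpieces)
    wordpieces.extend(pieces)
    offsets.append((start, start + len(pieces) - 1))

# The model embeds `wordpieces`, then uses `offsets[i]` to pool the pieces
# belonging to `words[i]` (e.g. take the first piece, or average them),
# giving exactly one vector per original word-level token.
```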
