-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Replace existing WordSplitter with Tokenizers #3361
Conversation
Remove WordSplitter and move the existing splitters to tokenizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sai-prasanna, this is great! Unfortunately, some names still need to be changed. Hopefully that can be done with some sed commands, so it's not too much work.
It'd also be good to have a test to make sure the old configs are still accepted and processed correctly, to make sure this is backwards compatible for people who were using config files. (If you wrote your own entry point and instantiated tokenizers, sorry, you'll have to make some code changes, but that's not the recommended usage.)
Move legacy handling into Tokenizer.from_params.
And move tokenizer tests to individual files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is awesome!
If there is a config (in sniff tests or in the main tests) that needs start and end tokens with a particular tokenizer, we should add them to that tokenizer. Otherwise, I wouldn't worry about it; it's easy to add them in a later PR if necessary.
One minor change (and possibly another one if I'm understanding your comment about the sniff tests correctly), and then this is good to merge.
And add it to registry under "whitespace".
Thanks again! |
* WIP: Remove splitter * Convert WordSplitters to Tokenizers Remove WordSplitter and move the existing splitters to tokenizer. * Move Tokenizers to separate files. Move legacy handling into Tokenizer.from_params. * Add legacy tokenizer loading test. And move tokenizer tests to individual files. * Rename white_space_tokenizer to whitespace_tokenizer And add it to registry under "whitespace".
- For allenai/allennlp#3351. - Conveniently allenai/allennlp#3361 broke `allennlp_semparse` a while back, so the (AllenNLP Hub Master Build)[http://build.allennlp.org/viewType.html?buildTypeId=AllenNLPHub_Master] should break when this PR merged. - We should then fix `allennlp-semparse` and verify that the build goes green.
Removes the notion of word splitter and replaces it with tokenizer.
Fixes #3345