This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Replace existing WordSplitter with Tokenizers #3361

Merged: 5 commits into allenai:master on Oct 16, 2019

Conversation

@sai-prasanna (Contributor) commented on Oct 15, 2019

Removes the notion of a word splitter and replaces it with tokenizers.

Fixes #3345

Remove WordSplitter and move the existing splitters to tokenizer.
@matt-gardner (Contributor) left a comment


Thanks @sai-prasanna, this is great! Unfortunately, some names still need to be changed. Hopefully that can be done with some sed commands, so it's not too much work.

It'd also be good to have a test to make sure the old configs are still accepted and processed correctly, to make sure this is backwards compatible for people who were using config files. (If you wrote your own entry point and instantiated tokenizers, sorry, you'll have to make some code changes, but that's not the recommended usage.)
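
A minimal sketch of the kind of backwards-compatibility check being requested, assuming the pre-PR config shape `{"type": "word", "word_splitter": {...}}` and AllenNLP's `Params` / `Tokenizer.from_params` API; this is illustrative, not the PR's actual test:

```python
# Hedged sketch, not the PR's test: an old-style "word" tokenizer config that
# names a word_splitter should still construct a working tokenizer.
from allennlp.common import Params
from allennlp.data.tokenizers import Tokenizer

legacy_params = Params({"type": "word", "word_splitter": {"type": "spacy"}})
tokenizer = Tokenizer.from_params(legacy_params)

tokens = tokenizer.tokenize("The quick brown fox.")
assert [t.text for t in tokens] == ["The", "quick", "brown", "fox", "."]
```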

Review comments (since resolved) were left on:
- allennlp/data/dataset_readers/copynet_seq2seq.py
- allennlp/data/dataset_readers/masked_language_modeling.py
- allennlp/data/tokenizers/word_tokenizer.py (several comments)
- allennlp/data/tokenizers/sentence_splitter.py
Move legacy handling into Tokenizer.from_params.
And move tokenizer tests to individual files.
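
A rough sketch of what that legacy handling could amount to. The helper name and exact key handling below are assumptions for illustration; the PR puts this logic inside `Tokenizer.from_params` rather than in a separate function:

```python
from typing import Any, Dict

def translate_legacy_tokenizer_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a pre-#3361 {"type": "word", "word_splitter": {...}} config into
    the new flat tokenizer config, where the splitter's type names the tokenizer
    directly (e.g. "spacy")."""
    if config.get("type") != "word":
        return config  # Already in the new format.
    splitter = config.pop("word_splitter", {"type": "spacy"})
    new_config = dict(splitter)  # Splitter arguments become tokenizer arguments.
    # Remaining WordTokenizer arguments (e.g. start_tokens) carry over unchanged.
    new_config.update({key: value for key, value in config.items() if key != "type"})
    return new_config
```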
@matt-gardner (Contributor) left a comment


Thanks, this is awesome!

If there is a config (in sniff tests or in the main tests) that needs start and end tokens with a particular tokenizer, we should add them to that tokenizer. Otherwise, I wouldn't worry about it; it's easy to add them in a later PR if necessary.

One minor change (and possibly another one if I'm understanding your comment about the sniff tests correctly), and then this is good to merge.
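
Purely to illustrate the point about start and end tokens: with the flat tokenizer config, boundary markers could sit directly on the tokenizer, assuming the `start_tokens` / `end_tokens` parameter names from the old WordTokenizer get added to the tokenizer in question as suggested above.

```python
# Hypothetical flat tokenizer config carrying sentence-boundary markers directly.
tokenizer_config = {
    "type": "spacy",
    "start_tokens": ["<s>"],
    "end_tokens": ["</s>"],
}
```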

A review comment (since resolved) was left on allennlp/data/tokenizers/white_space_tokenizer.py.
Rename white_space_tokenizer to whitespace_tokenizer, and add it to the registry under "whitespace".
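
A minimal sketch, assuming AllenNLP's `Registrable` API, of what registering a tokenizer under the name "whitespace" looks like; the class body is illustrative rather than the PR's exact code:

```python
from typing import List

from allennlp.data.tokenizers import Token, Tokenizer

@Tokenizer.register("whitespace")
class WhitespaceTokenizer(Tokenizer):
    """Splits text on whitespace only, so configs can request {"type": "whitespace"}."""

    def tokenize(self, text: str) -> List[Token]:
        return [Token(t) for t in text.split()]
```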
@matt-gardner (Contributor) commented:

Thanks again!

@matt-gardner matt-gardner merged commit 2850579 into allenai:master Oct 16, 2019
reiyw pushed a commit to reiyw/allennlp that referenced this pull request on Nov 12, 2019:
* WIP: Remove splitter

* Convert WordSplitters to Tokenizers

Remove WordSplitter and move the existing splitters to tokenizer.

* Move Tokenizers to separate files.

Move legacy handling into Tokenizer.from_params.

* Add legacy tokenizer loading test.

And move tokenizer tests to individual files.

* Rename white_space_tokenizer to whitespace_tokenizer

And add it to registry under "whitespace".
brendan-ai2 added a commit to allenai/allennlp-hub that referenced this pull request on Nov 15, 2019:
- For allenai/allennlp#3351.
- Conveniently, allenai/allennlp#3361 broke `allennlp_semparse` a while back, so the [AllenNLP Hub Master Build](http://build.allennlp.org/viewType.html?buildTypeId=AllenNLPHub_Master) should break when this PR is merged.
  - We should then fix `allennlp-semparse` and verify that the build goes green.