This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Replace existing WordSplitter with Tokenizers #3361

Merged: 5 commits into allenai:master on Oct 16, 2019

Conversation

@sai-prasanna (Contributor) commented on Oct 15, 2019

Removes the notion of a word splitter and replaces it with tokenizers.

Fixes #3345

Remove WordSplitter and move the existing splitters to tokenizer.
@matt-gardner (Contributor) left a comment


Thanks @sai-prasanna, this is great! Unfortunately, some names still need to be changed. Hopefully that can be done with some sed commands, so it's not too much work.

It'd also be good to have a test to make sure the old configs are still accepted and processed correctly, to make sure this is backwards compatible for people who were using config files. (If you wrote your own entry point and instantiated tokenizers, sorry, you'll have to make some code changes, but that's not the recommended usage.)
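
A minimal sketch of the kind of backwards-compatibility check being requested, assuming the pre-PR config shape `{"type": "word", "word_splitter": {...}}` and AllenNLP's `Params` / `Tokenizer.from_params` API; this is illustrative, not the PR's actual test:

```python
# Hedged sketch, not the PR's test: an old-style "word" tokenizer config that
# names a word_splitter should still construct a working tokenizer.
from allennlp.common import Params
from allennlp.data.tokenizers import Tokenizer

legacy_params = Params({"type": "word", "word_splitter": {"type": "spacy"}})
tokenizer = Tokenizer.from_params(legacy_params)

tokens = tokenizer.tokenize("The quick brown fox.")
assert [t.text for t in tokens] == ["The", "quick", "brown", "fox", "."]
```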

Review comments (since resolved) were left on:
- allennlp/data/dataset_readers/copynet_seq2seq.py
- allennlp/data/dataset_readers/masked_language_modeling.py
- allennlp/data/tokenizers/word_tokenizer.py (several comments)
- allennlp/data/tokenizers/sentence_splitter.py
Move legacy handling into Tokenizer.from_params.
And move tokenizer tests to individual files.
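
A rough sketch of what that legacy handling could amount to. The helper name and exact key handling below are assumptions for illustration; the PR puts this logic inside `Tokenizer.from_params` rather than in a separate function:

```python
from typing import Any, Dict

def translate_legacy_tokenizer_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Rewrite a pre-#3361 {"type": "word", "word_splitter": {...}} config into
    the new flat tokenizer config, where the splitter's type names the tokenizer
    directly (e.g. "spacy")."""
    if config.get("type") != "word":
        return config  # Already in the new format.
    splitter = config.pop("word_splitter", {"type": "spacy"})
    new_config = dict(splitter)  # Splitter arguments become tokenizer arguments.
    # Remaining WordTokenizer arguments (e.g. start_tokens) carry over unchanged.
    new_config.update({key: value for key, value in config.items() if key != "type"})
    return new_config
```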
@matt-gardner (Contributor) left a comment


Thanks, this is awesome!

If there is a config (in sniff tests or in the main tests) that needs start and end tokens with a particular tokenizer, we should add them to that tokenizer. Otherwise, I wouldn't worry about it; it's easy to add them in a later PR if necessary.

One minor change (and possibly another one if I'm understanding your comment about the sniff tests correctly), and then this is good to merge.
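
Purely to illustrate the point about start and end tokens: with the flat tokenizer config, boundary markers could sit directly on the tokenizer, assuming the `start_tokens` / `end_tokens` parameter names from the old WordTokenizer get added to the tokenizer in question as suggested above.

```python
# Hypothetical flat tokenizer config carrying sentence-boundary markers directly.
tokenizer_config = {
    "type": "spacy",
    "start_tokens": ["<s>"],
    "end_tokens": ["</s>"],
}
```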

A review comment (since resolved) was left on allennlp/data/tokenizers/white_space_tokenizer.py.
Rename white_space_tokenizer to whitespace_tokenizer, and add it to the registry under "whitespace".
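
A minimal sketch, assuming AllenNLP's `Registrable` API, of what registering a tokenizer under the name "whitespace" looks like; the class body is illustrative rather than the PR's exact code:

```python
from typing import List

from allennlp.data.tokenizers import Token, Tokenizer

@Tokenizer.register("whitespace")
class WhitespaceTokenizer(Tokenizer):
    """Splits text on whitespace only, so configs can request {"type": "whitespace"}."""

    def tokenize(self, text: str) -> List[Token]:
        return [Token(t) for t in text.split()]
```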
@matt-gardner (Contributor) commented:

Thanks again!

@matt-gardner matt-gardner merged commit 2850579 into allenai:master Oct 16, 2019
reiyw pushed a commit to reiyw/allennlp that referenced this pull request on Nov 12, 2019:
* WIP: Remove splitter

* Convert WordSplitters to Tokenizers

Remove WordSplitter and move the existing splitters to tokenizer.

* Move Tokenizers to separate files.

Move legacy handling into Tokenizer.from_params.

* Add legacy tokenizer loading test.

And move tokenizer tests to individual files.

* Rename white_space_tokenizer to whitespace_tokenizer

And add it to registry under "whitespace".
brendan-ai2 added a commit to allenai/allennlp-hub that referenced this pull request on Nov 15, 2019:
- For allenai/allennlp#3351.
- Conveniently, allenai/allennlp#3361 broke `allennlp_semparse` a while back, so the [AllenNLP Hub Master Build](http://build.allennlp.org/viewType.html?buildTypeId=AllenNLPHub_Master) should break when this PR is merged.
  - We should then fix `allennlp-semparse` and verify that the build goes green.