This repository was archived by the owner on Jan 15, 2024. It is now read-only.

[FEATURE] Flexible vocabulary #732

Merged
merged 15 commits into from
Jun 6, 2019

Conversation

leezu
Contributor

@leezu leezu commented May 27, 2019

Currently gluonnlp.Vocab always assigns 0 as the id of <unk> (#393). Further,
reserved tokens are not exposed as attributes of the vocab (#572). This PR
addresses both issues.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

Comments

  • The token_to_idx argument may only contain tokens that will actually be part
    of the vocabulary. On the one hand, a user may not know in advance which
    tokens will end up in the vocabulary, e.g. when specifying a maximum-size
    argument. On the other hand, if the user does not know whether a certain
    token will be part of the vocab, it seems unlikely that they would want to
    pre-specify the index of that token. Thus this restriction shouldn't pose
    any problems.

@leezu leezu requested a review from szha as a code owner May 27, 2019 11:05
@codecov

codecov bot commented May 27, 2019

Codecov Report

❗ No coverage uploaded for pull request head (flexiblevocab@af1aeb0). Click here to learn what that means.
The diff coverage is n/a.

@codecov

codecov bot commented May 27, 2019

Codecov Report

Merging #732 into master will decrease coverage by 9.07%.
The diff coverage is 94.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
- Coverage   90.32%   81.25%   -9.08%     
==========================================
  Files          64       64              
  Lines        6077     6064      -13     
==========================================
- Hits         5489     4927     -562     
- Misses        588     1137     +549
Impacted Files Coverage Δ
src/gluonnlp/base.py 87.87% <100%> (ø) ⬆️
src/gluonnlp/vocab/bert.py 98.03% <100%> (+12.99%) ⬆️
src/gluonnlp/embedding/token_embedding.py 88.09% <72.22%> (-2.36%) ⬇️
src/gluonnlp/vocab/vocab.py 97.35% <97.56%> (-0.61%) ⬇️
src/gluonnlp/model/train/cache.py 26.19% <0%> (-71.43%) ⬇️
src/gluonnlp/model/train/language_model.py 42.04% <0%> (-55.12%) ⬇️
src/gluonnlp/embedding/evaluation.py 41.8% <0%> (-54.1%) ⬇️
src/gluonnlp/data/batchify/language_model.py 44.03% <0%> (-52.3%) ⬇️
src/gluonnlp/model/translation.py 20.63% <0%> (-50.8%) ⬇️
src/gluonnlp/model/language_model.py 50.38% <0%> (-49.62%) ⬇️
... and 17 more

@leezu leezu requested a review from eric-haibin-lin May 27, 2019 11:05
@mli
Member

mli commented May 27, 2019

Job PR-732/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/1/index.html

@leezu leezu force-pushed the flexiblevocab branch 4 times, most recently from ced71c9 to 6e3038a Compare May 27, 2019 14:32
@mli
Member

mli commented May 27, 2019

Job PR-732/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/5/index.html

@szha szha requested a review from astonzhang May 27, 2019 16:37
@mli
Member

mli commented May 27, 2019

Job PR-732/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/6/index.html

@eric-haibin-lin
Member

@rich-junwang you might be interested in this

@rich-junwang

> @rich-junwang you might be interested in this

Thanks, this is helpful!

@leezu
Contributor Author

leezu commented May 28, 2019

@rich-junwang feel free to review the code and leave comments

@mli
Member

mli commented May 29, 2019

Job PR-732/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/7/index.html

@leezu leezu force-pushed the flexiblevocab branch 4 times, most recently from f1f7272 to aad6989 Compare May 29, 2019 15:48
@leezu leezu requested a review from vanewu May 29, 2019 15:48
@leezu
Contributor Author

leezu commented May 29, 2019

@kenjewu @eric-haibin-lin please take a look at the refactored BERTVocab.

The only user-visible change is that the static default UNKNOWN_TOKEN, etc. attributes are moved away from the BERTVocab class. Conceptually it doesn't seem to make sense to expose them as part of the object, given that the object may not be using the default values. If you think they should be preserved, I can add them back.

@szha
Member

szha commented May 29, 2019

@eric-haibin-lin @leezu if the attributes can be flexibly registered, do we still need BERTVocab?

@leezu
Contributor Author

leezu commented May 29, 2019

Only for backwards compatibility; I'm fine with breaking it, but we'd have to replace the BERT artifacts on S3.

Member

@eric-haibin-lin eric-haibin-lin left a comment


@szha the BERTVocab is no longer necessary, but I suggest we keep it in this version for backward compatibility.
@leezu just to double-check: this change doesn't break backward compatibility and users can still use their existing serialized BERTVocab files, right?
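The backward-compatibility question comes down to what the serialized format stores. As a minimal sketch (the function names and JSON schema here are hypothetical, not gluonnlp's actual format), if only the index order and special-token names are written to disk and token_to_idx is recomputed on load, old files that happened to have <unk> at index 0 keep deserializing to the same vocabulary:

```python
import json

def serialize(idx_to_token, unknown_token='<unk>'):
    # Store only the index order and the special-token name; the
    # token-to-index mapping is derived, so files written before the
    # change (with <unk> fixed at index 0) remain loadable unchanged.
    return json.dumps({'idx_to_token': list(idx_to_token),
                       'unknown_token': unknown_token})

def deserialize(payload):
    data = json.loads(payload)
    # Recompute token_to_idx from the stored order.
    token_to_idx = {t: i for i, t in enumerate(data['idx_to_token'])}
    return data['idx_to_token'], token_to_idx, data['unknown_token']
```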

@szha szha added the release focus Progress focus for release label May 31, 2019
leezu added 14 commits June 6, 2019 15:55
This commit does not change any behavior; it only clarifies the code.
This restores the previous behavior where duplicate special tokens are not
counted as duplicate reserved tokens. Further, a test is added to guarantee
that future changes preserve this behavior.

The inadvertent change was caught by the test_toy in scripts/tests
This adds backwards compatibility for deserializing some vocabularies serialized
before dmlc#749
ValueError("'<special>' is not part of the vocabulary. 'special_token' cannot
identify a non-existing token.",)
@leezu
Contributor Author

leezu commented Jun 6, 2019

@szha the last commit introduces the registration of special tokens via keyword arguments, as discussed. For deprecating the padding, bos, and eos tokens I'll open a separate PR.

You may want to take a look at the last commit @eric-haibin-lin
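The keyword-argument registration mentioned above can be sketched roughly as follows (a simplified, self-contained illustration, not gluonnlp's actual Vocab class): any `*_token` keyword is validated against the vocabulary and exposed as an attribute, and naming a token that is not in the vocabulary raises the ValueError quoted in the commit messages above:

```python
class Vocab:
    """Sketch: special tokens passed as *_token keyword arguments are
    exposed as attributes; each must name a token in the vocabulary."""

    def __init__(self, idx_to_token, **kwargs):
        self.idx_to_token = list(idx_to_token)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        for name, token in kwargs.items():
            if not name.endswith('_token'):
                raise ValueError(
                    f'special token argument {name!r} must end in _token')
            if token not in self.token_to_idx:
                raise ValueError(
                    f"'{token}' is not part of the vocabulary. "
                    f"{name!r} cannot identify a non-existing token.")
            # Expose the special token as an attribute, e.g. vocab.mask_token.
            setattr(self, name, token)
```

Under this design a user can write `Vocab(tokens, mask_token='<mask>')` and later read `vocab.mask_token`, without the class hard-coding any particular set of special tokens.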

@mli
Member

mli commented Jun 6, 2019

Job PR-732/32 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/32/index.html

@leezu leezu merged commit 1e50a66 into dmlc:master Jun 6, 2019
@leezu leezu deleted the flexiblevocab branch June 6, 2019 18:48
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
- Allow specification of special tokens as keyword arguments
- Allow users to specify the order of the vocabulary index
@leezu leezu mentioned this pull request Jun 10, 2019
Labels
release focus Progress focus for release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Vocab] Support special token registration
gluonnlp.Vocab hard coded id of <unk> token
5 participants