This repository was archived by the owner on Jan 15, 2024. It is now read-only.

[FEATURE] Flexible vocabulary #732

Merged
merged 15 commits into from
Jun 6, 2019

Conversation

leezu
Contributor

@leezu leezu commented May 27, 2019

Currently gluonnlp.Vocab always assigns 0 as the id of <unk> (#393). Further,
reserved tokens are not exposed as attributes of the vocab (#572). This PR
addresses both issues.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

Comments

  • The token_to_idx argument may only contain tokens that will actually be part
    of the vocabulary. On the one hand, a user may not know in advance which
    tokens will end up in the vocabulary, e.g. when specifying a maximum-size
    argument. On the other hand, if the user does not know whether a certain
    token will be part of the vocab, it seems unlikely that they would want to
    pre-specify the index of that token. Thus this restriction shouldn't pose
    any problems.

@leezu leezu requested a review from szha as a code owner May 27, 2019 11:05
@codecov

codecov bot commented May 27, 2019

Codecov Report

❗ No coverage uploaded for pull request head (flexiblevocab@af1aeb0). Click here to learn what that means.
The diff coverage is n/a.

@codecov

codecov bot commented May 27, 2019

Codecov Report

Merging #732 into master will decrease coverage by 9.07%.
The diff coverage is 94.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
- Coverage   90.32%   81.25%   -9.08%     
==========================================
  Files          64       64              
  Lines        6077     6064      -13     
==========================================
- Hits         5489     4927     -562     
- Misses        588     1137     +549
Impacted Files Coverage Δ
src/gluonnlp/base.py 87.87% <100%> (ø) ⬆️
src/gluonnlp/vocab/bert.py 98.03% <100%> (+12.99%) ⬆️
src/gluonnlp/embedding/token_embedding.py 88.09% <72.22%> (-2.36%) ⬇️
src/gluonnlp/vocab/vocab.py 97.35% <97.56%> (-0.61%) ⬇️
src/gluonnlp/model/train/cache.py 26.19% <0%> (-71.43%) ⬇️
src/gluonnlp/model/train/language_model.py 42.04% <0%> (-55.12%) ⬇️
src/gluonnlp/embedding/evaluation.py 41.8% <0%> (-54.1%) ⬇️
src/gluonnlp/data/batchify/language_model.py 44.03% <0%> (-52.3%) ⬇️
src/gluonnlp/model/translation.py 20.63% <0%> (-50.8%) ⬇️
src/gluonnlp/model/language_model.py 50.38% <0%> (-49.62%) ⬇️
... and 17 more

@leezu leezu requested a review from eric-haibin-lin May 27, 2019 11:05
@mli
Member

mli commented May 27, 2019

Job PR-732/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/1/index.html

@leezu leezu force-pushed the flexiblevocab branch 4 times, most recently from ced71c9 to 6e3038a Compare May 27, 2019 14:32
@mli
Member

mli commented May 27, 2019

Job PR-732/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/5/index.html

@szha szha requested a review from astonzhang May 27, 2019 16:37
@mli
Member

mli commented May 27, 2019

Job PR-732/6 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/6/index.html

@eric-haibin-lin
Member

@rich-junwang you might be interested in this

@rich-junwang

> @rich-junwang you might be interested in this

Thanks, this is helpful!

@leezu
Contributor Author

leezu commented May 28, 2019

@rich-junwang feel free to review the code and leave comments

@mli
Member

mli commented May 29, 2019

Job PR-732/7 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/7/index.html

@leezu leezu force-pushed the flexiblevocab branch 4 times, most recently from f1f7272 to aad6989 Compare May 29, 2019 15:48
@leezu leezu requested a review from vanewu May 29, 2019 15:48
@leezu
Contributor Author

leezu commented May 29, 2019

@kenjewu @eric-haibin-lin please take a look at the refactored BERTVocab.

The only user-visible change is that the static default UNKNOWN_TOKEN, etc. attributes are moved away from the BERTVocab class. Conceptually it doesn't seem to make sense to expose them as part of the object, given that the object may not be using the default values. If you think they should be preserved, I can add them back.

@szha
Member

szha commented May 29, 2019

@eric-haibin-lin @leezu if the attributes can be flexibly registered, do we still need BERTVocab?

@leezu
Contributor Author

leezu commented May 29, 2019

Only for backwards compatibility; I'm fine with breaking it, but we'd have to replace the BERT artifacts on S3.

Member

@eric-haibin-lin eric-haibin-lin left a comment


@szha the BERTVocab is no longer necessary, but I suggest we keep it in this version for backward compatibility.
@leezu just to double-check: this change doesn't break backward compatibility and users can still use their existing serialized BERTVocab files, right?
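The backward-compatibility question comes down to what the serialized format stores. As a minimal sketch (the function names and JSON schema here are hypothetical, not gluonnlp's actual format), if only the index order and special-token names are written to disk and token_to_idx is recomputed on load, old files that happened to have <unk> at index 0 keep deserializing to the same vocabulary:

```python
import json

def serialize(idx_to_token, unknown_token='<unk>'):
    # Store only the index order and the special-token name; the
    # token-to-index mapping is derived, so files written before the
    # change (with <unk> fixed at index 0) remain loadable unchanged.
    return json.dumps({'idx_to_token': list(idx_to_token),
                       'unknown_token': unknown_token})

def deserialize(payload):
    data = json.loads(payload)
    # Recompute token_to_idx from the stored order.
    token_to_idx = {t: i for i, t in enumerate(data['idx_to_token'])}
    return data['idx_to_token'], token_to_idx, data['unknown_token']
```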

@szha szha added the release focus Progress focus for release label May 31, 2019
leezu added 14 commits June 6, 2019 15:55
This commit does not change any behavior; it only clarifies the code.
This restores the previous behavior where duplicate special tokens are not
counted as duplicate reserved tokens. Further, a test is added to guarantee
that future changes preserve this behavior.

The inadvertent change was caught by the test_toy in scripts/tests
This adds backwards compatibility for deserializing some vocabularies serialized
before dmlc#749
ValueError("'<special>' is not part of the vocabulary. 'special_token' cannot
identify a non-existing token.",)
@leezu
Contributor Author

leezu commented Jun 6, 2019

@szha the last commit introduces the registration of special tokens via keyword arguments, as discussed. For deprecating the padding, bos, and eos tokens I'll open a separate PR.

You may want to take a look at the last commit @eric-haibin-lin
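The keyword-argument registration mentioned above can be sketched roughly as follows (a simplified, self-contained illustration, not gluonnlp's actual Vocab class): any `*_token` keyword is validated against the vocabulary and exposed as an attribute, and naming a token that is not in the vocabulary raises the ValueError quoted in the commit messages above:

```python
class Vocab:
    """Sketch: special tokens passed as *_token keyword arguments are
    exposed as attributes; each must name a token in the vocabulary."""

    def __init__(self, idx_to_token, **kwargs):
        self.idx_to_token = list(idx_to_token)
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}
        for name, token in kwargs.items():
            if not name.endswith('_token'):
                raise ValueError(
                    f'special token argument {name!r} must end in _token')
            if token not in self.token_to_idx:
                raise ValueError(
                    f"'{token}' is not part of the vocabulary. "
                    f"{name!r} cannot identify a non-existing token.")
            # Expose the special token as an attribute, e.g. vocab.mask_token.
            setattr(self, name, token)
```

Under this design a user can write `Vocab(tokens, mask_token='<mask>')` and later read `vocab.mask_token`, without the class hard-coding any particular set of special tokens.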

@mli
Member

mli commented Jun 6, 2019

Job PR-732/32 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-732/32/index.html

@leezu leezu merged commit 1e50a66 into dmlc:master Jun 6, 2019
@leezu leezu deleted the flexiblevocab branch June 6, 2019 18:48
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
- Allow specification of special tokens as keyword arguments
- Allow users to specify the order of the vocabulary index
@leezu leezu mentioned this pull request Jun 10, 2019
Labels
release focus Progress focus for release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Vocab] Support special token registration
gluonnlp.Vocab hard coded id of <unk> token
5 participants