Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master     #732      +/-   ##
==========================================
- Coverage   90.32%   81.25%    -9.08%
==========================================
  Files          64       64
  Lines        6077     6064       -13
==========================================
- Hits         5489     4927      -562
- Misses        588     1137      +549
```
Force-pushed from ced71c9 to 6e3038a
@rich-junwang you might be interested in this
Thanks, this is helpful!
@rich-junwang feel free to review the code and leave comments
Force-pushed from f1f7272 to aad6989
@kenjewu @eric-haibin-lin please take a look at the refactored BERTVocab. The only user-visible change is that the static default attributes (UNKNOWN_TOKEN, etc.) are moved away from the BERTVocab class. Conceptually it doesn't seem to make sense to expose them as part of the object, given that the object may not be using the default values. If you think they should be preserved, I can add them back.
@eric-haibin-lin @leezu if the attributes can be flexibly registered, do we still need BERTVocab?
Only for backwards compatibility; I'm fine with breaking it, but we'll have to replace the BERT artifacts on S3.
This commit does not change any behavior, but only clarifies the code.
This restores the previous behavior where duplicate special tokens are not counted as duplicate reserved tokens. Further, a test is added to guarantee that future changes preserve this behavior. The inadvertent change was caught by test_toy in scripts/tests.
This adds backwards compatibility for deserializing some vocabularies serialized before dmlc#749.
`ValueError("'<special>' is not part of the vocabulary. 'special_token' cannot identify a non-existing token.",)`
@szha the last commit introduces the registration of special tokens via keyword arguments as discussed. For deprecation of the padding, bos and eos tokens I'll open a separate PR. You may want to take a look at the last commit @eric-haibin-lin
- Allow specification of special tokens as keyword arguments
- Allow users to specify the order of the vocabulary index
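The keyword-argument registration can be illustrated with a minimal, self-contained sketch. This is not gluonnlp's actual implementation; `SketchVocab` and the token lists are made up for illustration. It registers each `name=token` keyword argument as an attribute and raises the ValueError shown above when the token is missing from the vocabulary:

```python
# Minimal sketch (NOT gluonnlp's actual code) of special-token
# registration via keyword arguments, exposed as vocab attributes.
class SketchVocab:
    def __init__(self, tokens, **special_tokens):
        # Map each token to its index in the vocabulary.
        self._token_to_idx = {tok: i for i, tok in enumerate(tokens)}
        for name, token in special_tokens.items():
            if token not in self._token_to_idx:
                raise ValueError(
                    "'{}' is not part of the vocabulary. '{}' cannot "
                    "identify a non-existing token.".format(token, name))
            # Expose the token as an attribute, e.g. vocab.special_token
            setattr(self, name, token)

vocab = SketchVocab(['<unk>', 'hello', '<special>'],
                    special_token='<special>')
print(vocab.special_token)  # -> <special>
```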
Currently the `gluonnlp.Vocab` always assigns 0 as the id of `<unk>` (#393). Further, reserved tokens are not exposed as attributes of the vocab (#572). This PR addresses both issues.
Checklist
Essentials
Changes
- New `token_to_idx` argument to `gluonnlp.Vocab`, which allows users to optionally specify the index to be associated with a token for a subset of tokens in the vocabulary. (Fixes gluonnlp.Vocab hard coded id of `<unk>` token #393)
- New `identifiers_to_tokens` argument to `gluonnlp.Vocab`, which allows users to optionally specify a Mapping, specifying identifiers for tokens that will be exposed as attributes of the vocabulary. (Fixes [Vocab] Support special token registration #572)
Comments
The `token_to_idx` argument can only contain tokens that will actually be part of the vocabulary. On the one hand, a user may not know which tokens will be part of the vocabulary, e.g. when specifying a maximum size argument. On the other hand, if the user does not know whether a certain token will be part of the vocab, it seems unlikely that the user would want to pre-specify the index of that token. Thus this shouldn't pose any problems.
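As a rough illustration of the index-assignment constraint, a hypothetical helper (not gluonnlp's code; `build_index` is an assumed name) could honor the user-specified indices first and then fill the remaining slots in order:

```python
# Hypothetical sketch of honoring a user-supplied token_to_idx mapping
# while assigning the remaining tokens to the free index slots.
def build_index(tokens, token_to_idx):
    unknown = set(token_to_idx) - set(tokens)
    if unknown:
        raise ValueError('token_to_idx contains tokens not in the '
                         'vocabulary: {}'.format(sorted(unknown)))
    # Reserve the user-requested slots first.
    idx_to_token = [None] * len(tokens)
    for token, idx in token_to_idx.items():
        idx_to_token[idx] = token
    # Fill the remaining tokens into the free slots in order.
    free = iter(i for i, t in enumerate(idx_to_token) if t is None)
    for token in tokens:
        if token not in token_to_idx:
            idx_to_token[next(free)] = token
    return {t: i for i, t in enumerate(idx_to_token)}

# '<unk>' no longer has to occupy index 0:
print(build_index(['<unk>', 'a', 'b'], {'<unk>': 2}))
# -> {'a': 0, 'b': 1, '<unk>': 2}
```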