-
Notifications
You must be signed in to change notification settings - Fork 831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing the vocab size of the trained Unigram model #952
Conversation
I think this PR is nice and easy. However it is a breaking change, do you mind adding a test for this new behavior (so a test that fails without this PR). |
(So we know when we're breaking later) |
I have a question about where we should run I ask this because some of the test files have its path like
If we want to test on |
You can run And |
Oh, I didn't notice the Makefile, thanks! |
implemented test! |
Hey, can you change this test instead : https://github.com/huggingface/tokenizers/blob/master/bindings/python/tests/bindings/test_trainers.py#L191 It's specifically used for testing training with special tokens, it just lacks vocab_size checking. If we can keep the simple test about the simple case it's better. |
I discovered that if |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last Nit, otherwise looks very good !
let vocab_size_without_special_tokens = if need_add_unk { | ||
self.vocab_size as usize - self.special_tokens.len() - 1 | ||
} else { | ||
self.vocab_size as usize - self.special_tokens.len() | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I should have seen that earlier, do you mind extracting this from the loop (it can be defined on top.
This will likely be optimized away, but it still help readability too.
I can take care of it if you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should have found it out...
fix #950