
Fixing the vocab size of the trained Unigram model #952

Merged
merged 6 commits into huggingface:master on Mar 18, 2022

Conversation

@kaisugi (Contributor) commented Mar 16, 2022

fix #950

@Narsil (Collaborator) commented Mar 16, 2022

I think this PR is nice and easy.

However, it is a breaking change, so do you mind adding a test for this new behavior (i.e., a test that fails without this PR)?
Adding an assert here should roughly be enough, IMO:
https://github.com/huggingface/tokenizers/blob/master/bindings/python/tests/bindings/test_trainers.py#L224
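
A sketch of the kind of assertion meant here (the vocab_size value is made up, and this assumes the test trains with some special tokens so that the old behavior exceeds the requested size):

# Hypothetical final assertion for that test, assuming its trainer was created
# with vocab_size=100: after training, the tokenizer should have exactly 100
# entries, special tokens included (on master it would have more).
assert tokenizer.get_vocab_size() == 100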

@Narsil (Collaborator) commented Mar 16, 2022

(So we know when we break it later.)

@kaisugi (Contributor, Author) commented Mar 16, 2022

@Narsil

I have a question about where we should run the pytest command: in bindings/python/ or bindings/python/tests?

I ask because some of the test files reference paths like tests/data/xxx while others use data/xxx, for example:

filename = "tests/data/unigram_trained.json"

tokenizer.from_file("data/tokenizer.json")

If we want to run the tests from bindings/python/, all the paths would need to be tests/data/xxx.

@Narsil (Collaborator) commented Mar 17, 2022

You can run make test in the bindings directory to run the tests properly. It will also download the necessary files.

data/ is for static data needed by the tests (and ignored by git); tests/data/ is where some files tracked in git live.

@kaisugi (Contributor, Author) commented Mar 17, 2022

Oh, I didn't notice the Makefile, thanks!

@kaisugi (Contributor, Author) commented Mar 17, 2022

Implemented the test!

@Narsil (Collaborator) commented Mar 17, 2022

Hey, can you change this test instead: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tests/bindings/test_trainers.py#L191 ? It's specifically used to test training with special tokens; it just lacks a vocab_size check.

It's better if we keep the simple test for the simple case.
If we didn't already have a test targeting training with special tokens, we would have created one.
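
For context, a sketch of what that check could look like (the data, parameters, and token names here are illustrative, not the actual test in the repository):

from tokenizers import Tokenizer, models, trainers

# Train a small Unigram model with special tokens and a requested vocab_size.
tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=100,
    special_tokens=["[PAD]", "[SEP]", "[CLS]", "[UNK]", "[MASK]"],
    unk_token="[UNK]",
    show_progress=False,
)
tokenizer.train_from_iterator(
    ["a few short sentences", "just to have something to train on", "for this sketch"],
    trainer=trainer,
)

# Before this PR the special tokens were added on top of the trained pieces,
# so the vocabulary could exceed the requested size; with the fix it cannot.
assert tokenizer.get_vocab_size() <= 100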

@kaisugi (Contributor, Author) commented Mar 17, 2022

I discovered that if unk_token is not included in special_tokens, we have to reduce the vocab size by one more token, so the new code takes care of both cases.
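
A rough illustration of the accounting (made-up numbers, mirroring the logic in the snippet under review below):

# Requested size and special tokens passed to the trainer:
vocab_size = 100
special_tokens = ["[PAD]", "[SEP]"]  # "[UNK]" is not listed here
# unk_token="[UNK]" still has to be inserted by the trainer, so it uses one slot:
pieces_to_learn = vocab_size - len(special_tokens) - 1  # = 97
# If "[UNK]" were included in special_tokens (3 entries), no extra slot is needed:
pieces_to_learn = vocab_size - 3  # = 97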

@Narsil (Collaborator) left a comment

Last nit, otherwise it looks very good!

Comment on lines 153 to 157
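// Number of pieces the trainer should actually learn: the requested vocab_size
// minus the special tokens, minus one extra slot when unk_token still has to be
// added because it is not already among the special tokens.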
let vocab_size_without_special_tokens = if need_add_unk {
self.vocab_size as usize - self.special_tokens.len() - 1
} else {
self.vocab_size as usize - self.special_tokens.len()
};
@Narsil (Collaborator)

Sorry, I should have seen this earlier: do you mind extracting this from the loop? It can be defined at the top.
This will likely be optimized away anyway, but it still helps readability.

I can take care of it if you want.

@kaisugi (Contributor, Author)

I should have noticed that myself...

@Narsil Narsil merged commit 1bb9884 into huggingface:master Mar 18, 2022
Successfully merging this pull request may close these issues.

Different behaviors in adjusting vocab sizes when training WordPiece and Unigram