
Fixing the vocab size of the trained Unigram model #952

Merged
merged 6 commits into huggingface:master on Mar 18, 2022

Conversation

@kaisugi (Contributor) commented Mar 16, 2022

fix #950

@Narsil (Collaborator) commented Mar 16, 2022

I think this PR is nice and easy.

However, it is a breaking change, so do you mind adding a test for this new behavior (i.e., a test that fails without this PR)?
Adding an assert here should roughly be enough, IMO:
https://github.com/huggingface/tokenizers/blob/master/bindings/python/tests/bindings/test_trainers.py#L224
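
A sketch of the kind of assertion meant here (the vocab_size value is made up, and this assumes the test trains with some special tokens so that the old behavior exceeds the requested size):

# Hypothetical final assertion for that test, assuming its trainer was created
# with vocab_size=100: after training, the tokenizer should have exactly 100
# entries, special tokens included (on master it would have more).
assert tokenizer.get_vocab_size() == 100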

@Narsil (Collaborator) commented Mar 16, 2022

(So we know when we break it later.)

@kaisugi (Contributor, Author) commented Mar 16, 2022

@Narsil

I have a question about where we should run the pytest command: in bindings/python/ or bindings/python/tests?

I ask because some of the test files reference paths like tests/data/xxx while others use data/xxx, for example:

filename = "tests/data/unigram_trained.json"

tokenizer.from_file("data/tokenizer.json")

If we want to run the tests from bindings/python/, all the paths would need to be tests/data/xxx.

@Narsil (Collaborator) commented Mar 17, 2022

You can run make test in the bindings directory to run the tests properly. It will also download the necessary files.

data/ is for static data needed by the tests (and ignored by git); tests/data/ is where some files tracked in git live.

@kaisugi (Contributor, Author) commented Mar 17, 2022

Oh, I didn't notice the Makefile, thanks!

@kaisugi (Contributor, Author) commented Mar 17, 2022

Implemented the test!

@Narsil (Collaborator) commented Mar 17, 2022

Hey, can you change this test instead: https://github.com/huggingface/tokenizers/blob/master/bindings/python/tests/bindings/test_trainers.py#L191 ? It's specifically used to test training with special tokens; it just lacks a vocab_size check.

It's better if we keep the simple test for the simple case.
If we didn't already have a test targeting training with special tokens, we would have created one.
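
For context, a sketch of what that check could look like (the data, parameters, and token names here are illustrative, not the actual test in the repository):

from tokenizers import Tokenizer, models, trainers

# Train a small Unigram model with special tokens and a requested vocab_size.
tokenizer = Tokenizer(models.Unigram())
trainer = trainers.UnigramTrainer(
    vocab_size=100,
    special_tokens=["[PAD]", "[SEP]", "[CLS]", "[UNK]", "[MASK]"],
    unk_token="[UNK]",
    show_progress=False,
)
tokenizer.train_from_iterator(
    ["a few short sentences", "just to have something to train on", "for this sketch"],
    trainer=trainer,
)

# Before this PR the special tokens were added on top of the trained pieces,
# so the vocabulary could exceed the requested size; with the fix it cannot.
assert tokenizer.get_vocab_size() <= 100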

@kaisugi (Contributor, Author) commented Mar 17, 2022

I discovered that if unk_token is not included in special_tokens, we have to reduce the vocab size by one more token, so the new code takes care of both cases.
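
A rough illustration of the accounting (made-up numbers, mirroring the logic in the snippet under review below):

# Requested size and special tokens passed to the trainer:
vocab_size = 100
special_tokens = ["[PAD]", "[SEP]"]  # "[UNK]" is not listed here
# unk_token="[UNK]" still has to be inserted by the trainer, so it uses one slot:
pieces_to_learn = vocab_size - len(special_tokens) - 1  # = 97
# If "[UNK]" were included in special_tokens (3 entries), no extra slot is needed:
pieces_to_learn = vocab_size - 3  # = 97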

@Narsil (Collaborator) left a comment

Last nit, otherwise it looks very good!

Comment on lines 153 to 157
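// Number of pieces the trainer should actually learn: the requested vocab_size
// minus the special tokens, minus one extra slot when unk_token still has to be
// added because it is not already among the special tokens.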
let vocab_size_without_special_tokens = if need_add_unk {
self.vocab_size as usize - self.special_tokens.len() - 1
} else {
self.vocab_size as usize - self.special_tokens.len()
};
@Narsil (Collaborator)

Sorry, I should have seen this earlier: do you mind extracting this from the loop? It can be defined at the top.
This will likely be optimized away anyway, but it still helps readability.

I can take care of it if you want.

@kaisugi (Contributor, Author)

I should have noticed that myself...

@Narsil Narsil merged commit 1bb9884 into huggingface:master Mar 18, 2022
Successfully merging this pull request may close these issues.

Different behaviors in adjusting vocab sizes when training WordPiece and Unigram