Fix word level tokenizer determinism #718

Merged: 3 commits, Aug 13, 2021

lucacampanella
Contributor

Fixes #717
If two word-level tokens have the same count, the tie is now broken by comparing the tokens themselves alphabetically. This makes the resulting tokenizer deterministic.
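A minimal sketch of this tie-breaking rule, for illustration only: the function names `ordered_tokens` and `assign_ids` are hypothetical and this is not the code merged in this PR. It assumes counts live in a `HashMap`, whose iteration order is not stable across runs, and that tokens are sorted by descending count before ids are assigned. Rust compares `String` values byte-wise, so "alphabetically" here means lexicographic byte order.

```rust
use std::collections::HashMap;

/// Hypothetical helper: order (token, count) pairs by descending count,
/// breaking ties by the token itself so the result does not depend on
/// the HashMap's iteration order.
pub fn ordered_tokens(counts: &HashMap<String, u64>) -> Vec<(String, u64)> {
    let mut entries: Vec<(String, u64)> = counts
        .iter()
        .map(|(token, count)| (token.clone(), *count))
        .collect();
    // Primary key: count, highest first. Secondary key: token, ascending.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    entries
}

/// Hypothetical helper: assign ids in sorted order, so equal inputs
/// always produce the same token-to-id mapping.
pub fn assign_ids(ordered: &[(String, u64)]) -> HashMap<String, u32> {
    ordered
        .iter()
        .enumerate()
        .map(|(id, (token, _))| (token.clone(), id as u32))
        .collect()
}
```

Without the secondary key, tokens with equal counts would keep whatever relative order the map iteration produced, which can differ between runs.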

I have only tested the code in the online Rust playground, as setting everything up on my system would take too much time. I therefore haven't run any of the project's tests.

Thanks for taking a look at this. :)

@lucacampanella
Contributor Author

Hi @n1t0, did you have a chance to have a look at this?
Thanks a lot :)

n1t0 force-pushed the fix_word_level_determinism branch from 5e35944 to 9538bde on August 13, 2021, 14:21
Member

n1t0 left a comment

Awesome, thank you for taking care of this @lucacampanella!

I just fixed a few Clippy warnings and added a missing import at the beginning. Everything is working perfectly.

n1t0 merged commit e7dd643 into huggingface:master on Aug 13, 2021
Linked issue (may be closed by merging this pull request): WordLevelTrainer not deterministic