Add ability to instantiate tokenizers from the Hub (from_pretrained) #780

Merged
merged 7 commits into from
Aug 31, 2021

Member

@n1t0 n1t0 commented Aug 19, 2021

Similarly to what exists in transformers, add the ability to instantiate tokenizers from tokenizer.json files uploaded to the Hugging Face Hub.

Example:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained('bert-base-uncased')
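For readers wondering what such a call has to resolve, here is a minimal sketch of how a model identifier and revision could map to a tokenizer.json URL on the Hub. The helper name `tokenizer_json_url` and the `HUB_ENDPOINT` constant are hypothetical and not part of this PR; the sketch assumes the Hub's usual `/{identifier}/resolve/{revision}/{filename}` download layout.

```python
from urllib.parse import quote

# Assumed public Hub endpoint; not taken from this PR.
HUB_ENDPOINT = "https://huggingface.co"


def tokenizer_json_url(identifier: str, revision: str = "main") -> str:
    """Hypothetical helper: build the download URL of a model's tokenizer.json."""
    # Keep "/" so that namespaced identifiers such as "org/model" stay intact.
    safe_identifier = quote(identifier, safe="/")
    # Percent-encode the revision fully so it always occupies one path segment.
    safe_revision = quote(revision, safe="")
    return f"{HUB_ENDPOINT}/{safe_identifier}/resolve/{safe_revision}/tokenizer.json"


print(tokenizer_json_url("bert-base-uncased"))
```

The revision defaults to "main" here, mirroring the default shown in the Rust snippet under review.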

Collaborator

@Narsil Narsil left a comment

Shouldn't we add 1 test per binding + 1 for Rust at least?
It will depend on network + Hub, which is never great, but I think it could be helpful.

Also, we probably need to add it to the docs, no?

fn default() -> Self {
    Self {
        revision: "main".into(),
        user_agent: HashMap::new(),
Collaborator

Shouldn't this be "tokenizers", "rust", or something similar by default?

Member Author

It is: by default, the user agent always contains tokenizers/{RUST_VERSION}. This map is only for additional items.
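To illustrate the idea, a rough sketch of how a fixed `tokenizers/<version>` prefix could be combined with additional user-agent items. The function name `build_user_agent`, the `key/value` formatting, and the `"; "` separator are assumptions made for illustration; the actual format used by the Rust implementation may differ.

```python
from typing import Dict, Optional


def build_user_agent(version: str, extra: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical helper: default prefix first, then any additional items."""
    # The default part is always present, even when no extra items are given.
    parts = [f"tokenizers/{version}"]
    # Additional items are appended as "key/value" (assumed format).
    for key, value in (extra or {}).items():
        parts.append(f"{key}/{value}")
    # "; " as the separator is an assumption, not confirmed by this PR.
    return "; ".join(parts)


# "0.0.0" is a placeholder version, not a real release number.
print(build_user_agent("0.0.0", {"python": "3.9"}))
```

With an empty map, as in the `Default` implementation above, only the `tokenizers/<version>` part would be sent.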

@n1t0
Member Author

n1t0 commented Aug 24, 2021

> Shouldn't we add 1 test per binding + 1 for Rust at least?
> It will depend on network + Hub, which is never great, but I think it could be helpful.

Agreed.

> Also, we probably need to add it to the docs, no?

I've already added it in every place I thought of. Do you have something specific in mind?

@n1t0 n1t0 force-pushed the from_pretrained branch from 5b66aea to d1992fe August 24, 2021 09:31
@Narsil
Collaborator

Narsil commented Aug 24, 2021

> I've already added it in every place I thought of. Do you have something specific in mind?

I missed it in the quicktour; that was what I had in mind.

@n1t0 n1t0 merged commit 35c96e5 into master Aug 31, 2021
@n1t0 n1t0 deleted the from_pretrained branch August 31, 2021 13:00
2 participants