Improve READMEs for from_pretrained
n1t0 committed Aug 31, 2021
1 parent a4d0f3d commit ad7090a
Showing 4 changed files with 43 additions and 0 deletions.
8 changes: 8 additions & 0 deletions bindings/python/README.md
@@ -68,6 +68,14 @@ pip install setuptools_rust
python setup.py install
```

### Load a pretrained tokenizer from the Hub

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```

### Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
14 changes: 14 additions & 0 deletions tokenizers/README.md
@@ -33,6 +33,20 @@ The various steps of the pipeline are:
4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
that, for example, a language model would need, such as special tokens.
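The pipeline steps above can be seen end-to-end in a small offline sketch using the project's Python bindings (the component names mirror the Rust traits; the tiny vocabulary below is made up purely for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# A toy vocabulary, purely for illustration
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hey": 3, "there": 4, "!": 5}

tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))  # the Model
tokenizer.normalizer = Lowercase()                          # the Normalizer
tokenizer.pre_tokenizer = Whitespace()                      # the PreTokenizer
tokenizer.post_processor = TemplateProcessing(              # the PostProcessor
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

encoding = tokenizer.encode("Hey there!")
print(encoding.tokens)  # ['[CLS]', 'hey', 'there', '!', '[SEP]']
```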

### Loading a pretrained tokenizer from the Hub
```rust
use tokenizers::tokenizer::{Result, Tokenizer};

fn main() -> Result<()> {
    let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;

    let encoding = tokenizer.encode("Hey there!", false)?;
    println!("{:?}", encoding.get_tokens());

    Ok(())
}
```

### Deserialization and tokenization example

```rust
// …
```
14 changes: 14 additions & 0 deletions tokenizers/src/lib.rs
@@ -21,6 +21,20 @@
//! 4. The `PostProcessor`: in charge of post-processing the `Encoding` to add anything relevant
//! that, for example, a language model would need, such as special tokens.
//!
//! ## Loading a pretrained tokenizer from the Hub
//! ```no_run
//! use tokenizers::tokenizer::{Result, Tokenizer};
//!
//! fn main() -> Result<()> {
//!     let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None)?;
//!
//!     let encoding = tokenizer.encode("Hey there!", false)?;
//!     println!("{:?}", encoding.get_tokens());
//!
//!     Ok(())
//! }
//! ```
//!
//! ## Deserialization and tokenization example
//!
//! ```no_run
//! // …
//! ```
7 changes: 7 additions & 0 deletions tokenizers/src/tokenizer/mod.rs
@@ -397,6 +397,13 @@ impl Tokenizer {
        let content = read_to_string(file)?;
        Ok(serde_json::from_str(&content)?)
    }

    pub fn from_pretrained<S: AsRef<str>>(
        identifier: S,
        params: Option<FromPretrainedParameters>,
    ) -> Result<Self> {
        let tokenizer_file = from_pretrained(identifier, params)?;
        Tokenizer::from_file(tokenizer_file)
    }
}

impl std::str::FromStr for Tokenizer {
Expand Down
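The new `from_pretrained` method resolves the identifier to a locally cached `tokenizer.json` and then delegates to `from_file`. The same two-step flow can be sketched from Python using the `huggingface_hub` package (an assumption of this sketch, not part of this commit; needs network access):

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

# Step 1: resolve the identifier to a local tokenizer.json (downloaded and cached)
tokenizer_file = hf_hub_download(repo_id="bert-base-cased", filename="tokenizer.json")

# Step 2: load it from disk, which is what Tokenizer::from_file does in Rust
tokenizer = Tokenizer.from_file(tokenizer_file)
print(tokenizer.get_vocab_size())
```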
