Current state

Training a `Model` starts by computing the word counts from the training corpus, in the `Tokenizer`. We then provide these word counts to the relevant `Trainer` to start the training. This has several limitations:

- Word counts are not always the best starting point for training a `Model`.
- It prevents streaming the corpus directly into the `Trainer`, forcing us to build an initial representation in memory first. This is limiting for big datasets. Sometimes a `Trainer` can directly build a better representation, effectively reducing the memory footprint.
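To make the memory limitation concrete, here is a minimal sketch of the current flow, where the full word-count map must be materialized before the `Trainer` ever sees it. Function and type names are illustrative assumptions, not the actual `tokenizers` API:

```rust
use std::collections::HashMap;

// Hypothetical sketch: the Tokenizer pre-computes word counts over the
// whole corpus and only then hands the complete map to the Trainer.
// The entire map lives in memory regardless of what the Trainer needs.
fn compute_word_counts(corpus: &[&str]) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for line in corpus {
        for word in line.split_whitespace() {
            *counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    let corpus = ["the cat", "the dog"];
    let counts = compute_word_counts(&corpus);
    assert_eq!(counts["the"], 2);
    assert_eq!(counts.len(), 3);
}
```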
Goal
Change the `Trainer` API to:

- Feed it with `&str` directly.
- Leave it the responsibility of building its own representation.
- `train` should just take the `Model` to train.
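The goal above could be sketched as a trait that is fed raw `&str` sequences and keeps whatever internal representation suits it, with `train` taking only the `Model`. This is a rough illustration under assumed names (`Trainer::feed`, `CountingTrainer`, `VecModel`), not the final `tokenizers` API:

```rust
use std::collections::HashMap;

// Hypothetical Model interface; real models (BPE, WordPiece, ...) differ.
trait Model {
    fn add_token(&mut self, token: &str);
}

// Proposed shape: feed raw text in, let the Trainer own its representation.
trait Trainer {
    fn feed(&mut self, sequence: &str);      // stream text in, piece by piece
    fn train(&self, model: &mut dyn Model);  // only needs the Model to train
}

// One possible Trainer that happens to keep word counts internally;
// another Trainer could build a different, more compact representation.
struct CountingTrainer {
    counts: HashMap<String, u64>,
}

impl CountingTrainer {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }
}

impl Trainer for CountingTrainer {
    fn feed(&mut self, sequence: &str) {
        for word in sequence.split_whitespace() {
            *self.counts.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    fn train(&self, model: &mut dyn Model) {
        for token in self.counts.keys() {
            model.add_token(token);
        }
    }
}

// Toy Model that just collects tokens, for illustration.
struct VecModel {
    vocab: Vec<String>,
}

impl Model for VecModel {
    fn add_token(&mut self, token: &str) {
        self.vocab.push(token.to_string());
    }
}

fn main() {
    let mut trainer = CountingTrainer::new();
    // The corpus can be streamed line by line; nothing forces us to
    // hold a full pre-built representation outside the Trainer.
    trainer.feed("the cat");
    trainer.feed("the dog");
    let mut model = VecModel { vocab: Vec::new() };
    trainer.train(&mut model);
    assert_eq!(model.vocab.len(), 3); // "the", "cat", "dog"
}
```

With this shape, streaming large corpora becomes natural: the caller loops over the input and calls `feed`, and each `Trainer` decides what to retain in memory.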