Training a Tokenizer from large texts read from memory #465
Comments
Sorry, no, not at the moment. How much memory do your 800M files represent? How much of that is the "text" field? Training algorithms need the whole dataset to be readable in order to learn (preprocessing helps avoid holding the whole array in memory when that is possible), so you would have to fit the whole thing in memory anyway. If it fits in memory, how come it can't fit on disk? Usually disk is cheaper than RAM, no? Just trying to understand your use case so we might design the training API better in the future.
Thanks Narsil. The 800M files represent around 800 GB on disk. I'm not sure about the memory. Since there is no way to load them into memory directly, we can definitely write the files to disk and train a tokenizer. Thanks again for your quick response.
Also, a quick tip: you usually don't need to train the tokenizer on the whole huge dataset. 1 GB to 10 GB should be more than enough to get good heuristics for your tokenizer. Going to the full dataset will only yield marginally better results.
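For anyone landing here later, a minimal sketch of that sampling step, using only the Python standard library. It assumes each file is a single JSON object with a string field named "text" (the issue does not say what the file format is), and the helper names `iter_text_fields` and `write_sample` are hypothetical, not part of any library. It streams over the file paths, keeps each file with probability `keep_prob`, and writes one text per line to a single file until a byte budget is reached. That file can then be handed to whatever file-based training you already use.

```python
import json
import random
from pathlib import Path
from typing import Iterable, Iterator


def iter_text_fields(paths: Iterable[Path], field: str = "text") -> Iterator[str]:
    """Yield the string stored under `field` in each JSON file.

    Assumption: one JSON object per file. Files that are unreadable,
    malformed, or lack a non-empty string under `field` are skipped.
    """
    for path in paths:
        try:
            with Path(path).open(encoding="utf-8") as fh:
                record = json.load(fh)
        except (OSError, ValueError):
            continue
        text = record.get(field) if isinstance(record, dict) else None
        if isinstance(text, str) and text:
            yield text


def write_sample(
    paths: Iterable[Path],
    out_path: Path,
    keep_prob: float,
    max_bytes: int,
    seed: int = 0,
) -> int:
    """Write a random sample of text fields to `out_path`, one per line.

    Each path is kept with probability `keep_prob`, so the paths never
    need to be held in memory all at once. Writing stops before the
    output would exceed `max_bytes`. Returns the number of bytes written.
    """
    rng = random.Random(seed)
    sampled = (p for p in paths if rng.random() < keep_prob)
    written = 0
    with open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for text in iter_text_fields(sampled):
            # Flatten newlines so that each document occupies one line.
            line = text.replace("\n", " ") + "\n"
            size = len(line.encode("utf-8"))
            if written + size > max_bytes:
                break
            out.write(line)
            written += size
    return written
```

With roughly 800 GB of input and a target of a few GB, a `keep_prob` around 0.01 combined with `max_bytes` set to the budget you want would be a reasonable starting point; both values are guesses to tune for your data.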
Thanks for the tip, Narsil. That would really help.
Duplicate of #198
I have around 800M files where the text field is one of the fields in each file. I would like to train a new tokenizer only on the text field. I cannot extract the text and write it to new files because of the huge volume of files. Is there any way to train a new tokenizer on this data by reading only the text fields from each of these files and passing them to the training process?
Thanks,
Ravi.