"Solution" to memory hogging in train_new_from_iterator with a hack #1546
Thanks!
Hi, I am also having a similar issue while training a tokenizer on the RefinedWeb dataset. Similar to that issue, I am using the train_from_iterator() function to train a SentencePieceBPETokenizer(), and during the "Pre-processing sequences" step I ran out of memory. I don't believe switching to another tokenizer serves my needs, and it would be great if you could provide some insights on what I can do if I want to train on large datasets.
Can you give a short script to reproduce the problem?
Sure, the code below throws an out-of-memory error; process_files is the list of paths to the arrow files of the RefinedWeb dataset downloaded from HF:
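A minimal sketch along these lines, assuming the arrow shards are loaded with `datasets.Dataset.from_file` and that the text column is named `content` (paths and training arguments are illustrative):

```python
from datasets import Dataset
from tokenizers import SentencePieceBPETokenizer

# Hypothetical list of downloaded RefinedWeb arrow shards.
process_files = [
    "refinedweb/train-00000-of-05534.arrow",
    "refinedweb/train-00001-of-05534.arrow",
]

def batch_iterator(files, batch_size=1000):
    # Stream raw text out of each arrow shard in batches so the whole
    # corpus never has to sit in a single Python list at once.
    for path in files:
        ds = Dataset.from_file(path)
        for i in range(0, len(ds), batch_size):
            yield ds[i : i + batch_size]["content"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(process_files),
    vocab_size=32_000,
    min_frequency=2,
)
tokenizer.save("refinedweb-spm-bpe.json")
```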
I just ran the script and didn't observe any OOM. I used a 30 vCPU / 240 GB AMD EPYC server. The memory consumption increased linearly during the count_pairs stage, and the maximum was about 60%. Caveats: I used the FineWeb-Edu-10BT sample, saved as parquet files instead of arrow. Hope that helps.
There is an ongoing PR for arrow support; if that helps, we should definitely merge it!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
Hi
So I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (FineWeb-10BT sample: 15 million documents with an average length of 2300 characters). After the first step, "Pre-processing sequences", the "Tokenize words" step would take over an hour, and I ran out of RAM (780 GB). I distinctly remember that when I trained on a similarly sized (but different) corpus a few days back, this step took only around 1 minute.
After going through all the help I could find on the internet here, here, and here, and changing the server (upgrading RAM) multiple times, nothing worked. Finally, I found that I had used a different old tokenizer, "meta-llama/Meta-Llama-3-8B", in my previous runs. I changed it, and everything started working with the same processing time (~1 minute) and no memory hogging.
I'm not exactly sure why this matters, but I'm putting it here for someone more experienced to look into; hopefully it helps someone.
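For concreteness, a minimal sketch of the setup described above, assuming the FineWeb sample is loaded via the `datasets` library (`sample-10BT` config, `text` column) and an illustrative vocab size of 32k:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Base tokenizer: swapping Llama-2-7b-hf for Meta-Llama-3-8B is what made
# the "Tokenize words" step fast again in my runs.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# FineWeb 10BT sample; yield the text column in batches.
dataset = load_dataset("HuggingFaceFW/fineweb", "sample-10BT", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=32_000,
    length=len(dataset),  # lets the progress bars show an ETA
)
new_tokenizer.save_pretrained("fineweb-llama3-bpe")
```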