"Solution" to memory hogging in train_new_from_iterator with a hack #1546
Thanks!
Hi, I am also having a similar issue while training a tokenizer on the RefinedWeb dataset. Similar to that issue, I am using the train_from_iterator() function to train a SentencePieceBPETokenizer(), and during the "Pre-processing sequences" step I ran out of memory. I don't believe switching to another tokenizer serves my needs, and it would be great if you could provide some insights on what I can do if I want to train on large datasets.
Can you give a short script to reproduce the problem?
Sure, the code below throws an out-of-memory error; process_files is the list of paths to the arrow files of the RefinedWeb dataset downloaded from HF:
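A minimal sketch along these lines, assuming the arrow shards are loaded with `datasets.Dataset.from_file` and that the text column is named `content` (paths and training arguments are illustrative):

```python
from datasets import Dataset
from tokenizers import SentencePieceBPETokenizer

# Hypothetical list of downloaded RefinedWeb arrow shards.
process_files = [
    "refinedweb/train-00000-of-05534.arrow",
    "refinedweb/train-00001-of-05534.arrow",
]

def batch_iterator(files, batch_size=1000):
    # Stream raw text out of each arrow shard in batches so the whole
    # corpus never has to sit in a single Python list at once.
    for path in files:
        ds = Dataset.from_file(path)
        for i in range(0, len(ds), batch_size):
            yield ds[i : i + batch_size]["content"]

tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    batch_iterator(process_files),
    vocab_size=32_000,
    min_frequency=2,
)
tokenizer.save("refinedweb-spm-bpe.json")
```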
I just ran the script and didn't observe any OOM. I used a 30 vCPU / 240 GB AMD EPYC server. The memory consumption increased linearly during the count_pairs stage, and the maximum was about 60%. Caveats: I used the FineWeb-Edu-10BT sample, saved as parquet files instead of arrow. Hope that helps.
There is an ongoing PR for arrow support; if that helps, we should definitely merge it!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
Hi
So I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (FineWeb-10BT sample: 15 million documents with an average length of 2300 characters). After the first step, "Pre-processing sequences", the "Tokenize words" step would take over an hour, and I ran out of RAM (780 GB). I distinctly remember that when I trained on a similarly sized (but different) corpus a few days back, this step took only around 1 minute.
After going through all the help I could find on the internet here, here, and here, and changing the server (upgrading RAM) multiple times, nothing worked. Finally, I found that I had used a different old tokenizer, "meta-llama/Meta-Llama-3-8B", in my previous runs. I changed it, and everything started working with the same processing time (~1 minute) and no memory hogging.
I'm not exactly sure why this matters, but I'm putting it here for someone more experienced to look into; hopefully it helps someone.
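For concreteness, a minimal sketch of the setup described above, assuming the FineWeb sample is loaded via the `datasets` library (`sample-10BT` config, `text` column) and an illustrative vocab size of 32k:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Base tokenizer: swapping Llama-2-7b-hf for Meta-Llama-3-8B is what made
# the "Tokenize words" step fast again in my runs.
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# FineWeb 10BT sample; yield the text column in batches.
dataset = load_dataset("HuggingFaceFW/fineweb", "sample-10BT", split="train")

def batch_iterator(batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(),
    vocab_size=32_000,
    length=len(dataset),  # lets the progress bars show an ETA
)
new_tokenizer.save_pretrained("fineweb-llama3-bpe")
```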