"Solution" to memory hogging in train_new_from_iterator with a hack #1546

Open
morphpiece opened this issue Jun 4, 2024 · 7 comments

Comments

@morphpiece

Hi

I was training a new tokenizer from the Llama tokenizer (meta-llama/Llama-2-7b-hf) on a medium-sized corpus (FineWeb-10BT sample: 15 million documents with an average length of 2,300 characters). After the first step, "Pre-processing sequences", the "tokenize words" step took over an hour and then ran out of RAM (780 GB). I distinctly remember that when I trained on a similar-sized (but different) corpus a few days earlier, this step took only around 1 minute.

I went through all the help I could find on the internet (here, here, and here) and changed the server (upgrading RAM) multiple times, but nothing worked. Finally I found that I had used a different old tokenizer, "meta-llama/Meta-Llama-3-8B", in my previous runs. I switched back to it and everything started working, with the same processing time (~1 minute) and no memory hogging.

Not exactly sure why this matters, but putting it here for someone more experienced to look into it and hopefully it helps someone.
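
For readers who want to try the same swap, here is a minimal sketch of the workaround, assuming the corpus is available as an iterable of strings. `batched` and `retrain` are hypothetical helper names, and the `vocab_size` default is a placeholder because the value used in the original runs is not stated. `AutoTokenizer.from_pretrained` and `train_new_from_iterator` are the `transformers` calls the issue title refers to.

```python
from typing import Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield lists of up to batch_size documents from any iterable of strings."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def retrain(
    texts: Iterable[str],
    base_model: str = "meta-llama/Meta-Llama-3-8B",
    vocab_size: int = 32_000,  # placeholder, not taken from the issue
):
    """Train a new tokenizer that reuses the pipeline of base_model."""
    # Imported lazily so that batched() works without transformers installed.
    from transformers import AutoTokenizer

    old_tokenizer = AutoTokenizer.from_pretrained(base_model)
    return old_tokenizer.train_new_from_iterator(batched(texts), vocab_size=vocab_size)
```

Passing `base_model="meta-llama/Llama-2-7b-hf"` instead should correspond to the slow, memory-hungry configuration described above, although that has only been observed on this one corpus.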

@ArthurZucker
Collaborator

Thanks!
Probably because the new tokenizer does not use the normalizer / uses only a pre-tokenizer? Not super sure but thanks for sharing! 🤗
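
One way to check this guess is to print which pipeline components each base tokenizer actually carries. The sketch below assumes a fast tokenizer from `transformers`, whose `backend_tokenizer` attribute exposes `normalizer` and `pre_tokenizer`; `describe_pipeline` and `compare` are hypothetical helper names. If a tokenizer turns out to have no pre-tokenizer, it seems plausible that whole documents reach the word-counting stage unsplit, but that is an assumption, not something confirmed in this thread.

```python
from typing import Dict, Iterable, Optional


def describe_pipeline(tokenizer) -> Dict[str, Optional[str]]:
    """Return the class names of the normalizer and pre-tokenizer (None if unset)."""
    backend = tokenizer.backend_tokenizer

    def name(component) -> Optional[str]:
        return None if component is None else type(component).__name__

    return {
        "normalizer": name(backend.normalizer),
        "pre_tokenizer": name(backend.pre_tokenizer),
    }


def compare(models: Iterable[str]) -> None:
    """Print the pipeline summary for each model name (downloads the tokenizers)."""
    from transformers import AutoTokenizer  # lazy import, only needed here

    for model in models:
        print(model, describe_pipeline(AutoTokenizer.from_pretrained(model)))
```

Calling `compare(["meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"])` would show the two tokenizers from this issue side by side.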

@purefall

purefall commented Jun 7, 2024

Hi, I am having a similar issue while training a tokenizer on the RefinedWeb dataset.

As in this issue, I am using the train_from_iterator() function, in my case to train a SentencePieceBPETokenizer(), and I run out of memory during the "Pre-processing sequences" step.

I don't believe switching to another tokenizer serves my needs, so it would be great if you could provide some insight into what I can do to train on large datasets.

@morphpiece
Author

Can you give a short script to reproduce the problem?

@purefall

purefall commented Jun 9, 2024

Sure, the code below throws an out-of-memory error. process_files is the list of paths to the arrow files of the RefinedWeb dataset downloaded from HF:

from datasets import load_dataset
from tokenizers import SentencePieceBPETokenizer

# process_files: list of paths to the downloaded arrow files (defined elsewhere)
data_files = {"train": process_files}

# Stream the dataset so that it is not loaded into memory all at once
dataset = load_dataset("arrow", data_files=data_files, split="train", streaming=True)

# Build an iterator over this dataset (input_sentence_size is currently unused)
def batch_iterator(input_sentence_size=None, batch_size=1000):
    for elem in dataset.iter(batch_size=batch_size):
        yield elem["content"]


tokenizer = SentencePieceBPETokenizer()
tokenizer.train_from_iterator(
    iterator=batch_iterator(),
    vocab_size=30_000,
    min_frequency=5,
    show_progress=True,
    limit_alphabet=500,
)
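
The `input_sentence_size` argument in the script above is never used. One way to bound memory on a large streaming dataset is to cap how many documents reach the trainer. Here is a minimal sketch, assuming the argument was meant as such a cap (that intent is a guess). `capped_batch_iterator` is a hypothetical name, and it accepts any iterable of dict batches shaped like the output of `dataset.iter(batch_size=...)`.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def capped_batch_iterator(
    batches: Iterable[Dict[str, List[str]]],
    column: str = "content",
    max_documents: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield text batches, stopping once max_documents documents have been emitted."""
    emitted = 0
    for batch in batches:
        texts = batch[column]
        if max_documents is not None:
            remaining = max_documents - emitted
            if remaining <= 0:
                return
            texts = texts[:remaining]  # trim the last batch to the cap
        emitted += len(texts)
        yield texts
```

With the script above it could be used as `iterator=capped_batch_iterator(dataset.iter(batch_size=1000), max_documents=1_000_000)`, where the cap is an illustrative number. Note that this takes the first documents of the stream rather than a random sample.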

@morphpiece
Author

I just ran the script and didn't observe any OOM.

I used a 30 vCPU/240 GB AMD-Epyc server. The memory consumption increased linearly during count_pairs stage and the max was about 60%.
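
To make memory figures like this comparable between machines, the peak resident memory of the training process can be logged from Python. This is a minimal sketch using only the standard library, assuming a Unix system; `peak_rss_mb` is a hypothetical helper name. Since the training runs inside the same process, the value should include the memory used during the count_pairs stage.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Calling `print(f"peak RSS: {peak_rss_mb():.0f} MiB")` right after `train_from_iterator` returns gives a single number to report alongside the machine size.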

Caveat: I used the FineWeb-Edu-10BT sample, saved as parquet files instead of arrow.

Hope that helps.

@ArthurZucker
Collaborator

There is an ongoing PR for arrow support; if that helps, we should def merge it!


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jul 22, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 28, 2024
@ArthurZucker ArthurZucker reopened this Jul 31, 2024
@github-actions github-actions bot removed the Stale label Aug 1, 2024