
unify data loading from HF and from disk #287

Merged
merged 3 commits on Apr 30, 2024

Conversation

@tianyu-l (Contributor) commented Apr 30, 2024

Stack from ghstack (oldest at bottom):

As titled. We can just use the HF load_dataset API to unify the different use cases.

  1. load_dataset is flexible in that it can take either an HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
  2. According to this post,

load_dataset works in three steps: download the dataset, then prepare it as an arrow dataset, and finally return a memory mapped arrow dataset. In particular it creates a cache directory to store the arrow data and the subsequent cache files for map.

  3. The previously used load_from_disk can only load a dataset saved by save_to_disk (in arrow format), which can be viewed as a way to load a "preprocessed" dataset:

load_from_disk directly returns a memory mapped dataset from the arrow file (similar to Dataset.from_file). It doesn't create a cache directory; instead, all subsequent map calls write in the same directory as the original data.

  4. For large datasets (which cannot fit in memory), we need to set streaming=True for load_dataset, even if the data is stored in a local directory. One might think load_from_disk is better because of point 3 above; however, to preprocess a huge dataset and call save_to_disk, one would need to load it into memory in the first place.
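
The "memory mapped" behavior described in the quoted post can be illustrated with a plain-Python sketch (stdlib mmap only; the datasets library does this over arrow files internally, so this is an analogy, not its implementation):

```python
import mmap
import os
import tempfile

# Plain-Python illustration of memory mapping, the mechanism the quoted
# post describes for arrow files: the file stays on disk and the OS pages
# bytes in on access, so data larger than RAM can still be indexed.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"sample bytes on disk")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[:6]  # served via the page cache, not a full read into memory
    mm.close()
```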

For all the reasons listed above, let's not use load_from_disk, which assumes preprocessed data in arrow format.

Let's use load_dataset, which supports common data formats, and set streaming=True for large datasets, whether the data comes from the HF hub or from local disk.
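
As a sketch of the unified call (the helper name and the "allenai/c4" path here are illustrative, not from the PR; the argument names follow the datasets load_dataset API):

```python
def build_load_kwargs(dataset_path, large=False, subset=None):
    """Assemble arguments for datasets.load_dataset so the same call
    works for both an HF hub repo id and a local directory path."""
    kwargs = {"split": "train"}
    if subset is not None:
        kwargs["name"] = subset       # e.g. language selection for c4
    if large:
        kwargs["streaming"] = True    # avoid materializing a huge dataset
    return dataset_path, kwargs

# ds = load_dataset(path, **kwargs) is then uniform across use cases:
path, kwargs = build_load_kwargs("allenai/c4", large=True, subset="en")
```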

P.S.:

  1. This PR updates the data file from arrow to json, while keeping the same data (first 45,000 entries of c4).
  2. c4 is now available for running large-scale experiments; performance has been verified.
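
The arrow-to-json conversion mentioned in point 1 can be sketched in plain Python (JSON-lines layout, which load_dataset reads from .json files; the datasets library also offers Dataset.to_json for this — the helper below is only an illustration):

```python
import io
import json

def write_json_lines(records, fh):
    # One JSON object per line: a .json layout load_dataset can read,
    # so converting the arrow-format sample keeps the same data.
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

buf = io.StringIO()
write_json_lines([{"text": "hello"}, {"text": "world"}], buf)
```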

tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 63f2c0e8ab7d5c74db1e40ac51598c5df0e50039
Pull Request resolved: #287
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Apr 30, 2024
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: a8f52287ade9d0ebeafa5e113eacb46c37745de0
Pull Request resolved: #287
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: #287
@tianyu-l tianyu-l requested review from wanchaol, lessw2020 and awgu April 30, 2024 19:57
@wanchaol (Collaborator) left a comment:

please see comment

if dataset_name == "c4":
    # c4 is huge, and requires both streaming and language selection
    # (we default to en)
    ds = load_dataset(dataset_path, name="en", split="train", streaming=True)
Collaborator

can we add streaming as an arg to HuggingFaceDataset, and then when creating the dataloader for c4, just pass streaming=True to the HuggingFaceDataset constructor?

Contributor Author

We certainly can, although this would spread if-c4 branches to other places as well, and add more args to function calls in general. Since we only have two datasets right now, maybe let's keep it this way until the change becomes necessary.
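
For reference, the reviewer's alternative could look roughly like this (class and argument names are assumed from the snippet above; this is a sketch of the suggestion, not the merged code):

```python
class HuggingFaceDataset:
    """Sketch of the suggested alternative: take streaming as a
    constructor arg, so the caller decides per dataset instead of the
    loader branching on the dataset name internally."""

    def __init__(self, dataset_path, subset=None, streaming=False):
        self.dataset_path = dataset_path
        self.subset = subset
        self.streaming = streaming
        # The loader would then call, uniformly:
        # self.ds = load_dataset(dataset_path, name=subset,
        #                        split="train", streaming=streaming)

# The c4 call site would pass streaming=True explicitly:
c4 = HuggingFaceDataset("allenai/c4", subset="en", streaming=True)
```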

@tianyu-l tianyu-l merged commit 9a611e2 into gh/tianyu-l/10/base Apr 30, 2024
4 checks passed
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: #287
@tianyu-l tianyu-l deleted the gh/tianyu-l/10/head branch April 30, 2024 22:23
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: pytorch#287
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: pytorch#287