
unify data loading from HF and from disk #287

Merged
merged 3 commits on Apr 30, 2024

Conversation

@tianyu-l (Contributor) commented Apr 30, 2024

Stack from ghstack (oldest at bottom):

As titled. We can just use the HF load_dataset API to unify the different use cases.

  1. load_dataset is flexible in that it can take either an HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
  2. According to this post,

load_dataset works in three steps: download the dataset, then prepare it as an arrow dataset, and finally return a memory mapped arrow dataset. In particular it creates a cache directory to store the arrow data and the subsequent cache files for map.

  3. The previously used load_from_disk can only load a dataset saved by save_to_disk (in arrow format), which can be viewed as a way to load a "preprocessed" dataset:

load_from_disk directly returns a memory mapped dataset from the arrow file (similar to Dataset.from_file). It doesn't create a cache directory; instead, all subsequent map calls write in the same directory as the original data.

  4. For large datasets (which cannot fit in memory), we need to set streaming=True for load_dataset, even if the data is stored in a local directory. One might think load_from_disk is better because of point 3 above; however, to preprocess a huge dataset and call save_to_disk, one would need to load it into memory in the first place.
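
The "memory mapped" behavior described in the quoted post can be illustrated with a plain-Python sketch (stdlib mmap only; the datasets library does this over arrow files internally, so this is an analogy, not its implementation):

```python
import mmap
import os
import tempfile

# Plain-Python illustration of memory mapping, the mechanism the quoted
# post describes for arrow files: the file stays on disk and the OS pages
# bytes in on access, so data larger than RAM can still be indexed.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"sample bytes on disk")

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = mm[:6]  # served via the page cache, not a full read into memory
    mm.close()
```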

For all the reasons listed above, let's not use load_from_disk, which assumes preprocessed data in arrow format.

Let's use load_dataset, which supports common data formats, and set streaming=True for large datasets, whether the data comes from the HF hub or from local disk.
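
As a sketch of the unified call (the helper name and the "allenai/c4" path here are illustrative, not from the PR; the argument names follow the datasets load_dataset API):

```python
def build_load_kwargs(dataset_path, large=False, subset=None):
    """Assemble arguments for datasets.load_dataset so the same call
    works for both an HF hub repo id and a local directory path."""
    kwargs = {"split": "train"}
    if subset is not None:
        kwargs["name"] = subset       # e.g. language selection for c4
    if large:
        kwargs["streaming"] = True    # avoid materializing a huge dataset
    return dataset_path, kwargs

# ds = load_dataset(path, **kwargs) is then uniform across use cases:
path, kwargs = build_load_kwargs("allenai/c4", large=True, subset="en")
```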

P.S.:

  1. This PR updates the data file from arrow to json, while keeping the same data (first 45,000 entries of c4).
  2. c4 is now available for running large-scale experiments; performance has been verified.
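
The arrow-to-json conversion mentioned in point 1 can be sketched in plain Python (JSON-lines layout, which load_dataset reads from .json files; the datasets library also offers Dataset.to_json for this — the helper below is only an illustration):

```python
import io
import json

def write_json_lines(records, fh):
    # One JSON object per line: a .json layout load_dataset can read,
    # so converting the arrow-format sample keeps the same data.
    for rec in records:
        fh.write(json.dumps(rec) + "\n")

buf = io.StringIO()
write_json_lines([{"text": "hello"}, {"text": "world"}], buf)
```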

tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 63f2c0e8ab7d5c74db1e40ac51598c5df0e50039
Pull Request resolved: #287
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Apr 30, 2024
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: a8f52287ade9d0ebeafa5e113eacb46c37745de0
Pull Request resolved: #287
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: #287
@tianyu-l tianyu-l requested review from wanchaol, lessw2020 and awgu April 30, 2024 19:57
@wanchaol (Collaborator) left a comment:

please see comment

if dataset_name == "c4":
    # c4 is huge, and requires both streaming and language selection
    # (we default to en)
    ds = load_dataset(dataset_path, name="en", split="train", streaming=True)
Collaborator

can we add streaming as an arg to HuggingFaceDataset, and then when creating the dataloader for c4, just pass streaming=True to the HuggingFaceDataset constructor?

Contributor Author

We certainly can, although this would spread if-c4 branches to other places as well, and add more args to function calls in general. Since we only have two datasets right now, maybe let's keep it this way until the change becomes necessary.
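
For reference, the reviewer's alternative could look roughly like this (class and argument names are assumed from the snippet above; this is a sketch of the suggestion, not the merged code):

```python
class HuggingFaceDataset:
    """Sketch of the suggested alternative: take streaming as a
    constructor arg, so the caller decides per dataset instead of the
    loader branching on the dataset name internally."""

    def __init__(self, dataset_path, subset=None, streaming=False):
        self.dataset_path = dataset_path
        self.subset = subset
        self.streaming = streaming
        # The loader would then call, uniformly:
        # self.ds = load_dataset(dataset_path, name=subset,
        #                        split="train", streaming=streaming)

# The c4 call site would pass streaming=True explicitly:
c4 = HuggingFaceDataset("allenai/c4", subset="en", streaming=True)
```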

@tianyu-l tianyu-l merged commit 9a611e2 into gh/tianyu-l/10/base Apr 30, 2024
4 checks passed
tianyu-l added a commit that referenced this pull request Apr 30, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: #287
@tianyu-l tianyu-l deleted the gh/tianyu-l/10/head branch April 30, 2024 22:23
tianyu-l added a commit to tianyu-l/torchtitan_intern24 that referenced this pull request Aug 16, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: pytorch#287
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: 932e7cce828a15c788b34f07c264e119068777fe
Pull Request resolved: pytorch#287