unify data loading from HF and from disk #287
Conversation
please see comment
if dataset_name == "c4":
    # c4 is huge, and requires both streaming and language selection
    # (we default to en)
    ds = load_dataset(dataset_path, name="en", split="train", streaming=True)
can we add a streaming arg to HuggingFaceDataset, and then, when creating the dataloader for c4, just pass streaming=True to the HuggingFaceDataset constructor?
We certainly can, although this would spread if-c4 statements to other places as well, and add more args to function calls in general. Since we only have two datasets right now, maybe let's keep it as is until the change becomes necessary.
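For illustration, the suggestion above could be sketched as follows. The helper and its signature here are assumptions for demonstration, not the actual torchtitan code; only the kwargs-building logic is shown, so the sketch runs without the datasets library.

```python
# Hypothetical sketch of the reviewer's suggestion: move `streaming` into the
# constructor signature so the call site, not the loader, decides per-dataset.

def build_load_kwargs(dataset_name, streaming=False):
    """Build the keyword arguments passed through to `datasets.load_dataset`."""
    kwargs = {"split": "train", "streaming": streaming}
    if dataset_name == "c4":
        # c4 still needs a language config regardless of who requests streaming
        kwargs["name"] = "en"
    return kwargs

# The c4 call site would then read something like:
#   HuggingFaceDataset(dataset_path, "c4", streaming=True)
print(build_load_kwargs("c4", streaming=True))
```

As the reply notes, the c4-specific language selection would still need an if-c4 branch somewhere, so the special case moves rather than disappears.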
Stack from ghstack (oldest at bottom):
As titled. We can just use the load_dataset HF API to unify different use cases.

1. load_dataset is flexible in that it can take a HF hub dataset repository or a local directory. The behavior is consistent as long as the underlying data is the same.
2. It supports common data formats such as .txt, .json, .json.gz, .csv, .parquet, etc.
3. load_from_disk can only load datasets saved by save_to_disk (in arrow format), which can be viewed as a way to load a "preprocessed" dataset.
4. One can set streaming=True for load_dataset, even if the dataset is stored in a local directory.

One might think load_from_disk is better because of point 3 above; however, to preprocess the huge dataset and call save_to_disk, one needs to load it in memory in the first place.

For all the reasons listed above, let's not use load_from_disk, which assumes preprocessed data in arrow format. Let's use load_dataset, which supports common data formats, and set streaming=True for large datasets, no matter whether they come from HF or from local disk.

P.S.: Converted the local data from arrow to json, while keeping the same data (first 45,000 entries of c4). c4 is now available to run large scale experiments. Performance verified.