Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dataloader countfile dependency #100

Merged
merged 25 commits into from
Jul 30, 2024

Conversation

daviswer
Copy link
Collaborator

@daviswer daviswer commented Jul 26, 2024

Original version of the dataloader uses a metadata file of documents per fileshard, to determine fractional file ownership without loading any of the actual files. This was important when we were pulling files from a mounted COS bucket, and opening the file would trigger a multi-GB file download. However, having to generate a new count file for every dataset we support is... annoying, especially as we move to upstream this and also support wider varieties of input formats.

This PR changes the countfile logic to instead search for the countfile, and if it does not exist, each worker will count up manually, by touching only the shard files that it owns. For datasets on disk this actually greatly accelerates setup (.046s setup to .0033s for 3.7Tb fineweb on vela, lol), as we no longer have to iterate through each line of the countfile, and length metadata is easily available for pyarrow/parquet/etc. formats. In future, we may be able to remove countfile logic entirely.

Preserves backward compatibility. Builds on #99

@daviswer daviswer requested a review from lchu6 July 26, 2024 20:06
Preprocess_Dataset,
Sampling_Dataset,
Scalable_Shard_Dataset,
BufferDataset,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be better to do the reformat in a separate PR to make this particular logic change easier to find/read

Copy link
Collaborator Author

@daviswer daviswer Jul 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah this is really only meant to be reviewed in context of #99. I can open a new PR without this dependency, but it'll introduce conflicts later on, and #99 is a large enough set of code changes that I don't want to add that possible complication

@lchu6
Copy link
Contributor

lchu6 commented Jul 29, 2024

I think we should, fully tested this PR for regression accuracy (remove count file + use this PR vs. having count file w/o this PR), and completely remove the count file dependency. i.e. instead of "use count file when exist", we just always assume no count file.

I don't think any of our data pipeline generate count file automatically (we always had to ask for it), and other data sources does not have this file either.

@lchu6
Copy link
Contributor

lchu6 commented Jul 29, 2024

@daviswer so if we are to compare:
a. this PR + remove count file
b. using count file

should both yields exactly same order of data retrieval ?

@daviswer
Copy link
Collaborator Author

Yes data loading behavior should be identical in both cases. I left in the countfile option in case we ever need to support COS streaming again but I could see it either way

@lchu6 lchu6 merged commit 667f0fd into foundation-model-stack:main Jul 30, 2024
3 checks passed
@daviswer daviswer deleted the data-no-countfile branch July 30, 2024 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants