-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove dataloader countfile dependency #100
Remove dataloader countfile dependency #100
Conversation
add 3b config, replace tele configs (for eventual back-porting of non-tele stuff)
Preprocess_Dataset, | ||
Sampling_Dataset, | ||
Scalable_Shard_Dataset, | ||
BufferDataset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be better to do the reformat in a separate PR to make this particular logic change easier to find/read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should, fully tested this PR for regression accuracy (remove count file + use this PR vs. having count file w/o this PR), and completely remove the count file dependency. i.e. instead of "use count file when exist", we just always assume no count file. I don't think any of our data pipeline generate count file automatically (we always had to ask for it), and other data sources does not have this file either. |
@daviswer so if we are to compare: should both yields exactly same order of data retrieval ? |
Yes data loading behavior should be identical in both cases. I left in the countfile option in case we ever need to support COS streaming again but I could see it either way |
Original version of the dataloader uses a metadata file of documents per fileshard, to determine fractional file ownership without loading any of the actual files. This was important when we were pulling files from a mounted COS bucket, and opening the file would trigger a multi-GB file download. However, having to generate a new count file for every dataset we support is... annoying, especially as we move to upstream this and also support wider varieties of input formats.
This PR changes the countfile logic to instead search for the countfile, and if it does not exist, each worker will count up manually, by touching only the shard files that it owns. For datasets on disk this actually greatly accelerates setup (.046s setup to .0033s for 3.7Tb fineweb on vela, lol), as we no longer have to iterate through each line of the countfile, and length metadata is easily available for pyarrow/parquet/etc. formats. In future, we may be able to remove countfile logic entirely.
Preserves backward compatibility. Builds on #99