Remove dataloader countfile dependency #100

daviswer · 2024-07-26T20:06:37Z

The original version of the dataloader uses a metadata file of documents per file shard (the countfile) to determine fractional file ownership without loading any of the actual files. This was important when we were pulling files from a mounted COS bucket, where opening a file would trigger a multi-GB download. However, having to generate a new countfile for every dataset we support is... annoying, especially as we move to upstream this and support a wider variety of input formats.

This PR changes the countfile logic to search for the countfile first; if it does not exist, each worker counts documents manually, touching only the shard files that it owns. For datasets on disk this actually greatly accelerates setup (.046s to .0033s for 3.7Tb fineweb on vela, lol), as we no longer have to iterate through each line of the countfile, and length metadata is easily available for pyarrow/parquet/etc. formats. In the future, we may be able to remove the countfile logic entirely.
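
For readers skimming the thread, the fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `load_doc_counts`, `count_lines`, the `counts.csv` name and its `shard,count` layout are all hypothetical, and ownership is simplified to whole shards assigned round-robin rather than the fractional ownership the dataloader actually uses.

```python
import csv
import os
from typing import Callable, Dict, List


def count_lines(path: str) -> int:
    """Hypothetical counter: treats each line of a shard as one document."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def load_doc_counts(
    data_dir: str,
    shards: List[str],
    rank: int,
    world_size: int,
    countfile_name: str = "counts.csv",  # assumed name and layout, not from the PR
    count_docs: Callable[[str], int] = count_lines,
) -> Dict[str, int]:
    """Return document counts for the shards owned by this worker."""
    # Simplification: whole-shard round-robin ownership.
    owned = shards[rank::world_size]
    countfile = os.path.join(data_dir, countfile_name)
    if os.path.exists(countfile):
        # Backward-compatible path: read counts without opening any shard.
        with open(countfile, newline="") as f:
            all_counts = {row[0]: int(row[1]) for row in csv.reader(f) if row}
        return {s: all_counts[s] for s in owned}
    # Fallback path: count manually, touching only the shards this worker owns.
    return {s: count_docs(os.path.join(data_dir, s)) for s in owned}
```

For formats that store length metadata, `count_docs` could be swapped for a function that reads that metadata instead of scanning the file, which is where the setup speedup would come from.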

Preserves backward compatibility. Builds on #99

add 3b config, replace tele configs (for eventual back-porting of non-tele stuff)

nairbv · 2024-07-29T14:12:53Z

fms_fsdp/utils/dataloader_utils.py

-    Preprocess_Dataset,
-    Sampling_Dataset,
-    Scalable_Shard_Dataset,
+    BufferDataset,


might be better to do the reformat in a separate PR to make this particular logic change easier to find/read

Oh yeah this is really only meant to be reviewed in context of #99. I can open a new PR without this dependency, but it'll introduce conflicts later on, and #99 is a large enough set of code changes that I don't want to add that possible complication

lchu6 · 2024-07-29T19:13:58Z

I think we should fully test this PR for regression accuracy (remove count file + use this PR vs. having count file w/o this PR), and then completely remove the count file dependency, i.e. instead of "use count file when it exists", we just always assume no count file.

I don't think any of our data pipelines generate a count file automatically (we always had to ask for it), and other data sources do not have this file either.

lchu6 · 2024-07-29T19:16:01Z

@daviswer so if we are to compare:
a. this PR + remove count file
b. using count file

should both yield exactly the same order of data retrieval?

daviswer · 2024-07-29T20:34:19Z

Yes, data loading behavior should be identical in both cases. I left in the countfile option in case we ever need to support COS streaming again, but I could see it going either way.
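
One way to check the (a) vs. (b) comparison above is to iterate both loaders side by side and report the first position where they disagree. A minimal sketch, assuming each loader yields plain Python values (for example lists of token ids) that compare with `!=`; `first_mismatch` is a hypothetical helper, not part of this repo.

```python
from itertools import islice
from typing import Any, Iterable, Optional


def first_mismatch(a: Iterable[Any], b: Iterable[Any], n: int) -> Optional[int]:
    """Compare up to the first n items of two loaders.

    Returns the index of the first differing item, or None if none differ.
    """
    for i, (x, y) in enumerate(islice(zip(a, b), n)):
        if x != y:
            return i
    return None
```

Note that `zip` stops at the shorter input, so this only checks ordering over the common prefix; a difference in total length would need a separate check.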

daviswer and others added 24 commits July 22, 2024 15:33

Merge pull request #2 from foundation-model-stack/main

9e72876

add 3b config, replace tele configs (for eventual back-porting of non-tele stuff)

Create new data utils file

ffad2b9

Add v3 (incremental changes), test those, linting

c72692c

call setup when loading but haven't stepped yet

736e431

Remove redundant buggy tracking field

12679ba

Shift _len back to init

225879f

Get sampling probs after setup

301a530

Make setup properly conditional

f61a3db

Add setup to scalable, fix test_reload_epoch sampler call

4b504d1

setup in scalable, not yet wrapper

ee6d0c7

Make scalable and sampler wrappers

4687620

Restore delimiter over condition logic - lambda not picklable

bad408b

Type hints

e344cb7

Linting

02d1e72

Overwrite old dataset file, update tests and constructor

0be1dea

Update docs, cleanup

ae9fc29

Put sampling dataset earlier again?

217a147

Remove Weird_Casing

2b479fd

Move sampling back to end (sorry for the crazy diff)

90ec624

Begin testing

91b84f9

Quit blocking on missing countfile

6dba519

Time tracking non conditional

d96a78f

Build dataset synchronously for timing testing

10038d3

Remove timing stuff

afbb2dd

daviswer requested a review from lchu6 July 26, 2024 20:06

nairbv reviewed Jul 29, 2024


Correct impl of setup() in ckptdataset

e391de5

lchu6 merged commit 667f0fd into foundation-model-stack:main Jul 30, 2024
3 checks passed

daviswer deleted the data-no-countfile branch July 30, 2024 18:17
