Dataloader updates and streamlining #99

daviswer · 2024-07-25T20:32:44Z

A number of cleanup/streamlining changes with no behavioral impact at the training level. This is mostly preparation for landing in torchdata and upcoming features, particularly support for n_workers > 1.

- Modify instantiation so that all path- and rank-dependent setup in a layer is deferred to a new recursive setup function, which executes immediately before saving, loading, or stepping. This allows rank and path to be modified after instantiation.
- SamplingDataset and ScalableShardDataset are now implemented as proper _WrapperDatasets (though they must still stack in the same order and location). This reflects the intended modular use of _WrapperDatasets, rather than treating these two as edge cases of _StatefulDataset because they contain multiple sub-iterators. No more pass-through args!
- Remove the Weird_Separated_Camel_Case naming convention in favor of ProperClassNaming.
- Update unit tests to reflect the above changes.
- Update the dataloader builder to reflect the more modular sampling and scalable datasets.
- Update PreloadBufferDataset so that when the buffer is too large (i.e. after rescaling to a smaller number of workers) it shrinks down to the desired size over time, rather than staying oversized.
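
For readers unfamiliar with the deferred-setup pattern in the first item, here is a minimal sketch of the idea. It is not the code in this PR: `LeafDataset`, `WrapperDataset`, `shard_path`, and `is_setup` are hypothetical names chosen for illustration. Constructors only record arguments, and each layer's `setup()` does its own rank/path-dependent work once and then recurses into the layer it wraps.

```python
class LeafDataset:
    """Hypothetical innermost layer; stands in for a stateful dataset."""

    def __init__(self, datapath, rank, worldsize):
        # Only record arguments; no rank- or path-dependent work happens here.
        self.datapath = datapath
        self.rank = rank
        self.worldsize = worldsize
        self.is_setup = False
        self.shard_path = None

    def setup(self):
        # Run the deferred work at most once.
        if not self.is_setup:
            self.is_setup = True
            self.shard_path = f"{self.datapath}/shard_{self.rank}_of_{self.worldsize}"

    def __iter__(self):
        self.setup()  # executes immediately before stepping
        for i in range(3):
            yield f"{self.shard_path}:{i}"


class WrapperDataset:
    """Hypothetical wrapper layer; setup recurses into the wrapped dataset."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.is_setup = False

    def setup(self):
        if not self.is_setup:
            self.is_setup = True
            self.dataset.setup()  # recursive call down the stack

    def __iter__(self):
        self.setup()
        yield from self.dataset


leaf = LeafDataset("/data", rank=0, worldsize=1)
pipeline = WrapperDataset(WrapperDataset(leaf))

# Rank and world size can still change here, because nothing has been set up yet.
leaf.rank, leaf.worldsize = 3, 8
first = next(iter(pipeline))  # "/data/shard_3_of_8:0"
```

Because the wrappers hold no rank or path arguments of their own in this sketch, there is nothing to pass through the stack at construction time.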

Again, no user-facing behavior is changed; these are just preparatory updates for further additions. Backward compatibility with older checkpoints is maintained, and training code runs unaffected.
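
The PreloadBufferDataset item in the list above can be illustrated with a small sketch of one way a shuffle buffer might shrink gradually. This is an assumption-laden illustration, not the PR's implementation: `preload_buffer` is a hypothetical generator, it uses the standard-library `random` module, and it assumes an endless source and `window_size >= 1`. While the buffer is oversized, each step emits one item without pulling a replacement, so the buffer loses one item per step until it reaches the target size.

```python
import itertools
import random


def preload_buffer(source, window_size, buffer=None, seed=0):
    """Yield (item, buffer_size) pairs from a shuffle buffer of target size window_size.

    Hypothetical sketch: assumes `source` never runs out and window_size >= 1.
    """
    rng = random.Random(seed)
    buffer = list(buffer or [])
    source = iter(source)
    while True:
        # Undersized buffer: grow by one item per step.
        if len(buffer) < window_size:
            buffer.append(next(source))
        i = rng.randrange(len(buffer))
        out = buffer[i]
        if len(buffer) > window_size:
            # Oversized buffer: move the last item into the emptied slot and
            # do not pull a new one, so the buffer shrinks by one.
            buffer[i] = buffer[-1]
            buffer.pop()
        else:
            # Correctly sized buffer: replace the emitted item with a new one.
            buffer[i] = next(source)
        yield out, len(buffer)


# A buffer of 8 items restored into a loader that now wants only 4.
restored = list(range(8))
stream = preload_buffer(itertools.count(100), window_size=4, buffer=restored)
steps = list(itertools.islice(stream, 6))
sizes = [size for _, size in steps]  # [7, 6, 5, 4, 4, 4]
```

In this sketch no previously buffered item is discarded during the shrink; each one is still emitted exactly once.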

lchu6

LG.

regression test passed.

daviswer and others added 18 commits July 22, 2024 15:33

Merge pull request #2 from foundation-model-stack/main

9e72876

add 3b config, replace tele configs (for eventual back-porting of non-tele stuff)

Create new data utils file

ffad2b9

Add v3 (incremental changes), test those, linting

c72692c

call setup when loading but haven't stepped yet

736e431

Remove redundant buggy tracking field

12679ba

Shift _len back to init

225879f

Get sampling probs after setup

301a530

Make setup properly conditional

f61a3db

Add setup to scalable, fix test_reload_epoch sampler call

4b504d1

setup in scalable, not yet wrapper

ee6d0c7

Make scalable and sampler wrappers

4687620

Restore delimiter over condition logic - lambda not picklable

bad408b

Type hints

e344cb7

Linting

02d1e72

Overwrite old dataset file, update tests and constructor

0be1dea

Update docs, cleanup

ae9fc29

Put sampling dataset earlier again?

217a147

Remove Weird_Casing

2b479fd

daviswer requested a review from lchu6 July 25, 2024 20:32

Move sampling back to end (sorry for the crazy diff)

90ec624

daviswer mentioned this pull request Jul 26, 2024

Remove dataloader countfile dependency #100

Merged

daviswer added 3 commits July 29, 2024 12:32

Defer ckp loading to setup (post-rank/path adjustment)

5a1032e

Don't call bool

fd1bba1

Update ckpdataset test to setup() before check

77d46c6

This was referenced Jul 29, 2024

Support continued pretraining #101

Merged

Support on-the-fly tokenization and HF parquet datasets #102

Merged

lchu6 approved these changes Jul 30, 2024

lchu6 merged commit ba862d8 into foundation-model-stack:main Jul 30, 2024
3 checks passed

daviswer deleted the data-reorg branch July 30, 2024 18:17
