Dataloader updates #69

daviswer · 2024-04-04T19:55:54Z

Add tempfile-based dataloader unit tests from closed repo.

Update StreamingDocDataset and PreloadBufferDataset for hygiene (no more repeated popping from long list, explicitly del expired readers)
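For readers unfamiliar with the two hygiene points, here is a minimal sketch of the general pattern, not the code in this PR. The names `iter_docs` and `open_reader` are hypothetical. It walks the shard list with a cursor instead of calling `pop(0)` (which shifts the whole remaining list each time) and drops each reader explicitly once it is exhausted.

```python
from typing import Callable, Iterable, Iterator, List


def iter_docs(
    shards: List[str], open_reader: Callable[[str], Iterable[str]]
) -> Iterator[str]:
    """Yield documents shard by shard (illustrative helper, not the PR's API)."""
    # Index into the list rather than popping from the front, so the list
    # is never mutated or shifted while we iterate.
    for i in range(len(shards)):
        reader = open_reader(shards[i])
        for doc in reader:
            yield doc
        # Drop our reference as soon as the shard is exhausted so that the
        # reader can be reclaimed before the next one is opened.
        del reader
```

With a dict standing in for shard files, `list(iter_docs(["a", "b"], fake.__getitem__))` returns the documents of `a` followed by those of `b`.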

Compress StreamingDocDataset's massive docset data structure into list of dataset/shard/(docid ranges)
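To make the compressed layout concrete, the sketch below stores one entry per contiguous docid range and resolves a global document index with a binary search over cumulative sizes. The class name `CompressedDocset`, the tuple layout, and the inclusive range convention are assumptions for illustration; the actual structure in the PR may differ.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple

# (dataset, shard file, first docid, last docid inclusive) -- assumed layout
DocRange = Tuple[str, str, int, int]


class CompressedDocset:
    """Hypothetical range-based docset: memory scales with ranges, not docs."""

    def __init__(self, ranges: List[DocRange]):
        self.ranges = ranges
        # offsets[i] is the total number of docs in ranges[0..i].
        self.offsets = list(accumulate(hi - lo + 1 for _, _, lo, hi in ranges))

    def __len__(self) -> int:
        return self.offsets[-1] if self.offsets else 0

    def lookup(self, index: int) -> Tuple[str, str, int]:
        """Map a global doc index to (dataset, shard, docid)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First range whose cumulative size exceeds the index.
        r = bisect_right(self.offsets, index)
        dataset, shard, lo, _ = self.ranges[r]
        before = self.offsets[r - 1] if r else 0
        return dataset, shard, lo + (index - before)
```

The trade-off is that there is no longer an explicit list of docids to permute, which is why a separate index mapping is needed for shuffling.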

Implement LCG-based random mapping as a pseudo-shuffle (since we're no longer materializing an ordered list of docids)
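As a rough illustration of what an LCG-style index mapping can look like, the sketch below applies the LCG step formula directly to the document index. The function name `affine_shuffle` and the constants are assumptions, not the values used in this PR. The mapping is a bijection on `[0, n)` whenever the multiplier shares no factor with `n`.

```python
from math import gcd


def affine_shuffle(i: int, n: int, a: int = 1_103_515_245, c: int = 12_345) -> int:
    """Map index i in [0, n) to a pseudo-shuffled index in [0, n), for n >= 1."""
    # Bump the multiplier until it is coprime with n; that is exactly the
    # condition under which i -> (a*i + c) % n hits every value once.
    while gcd(a, n) != 1:
        a += 1
    return (a * i + c) % n
```

No permutation is materialized, so memory stays constant. Consecutive inputs land a fixed stride apart, though, so the result is much more regular than a true shuffle.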

lchu6 · 2024-04-05T20:03:05Z

Besides docstrings, we should write a GitHub issue describing the change.

I will create one and link it back here.

nairbv · 2024-04-05T20:51:25Z

> (no more repeated popping from long list, explicitly del expired readers)

I'm not finding either of these changes in the diff, searching for "pop" or "del"

lchu6 · 2024-04-05T20:53:41Z

@nairbv Davis would need to update the PR description.

But in short, we found that none of the things we discussed or suspected mattered; they made zero difference to performance. We did find the root cause and fixed it. I am writing a GitHub issue on it and will share.

lchu6 · 2024-04-05T21:02:48Z

@nairbv a quick draft here: #70

daviswer · 2024-04-05T22:20:10Z

> (no more repeated popping from long list, explicitly del expired readers)
>
> I'm not finding either of these changes in the diff, searching for "pop" or "del"

My bad, not sure how those didn't make it in properly. Added!

lchu6 · 2024-04-08T02:15:03Z

@daviswer can you check the test failures?

daviswer · 2024-04-08T14:29:46Z

Turns out that the way we seeded the LCG destroys the bijectivity of the mapping. Removed for now, I'll try and figure out something else to prevent every shuffle from being the same.
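The thread does not say how the seed was mixed in, so the following is only one plausible illustration of how seeding can break bijectivity. For a power-of-two modulus, an LCG visits every value only if the increment is odd and the multiplier is congruent to 1 mod 4 (the Hull-Dobell conditions). Adding a seed to either constant can violate that. The helper `lcg_orbit` and the toy parameters are hypothetical.

```python
from typing import List


def lcg_orbit(a: int, c: int, m: int, start: int = 0) -> List[int]:
    """Values visited by x -> (a*x + c) % m before the first repeat."""
    visited = set()
    orbit = []
    x = start
    while x not in visited:
        visited.add(x)
        orbit.append(x)
        x = (a * x + c) % m
    return orbit


# a = 5, c = 3, m = 16 satisfies Hull-Dobell: all 16 values are visited.
full = lcg_orbit(5, 3, 16)
# Adding a "seed" of 1 to the increment makes it even: the cycle collapses.
broken = lcg_orbit(5, 3 + 1, 16)
```

Here `len(full)` is 16 while `broken` is `[0, 4, 8, 12]`, so most indices would never be reached.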

daviswer · 2024-04-08T22:50:14Z

LCG is now properly stateful, as the stateless version was producing cyclical sawtooth patterns, and properly seeded and randomized across workers. The tradeoff is that we're now back to workers owning contiguous document ranges within shard files.
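A minimal sketch of a stateful, per-worker-seeded LCG follows, under stated assumptions: the class name `StatefulLCG`, the `seed + rank` convention, and drawing the constants from `random.Random` are all illustrative rather than taken from the PR. The modulus is a power of two at least as large as the worker's document count, the constants are drawn so that the Hull-Dobell conditions always hold, and out-of-range states are skipped.

```python
import random


class StatefulLCG:
    """Hypothetical stateful LCG over [0, n); every n draws form a permutation."""

    def __init__(self, n: int, seed: int = 0):
        if n < 1:
            raise ValueError("n must be positive")
        self.n = n
        # Smallest power of two >= n, but at least 4.
        self.m = 1 << max(2, (n - 1).bit_length())
        rng = random.Random(seed)
        # Full period for m = 2^k needs a % 4 == 1 and c odd; the seed only
        # picks among constants that already satisfy both.
        self.a = 4 * rng.randrange(1, self.m) + 1
        self.c = 2 * rng.randrange(self.m) + 1
        self.state = rng.randrange(self.m)

    def next_index(self) -> int:
        # Advance the state, skipping values outside [0, n).
        while True:
            self.state = (self.a * self.state + self.c) % self.m
            if self.state < self.n:
                return self.state


# Each worker shuffles only its own contiguous range of n documents.
orders = []
for rank in range(3):
    gen = StatefulLCG(n=10, seed=1234 + rank)
    orders.append([gen.next_index() for _ in range(10)])
```

Because the seed never touches the structure of the constants, each worker's order stays a valid permutation of its range. Checkpointing only needs `state` alongside the constructor arguments.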

daviswer added 2 commits April 4, 2024 15:53

Compress StreamingDataset's docset via new class

8c75d83

Add (updated) unit tests from closed repo

87536e4

daviswer requested a review from lchu6 April 4, 2024 19:55

daviswer self-assigned this Apr 4, 2024

daviswer and others added 9 commits April 4, 2024 21:28

Typing fixes for new docset

a4ecd3b

Docset extends Sized

31eeb87

Reimpl repetitions

b89b2c7

Update name 'docset_' in valsplit test

ba9da6d

Remove train/val splitting

48caa2d

Merge docset, no more separate class

94e06f8

Integrate LCG shuffling - no separate state or class

ddd7b5c

Black/isort

993033f

Allow seed to affect LCG shuffle

f9565ff

lchu6 marked this pull request as ready for review April 5, 2024 20:02

lchu6 marked this pull request as draft April 5, 2024 20:03

daviswer added 2 commits April 5, 2024 16:04

Cleanup/update docstrings

063ef3c

Blacking

44799a2

daviswer marked this pull request as ready for review April 5, 2024 20:04

Fix get_docid calls

bd2eadb

daviswer added 2 commits April 5, 2024 17:47

Expand / update LCG configs

36e6ad6

Pull in del reader and no pop fixes

b3e818f

Add shard size to LCG seed, reshuffles file to file

1c3edce

Remove seeding from LCG - destroys bijectivity

7f134b9

daviswer added 2 commits April 8, 2024 18:37

Switch to stateful LCG, flexible parameterization

1436224

Seed LCG differently across workers

fc9ce49

lchu6 approved these changes Apr 10, 2024


lchu6 merged commit 3458bd6 into main Apr 10, 2024
3 checks passed

daviswer deleted the dataloader_updates branch April 10, 2024 17:09
