Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataloader updates #69

Merged
merged 20 commits into from
Apr 10, 2024
Merged

Dataloader updates #69

merged 20 commits into from
Apr 10, 2024

Conversation

daviswer
Copy link
Collaborator

@daviswer daviswer commented Apr 4, 2024

Add tempfile-based dataloader unit tests from closed repo.

Update StreamingDocDataset and PreloadBufferDataset for hygiene (no more repeated popping from long list, explicitly del expired readers)

Compress StreamingDocDataset's massive docset data structure into list of dataset/shard/(docid ranges)

Implement LCG-based random mapping as a pseudo-shuffle (since we're no longer materializing an ordered list of docids)

@daviswer daviswer requested a review from lchu6 April 4, 2024 19:55
@daviswer daviswer self-assigned this Apr 4, 2024
@lchu6 lchu6 marked this pull request as ready for review April 5, 2024 20:02
@lchu6
Copy link
Contributor

lchu6 commented Apr 5, 2024

besides doc strings, we should write a github issue on this talking about the change.

I will create one and link it back here.

@lchu6 lchu6 marked this pull request as draft April 5, 2024 20:03
@daviswer daviswer marked this pull request as ready for review April 5, 2024 20:04
@nairbv
Copy link
Contributor

nairbv commented Apr 5, 2024

(no more repeated popping from long list, explicitly del expired readers)

I'm not finding either of these changes in the diff, searching for "pop" or "del"

@lchu6
Copy link
Contributor

lchu6 commented Apr 5, 2024

@nairbv Davis would need to update the PR description.

But in short, we found that none of the things we discussed/suspected mattered, all of those made zero difference on the perf. We did find the root cause and fixed it. I am writing a github issue on it and will share.

@lchu6
Copy link
Contributor

lchu6 commented Apr 5, 2024

@nairbv a quick draft here: #70

@daviswer
Copy link
Collaborator Author

daviswer commented Apr 5, 2024

(no more repeated popping from long list, explicitly del expired readers)

I'm not finding either of these changes in the diff, searching for "pop" or "del"

My bad, not sure how those didn't make it in properly. Added!

@lchu6
Copy link
Contributor

lchu6 commented Apr 8, 2024

@daviswer can you check the test failures?

@daviswer
Copy link
Collaborator Author

daviswer commented Apr 8, 2024

Turns out that the way we seeded the LCG destroys the bijectivity of the mapping. Removed for now, I'll try and figure out something else to prevent every shuffle from being the same.

@daviswer
Copy link
Collaborator Author

daviswer commented Apr 8, 2024

LCG is now properly stateful, as the stateless version was producing cyclical sawtooth patterns, and properly seeded and randomized across workers. The tradeoff is that we're now back to workers owning contiguous document ranges within shard files.

@lchu6 lchu6 merged commit 3458bd6 into main Apr 10, 2024
3 checks passed
@daviswer daviswer deleted the dataloader_updates branch April 10, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants