-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dataloader updates #69
Conversation
besides doc strings, we should write a github issue on this talking about the change. I will create one and link it back here. |
I'm not finding either of these changes in the diff, searching for "pop" or "del" |
@nairbv Davis would need to update the PR description. But in short, we found that none of the things we discussed/suspected mattered, all of those made zero difference on the perf. We did find the root cause and fixed it. I am writing a github issue on it and will share. |
My bad, not sure how those didn't make it in properly. Added! |
@daviswer can you check the test failures? |
Turns out that the way we seeded the LCG destroys the bijectivity of the mapping. Removed for now, I'll try and figure out something else to prevent every shuffle from being the same. |
LCG is now properly stateful, as the stateless version was producing cyclical sawtooth patterns, and properly seeded and randomized across workers. The tradeoff is that we're now back to workers owning contiguous document ranges within shard files. |
Add tempfile-based dataloader unit tests from closed repo.
Update StreamingDocDataset and PreloadBufferDataset for hygiene (no more repeated popping from long list, explicitly
del
expired readers)Compress StreamingDocDataset's massive docset data structure into list of dataset/shard/(docid ranges)
Implement LCG-based random mapping as a pseudo-shuffle (since we're no longer materializing an ordered list of docids)