Multiprocessing with any DataPipe writing to local file #144
Comments
cc: @Nayef211
Thanks @ejguan for creating the issue. This is blocking distributed training for torchtext training recipes, which use torchdata-backed datasets. I will follow the issue and hopefully we can get a fix for this soon!
One option to support backward compatibility (new DataPipe with old DataLoader) is to modify [...]. Edit: Then, we don't need to wait until DataLoader2 comes out.
This seems like a promising approach. Do you guys have any plans to work on this in the foreseeable future? I can also try to contribute once I have more bandwidth!
Feel free to grab it. I might not have time to add it in the next couple of weeks.
Sorry that I haven't found time to work on this. I'm stuck with tasks for DataLoader2.
It seems like we need something within DataLoader to handle all these locks across writers of different DataPipes? I think wrapping a mutex around a single [...]
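For concreteness, here is a minimal sketch of what that kind of locking could look like, assuming a hypothetical cache-write helper guarded by a cross-process file lock (the `filelock` package and all names below are illustrative, not part of torchdata):

```python
# Sketch only: guard the cache write with a cross-process file lock so that
# concurrent DataLoader workers never write the same local file at once.
# `filelock` is used purely for illustration; `write_cache_once` is hypothetical.
import os
from filelock import FileLock

def write_cache_once(target_path: str, payload: bytes) -> None:
    lock_path = target_path + ".lock"
    with FileLock(lock_path):            # only one process enters at a time
        if os.path.exists(target_path):  # another worker already cached it
            return
        tmp_path = target_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)
        os.replace(tmp_path, target_path)  # atomic rename on the same filesystem
```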
Another solution would be letting DataLoader make sure this kind of DataPipe is non-shardable. Then, there would be only a single instance across processes, and the result of this DataPipe would be sent to the other processes. @parmeet, @Nayef211 and @hudeven, as a workaround to prevent DataLoader2 or the mutex work from blocking the users of TorchText, could we add extra logic in each Dataset's function to directly download the dataset to bypass [...]
@hudeven is looking into the solution in this diff: D35459528. Basically it uses io_path's file locking mechanism, which internally depends on portalocker. Would this be a viable cross-platform solution?
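For reference, a rough sketch of a portalocker-based "download once" guard is below. This is an assumption about the general approach, not the actual code in D35459528; `fetch` and the lock-file naming are hypothetical.

```python
# Hedged sketch: serialize the download step across processes with portalocker.
# All names here are illustrative and not taken from D35459528.
import os
import portalocker

def download_once(url: str, dest: str, fetch) -> str:
    # `fetch` is a hypothetical callable returning the file's bytes for `url`.
    with portalocker.Lock(dest + ".lock", timeout=600):
        if not os.path.exists(dest):     # another process may have finished first
            data = fetch(url)
            with open(dest, "wb") as f:
                f.write(data)
    return dest
```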
Sorry, I don't fully follow this. Since each process is instantiating and executing the dataset, I guess the problem of multiple processes writing to the same file would still be there even if we do a direct download, right? cc: @NicolasHug Wondering if you are also running into this issue with DDP and vision datasets when multiple processes try to write to the same cache file?
Thanks for the ping Parmeet. I haven't encountered this issue thus far, because torchvision datasets do not write anything to disk. |
I think if we can incorporate that into the [...]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
… twice" Fixes #144 [ghstack-poisoned]
…ache downloading twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
…ache downloading twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
…ache downloading twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
🐛 Describe the bug
We need to take extra care with any DataPipe that writes to the file system when DataLoader2 triggers multiprocessing. If the file name on the local file system is the same across multiple processes, there is a race condition.
This was found when the TorchText team was using `on_disk_cache` to cache files. DataLoader needs to know whether such a DataPipe can be sharded with multiprocessing, or it should enforce it into a single process.
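A rough sketch of the pattern that triggers the race is shown below, following the usual `on_disk_cache` / `end_caching` idiom as I understand it from the torchdata examples; the URL and paths are placeholders.

```python
# Sketch of the racy setup: every DataLoader worker builds its own copy of the
# pipeline, so each worker tries to download and write the same cache file.
import os
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import HttpReader, IterableWrapper

URL = "https://example.com/dataset.tar.gz"   # placeholder
CACHE_DIR = "/tmp/datapipe_cache"

dp = IterableWrapper([URL])
dp = dp.on_disk_cache(filepath_fn=lambda url: os.path.join(CACHE_DIR, os.path.basename(url)))
dp = HttpReader(dp)
dp = dp.end_caching(mode="wb", same_filepath_fn=True)

# With num_workers > 0, multiple worker processes race to write the same local file.
loader = DataLoader(dp, num_workers=2)
```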
As a workaround, users have to download the file to the local file system beforehand to prevent writing from within the DataPipe (see the sketch below).
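A sketch of that workaround, assuming the file is fetched once up front (in the main process, or only on rank 0 under DDP) before the pipeline is built; the URL and paths are placeholders.

```python
# Workaround sketch: download the file once before the DataPipe graph is built,
# so no worker process ever writes to disk.
import os
import urllib.request
from torchdata.datapipes.iter import FileOpener, IterableWrapper

URL = "https://example.com/dataset.tar.gz"   # placeholder
LOCAL_PATH = "/tmp/dataset.tar.gz"

if not os.path.exists(LOCAL_PATH):           # run in a single process only
    urllib.request.urlretrieve(URL, LOCAL_PATH)

# The pipeline now only reads from the pre-downloaded local file.
dp = FileOpener(IterableWrapper([LOCAL_PATH]), mode="rb")
```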
Versions
main branch