Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocessing with any DataPipe writing to local file #144

Closed
ejguan opened this issue Dec 18, 2021 · 13 comments
Closed

Multiprocessing with any DataPipe writing to local file #144

ejguan opened this issue Dec 18, 2021 · 13 comments
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed high priority

Comments

@ejguan
Copy link
Contributor

ejguan commented Dec 18, 2021

🐛 Describe the bug

We need to take extra care all DataPipe that would write to file system when DataLoader2 triggered multiprocessing. If the file name on the local file system is same across multiple processes, it would be a racing condition.
This is found when TorchText team is using on_disk_cache to cache file.
DataLoader needs to know such DataPipe must be sharded with multiprocessing or enforce it into single process.

As a workaround, users have to download the file to local file system to prevent writing within DataPipe.

Versions

main branch

@ejguan ejguan added the bug Something isn't working label Dec 20, 2021
@ejguan
Copy link
Contributor Author

ejguan commented Dec 20, 2021

cc: @Nayef211

@Nayef211
Copy link

Thanks @ejguan for creating the issue. This is blocking distributed training for torchtext training recipes which uses torchdata backed dataset. I will follow the issue and hopefully we can get a fix for this soon!

@ejguan
Copy link
Contributor Author

ejguan commented Jan 13, 2022

One option to support backward compatibility (new DataPipe with old DataLoader) is to modify Saver with a mutex or wait until a process has finished writing to the file.

Edit: Then, we don't need to wait until DataLoader2 coming out.

@Nayef211
Copy link

One option to support backward compatibility (new DataPipe with old DataLoader) is to modify Saver with a mutex or wait until a process has finished writing to the file.

Edit: Then, we don't need to wait until DataLoader2 coming out.

This seems like a promising approach. Do you guys have any plans to work on this in the foreseeable future? I can also try to contribute once I have more bandwidth!

@ejguan
Copy link
Contributor Author

ejguan commented Jan 25, 2022

One option to support backward compatibility (new DataPipe with old DataLoader) is to modify Saver with a mutex or wait until a process has finished writing to the file.
Edit: Then, we don't need to wait until DataLoader2 coming out.

This seems like a promising approach. Do you guys have any plans to work on this in the foreseeable future? I can also try to contribute once I have more bandwidth!

Feel free to grab it. I might not have time to add it in the next a couple of weeks.

@parmeet
Copy link
Contributor

parmeet commented Apr 6, 2022

@ejguan I wonder if there is someone looking at this issue and trying to implement mutex around saver datapipe?
cc: @hudeven

@ejguan
Copy link
Contributor Author

ejguan commented Apr 6, 2022

Sorry that I haven't found time to work on this. Stuck with tasks for DataLoader2

@NivekT
Copy link
Contributor

NivekT commented Apr 8, 2022

One option to support backward compatibility (new DataPipe with old DataLoader) is to modify Saver with a mutex or wait until a process has finished writing to the file.

Edit: Then, we don't need to wait until DataLoader2 coming out.

It seems like we need something within DataLoader to handle all these locks across writers of different DataPipes? I think wrapping a mutex around a single Saver will not suffice since another Saver may still try to access the same file for writing. Is this correct?

@ejguan
Copy link
Contributor Author

ejguan commented Apr 11, 2022

A mutex shared across processes would be sufficient in my mind. But, I don't know if there is a cross-platform solution for it.

Another solution would be letting Dataloader to make sure this kind of DataPipe is non-shardable. Then, there will be only a single instance across processes. And, the result of this DataPipe would be sent to other processes.

@parmeet, @Nayef211 and @hudeven , as a work around to prevent DataLoader2 or mutex blocking the users of TorchText, could we add an extra logic in each Dataset's function to directly download the dataset to bypass Saver for now.

@parmeet
Copy link
Contributor

parmeet commented Apr 11, 2022

A mutex shared across processes would be sufficient in my mind. But, I don't know if there is a cross-platform solution for it.

@hudeven is looking into the solution in this diff D35459528. Basically it is using io_path's file locking mechanism which internally depends on portalocker. would this be a viable cross-platform solution?

as a work around to prevent DataLoader2 or mutex blocking the users of TorchText, could we add an extra logic in each Dataset's function to directly download the dataset to bypass Saver for now.

Sorry, I don't fully follow this. Since each process is instantiating and executing the dataset, I guess the problem of multiple processes writing to the same file would still be there if if we do direct download right?

cc: @NicolasHug Wondering if you are also encountering into this issue with DDP and vision datasets when multiple processes trying to write to the same cache file?

@NicolasHug
Copy link
Member

Thanks for the ping Parmeet. I haven't encountered this issue thus far, because torchvision datasets do not write anything to disk.

@NivekT
Copy link
Contributor

NivekT commented Apr 12, 2022

A mutex shared across processes would be sufficient in my mind. But, I don't know if there is a cross-platform solution for it.

@hudeven is looking into the solution in this diff D35459528. Basically it is using io_path's file locking mechanism which internally depends on portalocker. would this be a viable cross-platform solution?

I think if we can incorporate that into the IoPathSaver DataPipe, it should be a viable cross-platform solution, but it would mean users have to install iopath and portalocker if they wish to lock files across processes.

@NivekT NivekT added good first issue Good for newcomers help wanted Extra attention is needed labels May 6, 2022
@NivekT
Copy link
Contributor

NivekT commented May 17, 2022

I think this is fixed by #395 and #409 (the latter hasn't landed)

Are we missing anything else when those PRs both land?

VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 17, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
…ache downloading twice"


Fixes #144

Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
…ache downloading twice"


Fixes #144

Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this issue May 18, 2022
VitalyFedyunin added a commit that referenced this issue May 18, 2022
…ache downloading twice"


Fixes #144

Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060)

[ghstack-poisoned]
VitalyFedyunin added a commit that referenced this issue May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers help wanted Extra attention is needed high priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants