-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding lock mechanism to prevent on_disk_cache downloading twice #409
Adding lock mechanism to prevent on_disk_cache downloading twice #409
Conversation
[ghstack-poisoned]
ghstack-source-id: 2fb276289021ba8d47f60bc7480f6bddcc58b0fd Pull Request resolved: #409
… twice" [ghstack-poisoned]
ghstack-source-id: dd94ca8f4abac335e1c7ab72fa512660c333c7e8 Pull Request resolved: #409
… twice" [ghstack-poisoned]
ghstack-source-id: ae6dbd830f2217512f05757df0b062cc88b3a223 Pull Request resolved: #409
Doing cleanup now, but principle had to remain. The main problem is the fact that we had to put 'promise' files to avoid downloading twice. |
… twice" [ghstack-poisoned]
ghstack-source-id: f1b192bc909639bff7f7ed39ab23a0db1ee0f47b Pull Request resolved: #409
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to add a dependency to portalocker
. I think our options are:
- Soft dependency - only use the locking mechanism when
portalocker
is installed, but otherwise the DataPipes here still work without it - Semi-soft (?) - raise warnings/error when these DataPipes are used but
portalocker
isn't used. Maybe in DLv2 whenmp
is enabled (potentially over-engineering)? - Hard dependency - always use it and add it to our list of dependencies
… twice" [ghstack-poisoned]
ghstack-source-id: de27d8c1e9cca8a6d799aa0a16044ecb48f73fdc Pull Request resolved: #409
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you!!
Added as hard dependency. |
TODOs: |
… twice" [ghstack-poisoned]
ghstack-source-id: a0cda6310f51846559d674ed6862fac2c4be349a Pull Request resolved: #409
… twice" [ghstack-poisoned]
ghstack-source-id: acfa64d27bc3f3ef3b0c6bbfc3293076deaf3dc1 Pull Request resolved: #409
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 03fadd5333a104a6ab6843f75ccee6590fd52c0c Pull Request resolved: #409
Had to add promise concept, to make sure that workers waits 'main' worker to finish downloading/unpacking. To avoid DataLoader starvation we also fetch all file names into infinite buffer. |
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: f790c95b8e78fc8de143c520911cb2c244b485b9 Pull Request resolved: #409
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 466cfb00b831f87105675e4d3b55b7d299ac5c65 Pull Request resolved: #409
ghstack-source-id: 2737ef4a4f237a2ed872d4e435d2f65f0905ead9 Pull Request resolved: #409
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: d1316874c2148aa5c8c2c9dcb6f831c0fd3adf7e Pull Request resolved: #409
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: b73c4862c1f27e5ebe768f56f612610a83c5d59b Pull Request resolved: #409
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you soooo much for adding this fix. I do have one comment on 1-to-n scenario.
self.assertEqual(2, len(all_files)) | ||
self.assertEqual("str", result[0][1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just want to verify, len(result)
should be 1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. I'm creating one additional file inside of _slow_fn. So it would be 'downloaded' file and 'pid' file.
file_exists = len(data) > 0 | ||
if not file_exists: | ||
result = False | ||
promise_fh.seek(0) | ||
promise_fh.write("[dataloader session uid]") | ||
promise_fh.truncate() | ||
promise_fh.flush() | ||
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the cached op is 1-to-n like decompression from archive, if any decompressed file is missing or has incorrect hash, we can directly return False
and no need to check other files IMHO.
There can be a chance that multiple processes are locking different decompressed files for an archive. Then, both processes will run decompression -> racing condition again.
So, I think we should lock over data
but not filepaths
(data
represents the compressed archive in this case). For the process that observes promise
file over data, they can directly return True.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately data could be url or something else, it is hard to lock on it.
But this situation is covered.
Imagine data generates two file namesL file1 file2.
Initial pass (empty FS) will add two locks file1.promise and file2.promise and will go 'False' route.
Now second (and every next) pass will see that files are missing, but will fail to create promise and go into the 'file exists' route, which will led them to the situation when they are waiting for file1.promise and file2.promise to disappear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is an URL, is it possible to create and lock root/URL.promise
in the file system?
I think we should have a similar lock for NevermindHttpReader
to prevent multiple processes from downloading the same file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. file_exists
flag is used for processes to recognize this file or parent archives are going to be processed by another process.
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 4c90294442e1d57a42e3043216248f1f2c8e6c6f Pull Request resolved: #409
file_exists = len(data) > 0 | ||
if not file_exists: | ||
result = False | ||
promise_fh.seek(0) | ||
promise_fh.write("[dataloader session uid]") | ||
promise_fh.truncate() | ||
promise_fh.flush() | ||
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is an URL, is it possible to create and lock root/URL.promise
in the file system?
I think we should have a similar lock for NevermindHttpReader
to prevent multiple processes from downloading the same file?
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 18c6117332ad55da123609de06b4c6a2c0a2aa0f Pull Request resolved: #409
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation. LGTM
file_exists = len(data) > 0 | ||
if not file_exists: | ||
result = False | ||
promise_fh.seek(0) | ||
promise_fh.write("[dataloader session uid]") | ||
promise_fh.truncate() | ||
promise_fh.flush() | ||
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see. file_exists
flag is used for processes to recognize this file or parent archives are going to be processed by another process.
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 72913b5a58a7c1bb57ca83e91a7e3a8725a95ed1 Pull Request resolved: #409
… twice" Fixes #144 [ghstack-poisoned]
ghstack-source-id: 547d120247591515cc5d8be8e84ed104451392a2 Pull Request resolved: #409
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
ghstack-source-id: 1f96fb90ef977d8e9b64b46cdbbe522d2bab1c10 Pull Request resolved: #409
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
ghstack-source-id: bfc8be4d8fb28cb942b510b1dbf3044820383efe Pull Request resolved: #409
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
… twice" Fixes #144 Differential Revision: [D36489060](https://our.internmc.facebook.com/intern/diff/D36489060) [ghstack-poisoned]
ghstack-source-id: d0934ff494a2f5d49b401af1e6a79aefb53e2319 Pull Request resolved: #409
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Stack from ghstack:
Fixes #144
Differential Revision: D36489060