Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding lock mechanism to prevent on_disk_cache downloading twice #409

Closed

Conversation

VitalyFedyunin
Copy link
Contributor

@VitalyFedyunin VitalyFedyunin commented May 16, 2022

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: 2fb276289021ba8d47f60bc7480f6bddcc58b0fd
Pull Request resolved: #409
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 16, 2022
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: dd94ca8f4abac335e1c7ab72fa512660c333c7e8
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: ae6dbd830f2217512f05757df0b062cc88b3a223
Pull Request resolved: #409
@VitalyFedyunin VitalyFedyunin requested a review from ejguan May 16, 2022 19:28
@VitalyFedyunin
Copy link
Contributor Author

Doing cleanup now, but principle had to remain. The main problem is the fact that we had to put 'promise' files to avoid downloading twice.

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: f1b192bc909639bff7f7ed39ab23a0db1ee0f47b
Pull Request resolved: #409
Copy link
Contributor

@NivekT NivekT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add a dependency to portalocker. I think our options are:

  1. Soft dependency - only use the locking mechanism when portalocker is installed, but otherwise the DataPipes here still work without it
  2. Semi-soft (?) - raise warnings/error when these DataPipes are used but portalocker isn't used. Maybe in DLv2 when mp is enabled (potentially over-engineering)?
  3. Hard dependency - always use it and add it to our list of dependencies

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: de27d8c1e9cca8a6d799aa0a16044ecb48f73fdc
Pull Request resolved: #409
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!!

@VitalyFedyunin
Copy link
Contributor Author

Added as hard dependency.

@VitalyFedyunin
Copy link
Contributor Author

TODOs:
[ ] Remove lambda for full tests coverage
[ ] Add timeout to make sure we exit after waiting promise for two long (default 2 min), as it could happen when filename_fns are different

VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: a0cda6310f51846559d674ed6862fac2c4be349a
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 16, 2022
ghstack-source-id: acfa64d27bc3f3ef3b0c6bbfc3293076deaf3dc1
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
ghstack-source-id: 03fadd5333a104a6ab6843f75ccee6590fd52c0c
Pull Request resolved: #409
@VitalyFedyunin
Copy link
Contributor Author

Had to add promise concept, to make sure that workers waits 'main' worker to finish downloading/unpacking. To avoid DataLoader starvation we also fetch all file names into infinite buffer.

@VitalyFedyunin VitalyFedyunin requested a review from ejguan May 17, 2022 20:22
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
ghstack-source-id: f790c95b8e78fc8de143c520911cb2c244b485b9
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 17, 2022
ghstack-source-id: 466cfb00b831f87105675e4d3b55b7d299ac5c65
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 2737ef4a4f237a2ed872d4e435d2f65f0905ead9
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: d1316874c2148aa5c8c2c9dcb6f831c0fd3adf7e
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: b73c4862c1f27e5ebe768f56f612610a83c5d59b
Pull Request resolved: #409
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you soooo much for adding this fix. I do have one comment on 1-to-n scenario.

Comment on lines +630 to +631
self.assertEqual(2, len(all_files))
self.assertEqual("str", result[0][1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to verify, len(result) should be 1, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. I'm creating one additional file inside of _slow_fn. So it would be 'downloaded' file and 'pid' file.

Comment on lines +222 to +230
file_exists = len(data) > 0
if not file_exists:
result = False
promise_fh.seek(0)
promise_fh.write("[dataloader session uid]")
promise_fh.truncate()
promise_fh.flush()

return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the cached op is 1-to-n like decompression from archive, if any decompressed file is missing or has incorrect hash, we can directly return False and no need to check other files IMHO.
There can be a chance that multiple processes are locking different decompressed files for an archive. Then, both processes will run decompression -> racing condition again.

So, I think we should lock over data but not filepaths (data represents the compressed archive in this case). For the process that observes promise file over data, they can directly return True.

WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately data could be url or something else, it is hard to lock on it.

But this situation is covered.

Imagine data generates two file namesL file1 file2.

Initial pass (empty FS) will add two locks file1.promise and file2.promise and will go 'False' route.

Now second (and every next) pass will see that files are missing, but will fail to create promise and go into the 'file exists' route, which will led them to the situation when they are waiting for file1.promise and file2.promise to disappear.

Copy link
Contributor

@NivekT NivekT May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is an URL, is it possible to create and lock root/URL.promise in the file system?

I think we should have a similar lock for HttpReader to prevent multiple processes from downloading the same file? Nevermind

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. file_exists flag is used for processes to recognize this file or parent archives are going to be processed by another process.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 4c90294442e1d57a42e3043216248f1f2c8e6c6f
Pull Request resolved: #409
setup.py Outdated Show resolved Hide resolved
Comment on lines +222 to +230
file_exists = len(data) > 0
if not file_exists:
result = False
promise_fh.seek(0)
promise_fh.write("[dataloader session uid]")
promise_fh.truncate()
promise_fh.flush()

return result
Copy link
Contributor

@NivekT NivekT May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is an URL, is it possible to create and lock root/URL.promise in the file system?

I think we should have a similar lock for HttpReader to prevent multiple processes from downloading the same file? Nevermind

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 18c6117332ad55da123609de06b4c6a2c0a2aa0f
Pull Request resolved: #409
Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your explanation. LGTM

Comment on lines +222 to +230
file_exists = len(data) > 0
if not file_exists:
result = False
promise_fh.seek(0)
promise_fh.write("[dataloader session uid]")
promise_fh.truncate()
promise_fh.flush()

return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. file_exists flag is used for processes to recognize this file or parent archives are going to be processed by another process.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 72913b5a58a7c1bb57ca83e91a7e3a8725a95ed1
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 547d120247591515cc5d8be8e84ed104451392a2
Pull Request resolved: #409
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: 1f96fb90ef977d8e9b64b46cdbbe522d2bab1c10
Pull Request resolved: #409
VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: bfc8be4d8fb28cb942b510b1dbf3044820383efe
Pull Request resolved: #409
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

VitalyFedyunin added a commit that referenced this pull request May 18, 2022
ghstack-source-id: d0934ff494a2f5d49b401af1e6a79aefb53e2319
Pull Request resolved: #409
@VitalyFedyunin
Copy link
Contributor Author

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants