Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added file_lock in IoPathSaverIterDataPipe #395

Closed
wants to merge 0 commits into from
Closed

Added file_lock in IoPathSaverIterDataPipe #395

wants to merge 0 commits into from

Conversation

msaroufim
Copy link
Member

@msaroufim msaroufim commented May 10, 2022

Untested, made this PR to make communication easier

Partially fixes #144

The main change is to lock a file when multiple processes are writing to the same file

The fix was upstreamed from an internal fix @hudeven made

with iopath.file_lock(filepath):
    if not os.path.exists(filepath):
        with self.pathmgr.open(filepath, self.mode) as f:
            f.write(data)

And the test was extended from a gist @ejguan sent me

It does seem to be working as intended right now, @parmeet please let me know if this works for you

Logs with fix

Ran 17 tests in 3.982s

OK (skipped=1)

Logs without fix

Traceback (most recent call last):
  File "/home/ubuntu/data/test/test_local_io.py", line 638, in test_io_path_saver_iterdatapipe
    self.assertEqual(expected_paths, res_file_paths)
AssertionError: Lists differ: ['/tmp/tmpbgt32c3h/1.txt', '/tmp/tmpbgt32c3h/2.txt', '/tmp/tmpbgt32c3h/3.txt'] != ['/tmp/tmpbgt32c3h/3.txt']

First differing element 0:
'/tmp/tmpbgt32c3h/1.txt'
'/tmp/tmpbgt32c3h/3.txt'

First list contains 2 additional elements.
First extra element 1:
'/tmp/tmpbgt32c3h/2.txt'

- ['/tmp/tmpbgt32c3h/1.txt', '/tmp/tmpbgt32c3h/2.txt', '/tmp/tmpbgt32c3h/3.txt']
+ ['/tmp/tmpbgt32c3h/3.txt']


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 10, 2022
@msaroufim msaroufim requested review from ejguan and NivekT May 10, 2022 23:21
@msaroufim msaroufim marked this pull request as draft May 11, 2022 00:06
@msaroufim
Copy link
Member Author

msaroufim commented May 11, 2022

As far as the test goes it seems like the IoPathSaver is tested here only https://github.com/pytorch/data/blob/main/test/test_local_io.py#L635

So will extend this test to multiple processes and see how CI works on Windows and Linux

@parmeet do you have some existing test/script I can leverage? Just wanna make sure your usecase is solved

@parmeet
Copy link
Contributor

parmeet commented May 11, 2022

do you have some existing test/script I can leverage? Just wanna make sure your usecase is solved

Unfortunately, no. we observed this issue when running DDP while training doc classification workflow using TorchRecipes internally. cc: @hudeven

@msaroufim msaroufim requested a review from NivekT May 12, 2022 01:54
@msaroufim msaroufim marked this pull request as ready for review May 12, 2022 01:58
@VitalyFedyunin
Copy link
Contributor

It does not fixes #144 as on_disk_cache using

class SaverIterDataPipe(IterDataPipe[str]):
to save files.

Also you probably need to cross lock writing with

def _cache_check_fn(data, filepath_fn, hash_dict, hash_type, extra_check_fn):

@ejguan please correct me if I'm wrong.

@VitalyFedyunin
Copy link
Contributor

Side comment: I'm ok to have portalocker as hard dependency.

@ejguan
Copy link
Contributor

ejguan commented May 12, 2022

It does not fixes #144 as on_disk_cache using

class SaverIterDataPipe(IterDataPipe[str]):

to save files.
Also you probably need to cross lock writing with

def _cache_check_fn(data, filepath_fn, hash_dict, hash_type, extra_check_fn):

@ejguan please correct me if I'm wrong.

Completely correct. We are using this iopath file to test portalocker. We can add another PR to fix Saver and CacheHolder

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one comment on test.

test/test_local_io.py Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix the followings before landing? Thank you!

test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
@msaroufim msaroufim requested a review from ejguan May 13, 2022 20:29
@facebook-github-bot
Copy link
Contributor

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix lint as well in this file? Really appreciate.

test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
@msaroufim
Copy link
Member Author

Note to self - there's now a genuine failure since I refactored self.fn to be a global function in the test class, as far as this saver goes, I need to make sure this gets fixed

    def __iter__(self) -> Iterator[str]:
        for filepath, data in self.source_datapipe:
            if self.fn is not None:
>               filepath = self.fn(filepath)
E               TypeError: filepath_fn() missing 1 required positional argument: 'name'
E               This exception is thrown by __iter__ of SaverIterDataPipe()

test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
test/test_local_io.py Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Copy link
Contributor

ejguan commented May 17, 2022

I think you might need to rebase onto main branch again. There are lots of changes that are not supposed to be in this PR.

@msaroufim msaroufim closed this May 17, 2022
facebook-github-bot pushed a commit that referenced this pull request May 17, 2022
Summary:
Please read through our [contribution guide](https://github.com/pytorch/data/blob/main/CONTRIBUTING.md) prior to
creating your pull request.

- Note that there is a section on requirements related to adding a new DataPipe.

Fixes #397

This is an identical PR to #395 which I accidentally deleted during a bad force push trying to resolve a rebase

Pull Request resolved: #413

Reviewed By: ejguan

Differential Revision: D36433231

Pulled By: msaroufim

fbshipit-source-id: 0f71790eaded445f8e42911db4bea4c468175520
@andrewkho andrewkho deleted the mpcache branch September 26, 2024 04:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multiprocessing with any DataPipe writing to local file
6 participants