
Caching results of the filter will result in inconsistent cache state #1735

Closed
VitalyFedyunin opened this issue May 20, 2022 · 5 comments · Fixed by #1737

Comments

@VitalyFedyunin
Contributor

Currently it blocks, but we just got lucky:

cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar().filter(_filter_fn)

Please change the order of filter and end_caching.
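
For reference, a minimal sketch of the two orderings under discussion, assuming the usual torchdata on-disk-cache pattern; root, the archive path, and _filter_fn are stand-ins for the dataset-specific pieces (the exact filepath_fn wiring varies per dataset), not code from the repo:

import os.path
from torchdata.datapipes.iter import FileOpener, IterableWrapper

root = "data"  # assumed cache directory
archive = os.path.join(root, "corpus.tar")  # assumed downloaded archive
cache_compressed_dp = IterableWrapper([archive])

def _filter_fn(item):
    # Assumed: keep only the archive members this split actually needs.
    fname, _ = item
    return "train" in fname

# Current pattern: filter sits inside the on_disk_cache region, so the set
# of files written to the cache depends on what the filter keeps.
dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, os.path.basename(x))
)
dp = FileOpener(dp, mode="b").read_from_tar().filter(_filter_fn)
dp = dp.end_caching(mode="wb", same_filepath_fn=True)

# Suggested reordering: close the cache first, then filter, so the on-disk
# cache state never depends on the filter's output.
dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, os.path.basename(x))
)
dp = FileOpener(dp, mode="b").read_from_tar()
dp = dp.end_caching(mode="wb", same_filepath_fn=True)
dp = dp.filter(_filter_fn)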

@Nayef211
Contributor

@VitalyFedyunin just wondering, could this be the cause of the test failures I'm seeing in my PR? #1732 (comment)

@parmeet
Contributor

parmeet commented May 22, 2022

Yes, @VitalyFedyunin already told me about this. One thing I am wondering is why it is causing failures only in STSB and wikitext103? We do filtering between on_disk_cache and end_caching for all the compressed datasets. Should we change the order in the other datasets too?

@VitalyFedyunin
Contributor Author

Yes, please change all the datasets. I think it fails when you try to zip two datapipes in tests: since they are separate graph parts but lock the same files, a deadlock happens.
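
As a toy model of that failure shape (pure illustration, not torchdata code): two graph parts that each hold a non-reentrant lock on the same cache while producing items, zipped together.

import threading

cache_lock = threading.Lock()  # stands in for the on-disk cache's file lock

def graph_part(name):
    # Each graph part holds the lock for its whole "cache" region.
    with cache_lock:
        yield f"{name}-item"

# zip() advances the two generators in lockstep: the first side yields while
# still holding the lock, then the second side blocks trying to acquire it.
pairs = zip(graph_part("train"), graph_part("dev"))
# next(pairs)  # uncommenting this line hangs forever (the deadlock)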

@parmeet
Contributor

parmeet commented May 24, 2022

> Yes, please change all the datasets. I think it fails when you try to zip two datapipes in tests: since they are separate graph parts but lock the same files, a deadlock happens.

This PR fixes the issue: #1737. It seems that empty/non-existent files/paths were causing the problem, and it is not actually necessary to put filter after end_caching. In fact, this issue surfaced bugs in our mock testing for the failing datasets :). The reason we keep filter in between is that we do not want to dump all the files to disk, only the ones necessary to build the dataset.

@VitalyFedyunin
Contributor Author

I am just afraid that this filter pattern is error-prone. Even if I fix the deadlock (it is possible), putting filter there might lead to cache inconsistency (in case the filter output changes between runs).
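
To make the inconsistency concern concrete, a small toy model; the names and the cache check are hypothetical, only mimicking an on-disk cache's "skip the work if the cache directory is already populated" behavior:

import os
import tempfile

def cached_members(cache_dir, keep):
    # Toy cache: "extract" only the members passing `keep`, but skip the
    # whole extraction if the directory already has content, the way an
    # on-disk cache skips work once it believes the cache is populated.
    members = ["train.txt", "dev.txt", "test.txt"]
    if not os.listdir(cache_dir):
        for m in filter(keep, members):
            open(os.path.join(cache_dir, m), "w").close()
    return sorted(os.listdir(cache_dir))

d = tempfile.mkdtemp()
print(cached_members(d, lambda m: "train" in m))                # ['train.txt']
# Run 2 widens the filter, but the cache already "exists", so dev.txt is
# never materialized: the cache no longer matches what the filter implies.
print(cached_members(d, lambda m: "train" in m or "dev" in m))  # ['train.txt']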
