
Caching results of the filter will result in inconsistent cache state #1735

Closed
VitalyFedyunin opened this issue May 20, 2022 · 5 comments · Fixed by #1737

Comments

@VitalyFedyunin
Contributor

Currently it blocks, but we just got lucky:

cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").read_from_tar().filter(_filter_fn)

Please change the order of filter and end_caching.
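
For reference, a minimal sketch of the two orderings under discussion, assuming the usual torchdata on-disk-cache pattern; root, the archive path, and _filter_fn are stand-ins for the dataset-specific pieces (the exact filepath_fn wiring varies per dataset), not code from the repo:

import os.path
from torchdata.datapipes.iter import FileOpener, IterableWrapper

root = "data"  # assumed cache directory
archive = os.path.join(root, "corpus.tar")  # assumed downloaded archive
cache_compressed_dp = IterableWrapper([archive])

def _filter_fn(item):
    # Assumed: keep only the archive members this split actually needs.
    fname, _ = item
    return "train" in fname

# Current pattern: filter sits inside the on_disk_cache region, so the set
# of files written to the cache depends on what the filter keeps.
dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, os.path.basename(x))
)
dp = FileOpener(dp, mode="b").read_from_tar().filter(_filter_fn)
dp = dp.end_caching(mode="wb", same_filepath_fn=True)

# Suggested reordering: close the cache first, then filter, so the on-disk
# cache state never depends on the filter's output.
dp = cache_compressed_dp.on_disk_cache(
    filepath_fn=lambda x: os.path.join(root, os.path.basename(x))
)
dp = FileOpener(dp, mode="b").read_from_tar()
dp = dp.end_caching(mode="wb", same_filepath_fn=True)
dp = dp.filter(_filter_fn)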

@Nayef211
Contributor

@VitalyFedyunin just wondering, could this be the cause of the test failures I'm seeing in my PR? #1732 (comment)

@parmeet
Contributor

parmeet commented May 22, 2022

Yes, @VitalyFedyunin already told me about this. One thing I am wondering is why it is causing failures only in STSB and wikitext103? We do filtering between on_disk_cache and end_caching for all the compressed datasets. Should we change the order in the other datasets too?

@VitalyFedyunin
Contributor Author

Yes, please change all the datasets. I think it fails when you try to zip two datapipes in tests: since they are separate graph parts but lock the same files, a deadlock happens.
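
As a toy model of that failure shape (pure illustration, not torchdata code): two graph parts that each hold a non-reentrant lock on the same cache while producing items, zipped together.

import threading

cache_lock = threading.Lock()  # stands in for the on-disk cache's file lock

def graph_part(name):
    # Each graph part holds the lock for its whole "cache" region.
    with cache_lock:
        yield f"{name}-item"

# zip() advances the two generators in lockstep: the first side yields while
# still holding the lock, then the second side blocks trying to acquire it.
pairs = zip(graph_part("train"), graph_part("dev"))
# next(pairs)  # uncommenting this line hangs forever (the deadlock)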

@parmeet
Contributor

parmeet commented May 24, 2022

> Yes, please change all the datasets. I think it fails when you try to zip two datapipes in tests: since they are separate graph parts but lock the same files, a deadlock happens.

This PR fixes the issue: #1737. It seems that empty/non-existent files/paths were causing the problem, and it is not actually necessary to put filter after end_caching. In fact, this issue surfaced bugs in our mock testing for the failing datasets :). The reason we keep filter in between is that we do not want to dump all the files to disk, only the ones necessary to build the dataset.

@VitalyFedyunin
Contributor Author

I am just afraid that this filter pattern is error-prone. Even if I fix the deadlock (it is possible), putting filter there might lead to cache inconsistency (in case the filter output changes between runs).
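
To make the inconsistency concern concrete, a small toy model; the names and the cache check are hypothetical, only mimicking an on-disk cache's "skip the work if the cache directory is already populated" behavior:

import os
import tempfile

def cached_members(cache_dir, keep):
    # Toy cache: "extract" only the members passing `keep`, but skip the
    # whole extraction if the directory already has content, the way an
    # on-disk cache skips work once it believes the cache is populated.
    members = ["train.txt", "dev.txt", "test.txt"]
    if not os.listdir(cache_dir):
        for m in filter(keep, members):
            open(os.path.join(cache_dir, m), "w").close()
    return sorted(os.listdir(cache_dir))

d = tempfile.mkdtemp()
print(cached_members(d, lambda m: "train" in m))                # ['train.txt']
# Run 2 widens the filter, but the cache already "exists", so dev.txt is
# never materialized: the cache no longer matches what the filter implies.
print(cached_members(d, lambda m: "train" in m or "dev" in m))  # ['train.txt']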
