
Adding parameterized dataset pickling tests #1732

Merged · 6 commits · May 23, 2022

Conversation

@Nayef211 (Contributor) commented May 17, 2022

Description

Testing

pytest test/datasets/common.py

@Nayef211 (Contributor, Author)

@NivekT @ejguan I'm seeing some failures when trying to pickle the IWSLT datasets that look like the following. I believe open_files is a datapipe that resides in pytorch core. Any idea what could be going on?

______________________ TestDatasetPickling.test_pickling_09 ______________________

a = (<test.datasets.common.TestDatasetPickling testMethod=test_pickling_09>,)

    @wraps(func)
    def standalone_func(*a):
>       return func(*(a + p.args), **p.kwargs)

../../opt/miniconda3/envs/torchtext/lib/python3.9/site-packages/parameterized/parameterized.py:533:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/datasets/common.py:116: in test_pickling
    expected_samples, dp1 = _generate_mock_dataset(dataset_fn, get_mock_dataset_fn, self.root_dir)
test/datasets/common.py:73: in _generate_mock_dataset
    dp = dataset_fn(
torchtext/data/datasets_utils.py:193: in wrapper
    return fn(root=new_root, *args, **kwargs)
torchtext/data/datasets_utils.py:155: in new_fn
    result.append(fn(root, item, **kwargs))
torchtext/datasets/iwslt2017.py:235: in IWSLT2017
    cache_decompressed_dp = cache_decompressed_dp.open_files(mode="b").load_from_tar()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torchdata.datapipes.iter.util.cacheholder.OnDiskCacheHolderIterDataPipe object at 0x7fe37817c310>, attribute_name = 'open_files'

    def __getattr__(self, attribute_name):
        if attribute_name in IterDataPipe.functions:
            function = functools.partial(IterDataPipe.functions[attribute_name], self)
            return function
        else:
>           raise AttributeError("'{0}' object has no attribute '{1}".format(self.__class__.__name__, attribute_name))
E           AttributeError: 'OnDiskCacheHolderIterDataPipe' object has no attribute 'open_files

../../opt/miniconda3/envs/torchtext/lib/python3.9/site-packages/torch/utils/data/dataset.py:300: AttributeError

@Nayef211 Nayef211 requested a review from parmeet May 17, 2022 23:33
@Nayef211 Nayef211 changed the title Adding dataset pickling tests Adding parameterized dataset pickling tests May 17, 2022
@NivekT (Contributor) commented May 17, 2022

@Nayef211 The API open_files was only recently added to TorchData, what version of the library are you using? Is the error coming from one of the CI tests?

@Nayef211 (Contributor, Author)

@Nayef211 The API open_files was only recently added to TorchData, what version of the library are you using? Is the error coming from one of the CI tests?

Gotcha thanks for the heads up. I was only seeing these errors locally, but all the tests are passing after I installed the latest torchdata nightly!

@Nayef211 Nayef211 marked this pull request as ready for review May 18, 2022 02:52
@parmeet (Contributor) commented May 18, 2022

Thanks @Nayef211 for adding these tests. Actually, I had something much simpler in mind (pseudocode below, without putting it in a proper test suite). Let me know if this makes sense?

import pickle

from torchtext.datasets import DATASETS

def test_picklable(dp):
    if isinstance(dp, tuple):
        for dp_split in dp:
            pickle.dump(dp_split, open("temp.pkl", "wb"))  # ensure that no exception is raised here
    else:
        pickle.dump(dp, open("temp.pkl", "wb"))  # ensure that no exception is raised here

for f in DATASETS.values():
    dp = f()
    test_picklable(dp)

@Nayef211 (Contributor, Author)

Thanks @Nayef211 for adding these tests. Actually, I had something much simpler in mind (pseudocode below, without putting it in a proper test suite). Let me know if this makes sense?

import pickle

from torchtext.datasets import DATASETS

def test_picklable(dp):
    if isinstance(dp, tuple):
        for dp_split in dp:
            pickle.dump(dp_split, open("temp.pkl", "wb"))  # ensure that no exception is raised here
    else:
        pickle.dump(dp, open("temp.pkl", "wb"))  # ensure that no exception is raised here

for f in DATASETS.values():
    dp = f()
    test_picklable(dp)

Thanks for the suggestion here @parmeet. This implementation does look a lot simpler. I wonder if we want to test reloading the datapipe from the pickle file and verify that the order of the dataset makes sense? Or do we not think that's relevant to this test?

@parmeet (Contributor) commented May 18, 2022

I wonder if we want to test reloading the datapipe from the pickle file and verify that the order of the dataset makes sense?

Not sure if I understand what you meant by order?

Or do we not think that's relevant to this test?

Hmm, I don't know if this is required. If the pickle dump works, loading it should give back the exact same object, so I wonder whether it is necessary to check the sanity of loading the pickled datapipe. @ejguan, @NivekT do you have suggestions here, or do you think that just ensuring the pickling works is sufficient for our use case?

@Nayef211 (Contributor, Author)

Not sure if I understand what you meant by order?

Sorry, I meant to say the contents of the datapipe, which are tested using something like the following:

for sample, expected_sample in zip_equal(samples, expected_samples):
    self.assertEqual(sample, expected_sample)

Hmm, I don't know if this is required. If the pickle dump works, loading it should give back the exact same object, so I wonder whether it is necessary to check the sanity of loading the pickled datapipe. @ejguan, @NivekT do you have suggestions here, or do you think that just ensuring the pickling works is sufficient for our use case?

I guess if we do go with this approach, testing the loading is just one additional line and will allow us to verify that both the saving and loading mechanisms are working as intended. But I'm happy to hear suggestions from the torchdata folks.

@NivekT (Contributor) commented May 18, 2022

I think pickle.loads(pickle.dumps(dp)) should be enough to see if the serialization works, but the safest bet is passing it through DataLoader with multiprocessing (for the reason that @Nayef211 mentioned) and see if the sample that comes out is as expected.

cc: @ejguan
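One way to exercise the multiprocessing concern without pulling in DataLoader itself is to ship the object to a worker process, which forces it through the same pickling machinery that DataLoader uses with num_workers > 0. A stdlib-only sketch (the list is a stand-in for a datapipe; uses the "fork" start method, so POSIX only):

```python
from multiprocessing import get_context

def collect(obj):
    # Runs in the worker process; the argument was pickled to get here.
    return list(obj)

def check_picklable_across_processes(dp):
    # Pool.apply pickles `dp` to ship it to the worker, exercising the
    # same serialization path as DataLoader with num_workers > 0.
    with get_context("fork").Pool(1) as pool:
        return pool.apply(collect, (dp,))

samples = [("hello", "bonjour"), ("world", "monde")]
assert check_picklable_across_processes(samples) == samples
```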

@Nayef211 (Contributor, Author)

I think pickle.loads(pickle.dumps(dp)) should be enough to see if the serialization works, but the safest bet is passing it through DataLoader with multiprocessing (for the reason that @Nayef211 mentioned) and see if the sample that comes out is as expected.

For the sake of simplicity I think I will go with the approach @parmeet suggested. Maybe we can follow up with a test on a single dataset that verifies the contents of the datapipes are correct.

@parmeet (Contributor) left a comment

LGTM! Thanks @Nayef211 for adding this test :)

@parmeet (Contributor) commented May 19, 2022

There seem to be some test failures for the MNLI dataset. cc: @VirgileHlav

@Nayef211 (Contributor, Author) commented May 19, 2022

There seem to be some test failures for the MNLI dataset. cc: @VirgileHlav

I had a small bug in the implementation that was introduced when making the functions in MNLI global. It should be fixed now.
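For context on why hoisting the functions to module scope matters (this illustrates the general pickle rule, not the specific MNLI code): pickle serializes functions by their module-qualified name, so module-level functions pickle fine, while lambdas and functions defined inside another function do not:

```python
import pickle

def module_level(x):
    # Pickled by reference to its module-qualified name.
    return x

def make_local():
    def local_fn(x):
        return x
    return local_fn

pickle.dumps(module_level)  # works

try:
    pickle.dumps(make_local())
except (AttributeError, pickle.PicklingError):
    # Local functions have a __qualname__ containing "<locals>" and
    # cannot be looked up by name, so pickling them fails.
    print("local functions cannot be pickled")
```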

@Nayef211 (Contributor, Author)

@NivekT @ejguan @VitalyFedyunin I'm seeing some test failures that look related to the recent changes made to on_disk_cache by pytorch/data#409. Could this be a bug on your end or do we need to make changes to how we're utilizing that datapipe?
