
Adding parameterized dataset pickling tests #1732

Merged · 6 commits · May 23, 2022

Conversation

@Nayef211 (Contributor) commented May 17, 2022

Description

Testing

pytest test/datasets/common.py

@Nayef211 (Contributor, Author)

@NivekT @ejguan I'm seeing some failures when trying to pickle the IWSLT datasets that look like the following. I believe open_files is a datapipe that resides in pytorch core. Any idea what could be going on?

______________________ TestDatasetPickling.test_pickling_09 ______________________

a = (<test.datasets.common.TestDatasetPickling testMethod=test_pickling_09>,)

    @wraps(func)
    def standalone_func(*a):
>       return func(*(a + p.args), **p.kwargs)

../../opt/miniconda3/envs/torchtext/lib/python3.9/site-packages/parameterized/parameterized.py:533:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/datasets/common.py:116: in test_pickling
    expected_samples, dp1 = _generate_mock_dataset(dataset_fn, get_mock_dataset_fn, self.root_dir)
test/datasets/common.py:73: in _generate_mock_dataset
    dp = dataset_fn(
torchtext/data/datasets_utils.py:193: in wrapper
    return fn(root=new_root, *args, **kwargs)
torchtext/data/datasets_utils.py:155: in new_fn
    result.append(fn(root, item, **kwargs))
torchtext/datasets/iwslt2017.py:235: in IWSLT2017
    cache_decompressed_dp = cache_decompressed_dp.open_files(mode="b").load_from_tar()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <torchdata.datapipes.iter.util.cacheholder.OnDiskCacheHolderIterDataPipe object at 0x7fe37817c310>, attribute_name = 'open_files'

    def __getattr__(self, attribute_name):
        if attribute_name in IterDataPipe.functions:
            function = functools.partial(IterDataPipe.functions[attribute_name], self)
            return function
        else:
>           raise AttributeError("'{0}' object has no attribute '{1}".format(self.__class__.__name__, attribute_name))
E           AttributeError: 'OnDiskCacheHolderIterDataPipe' object has no attribute 'open_files

../../opt/miniconda3/envs/torchtext/lib/python3.9/site-packages/torch/utils/data/dataset.py:300: AttributeError

@Nayef211 Nayef211 requested a review from parmeet May 17, 2022 23:33
@Nayef211 Nayef211 changed the title Adding dataset pickling tests Adding parameterized dataset pickling tests May 17, 2022
@NivekT (Contributor) commented May 17, 2022

@Nayef211 The API open_files was only recently added to TorchData, what version of the library are you using? Is the error coming from one of the CI tests?

@Nayef211 (Contributor, Author)

@Nayef211 The API open_files was only recently added to TorchData, what version of the library are you using? Is the error coming from one of the CI tests?

Gotcha thanks for the heads up. I was only seeing these errors locally, but all the tests are passing after I installed the latest torchdata nightly!

@Nayef211 Nayef211 marked this pull request as ready for review May 18, 2022 02:52
@parmeet (Contributor) commented May 18, 2022

Thanks @Nayef211 for adding these tests. Actually, I had something much simpler in mind (pseudocode below, without putting it in a proper test suite). Let me know if this makes sense?

import pickle

from torchtext.datasets import DATASETS

def test_picklable(dp):
    if isinstance(dp, tuple):
        for dp_split in dp:
            pickle.dump(dp_split, open("temp.pkl", "wb"))  # ensure that no exception is raised here
    else:
        pickle.dump(dp, open("temp.pkl", "wb"))  # ensure that no exception is raised here

for f in DATASETS.values():
    dp = f()
    test_picklable(dp)

@Nayef211 (Contributor, Author)

Thanks @Nayef211 for adding these tests. Actually, I had something much simpler in mind (pseudocode below, without putting it in a proper test suite). Let me know if this makes sense?

import pickle

from torchtext.datasets import DATASETS

def test_picklable(dp):
    if isinstance(dp, tuple):
        for dp_split in dp:
            pickle.dump(dp_split, open("temp.pkl", "wb"))  # ensure that no exception is raised here
    else:
        pickle.dump(dp, open("temp.pkl", "wb"))  # ensure that no exception is raised here

for f in DATASETS.values():
    dp = f()
    test_picklable(dp)

Thanks for the suggestion here @parmeet. This implementation does look a lot simpler. I wonder if we want to test reloading the datapipe from the pickle file and verify that the order of the dataset makes sense? Or do we not think that's relevant to this test?

@parmeet (Contributor) commented May 18, 2022

I wonder if we want to test reloading the datapipe from the pickle file and verify that the order of the dataset makes sense?

Not sure if I understand what you meant by order?

Or do we not think that's relevant to this test?

Hmm, I don't know if this is required. If the pickle dump works, loading it should give back the exact same object, so I wonder whether it is necessary to check the sanity of loading the pickled datapipe. @ejguan, @NivekT do you have suggestions here, or do you think that just ensuring the pickling works is sufficient for our use case?

@Nayef211 (Contributor, Author)

Not sure if I understand what you meant by order?

Sorry, I meant to say the contents of the datapipe, which are tested using something like the following:

for sample, expected_sample in zip_equal(samples, expected_samples):
    self.assertEqual(sample, expected_sample)

Hmm, I don't know if this is required. If the pickle dump works, loading it should give back the exact same object, so I wonder whether it is necessary to check the sanity of loading the pickled datapipe. @ejguan, @NivekT do you have suggestions here, or do you think that just ensuring the pickling works is sufficient for our use case?

I guess if we do go with this approach, testing the loading is just one additional line and will allow us to verify that both the saving and loading mechanisms are working as intended. But I'm happy to hear suggestions from the torchdata folks.

@NivekT (Contributor) commented May 18, 2022

I think pickle.loads(pickle.dumps(dp)) should be enough to see if the serialization works, but the safest bet is passing it through DataLoader with multiprocessing (for the reason that @Nayef211 mentioned) and see if the sample that comes out is as expected.

cc: @ejguan
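One way to exercise the multiprocessing concern without pulling in DataLoader itself is to ship the object to a worker process, which forces it through the same pickling machinery that DataLoader uses with num_workers > 0. A stdlib-only sketch (the list is a stand-in for a datapipe; uses the "fork" start method, so POSIX only):

```python
from multiprocessing import get_context

def collect(obj):
    # Runs in the worker process; the argument was pickled to get here.
    return list(obj)

def check_picklable_across_processes(dp):
    # Pool.apply pickles `dp` to ship it to the worker, exercising the
    # same serialization path as DataLoader with num_workers > 0.
    with get_context("fork").Pool(1) as pool:
        return pool.apply(collect, (dp,))

samples = [("hello", "bonjour"), ("world", "monde")]
assert check_picklable_across_processes(samples) == samples
```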

@Nayef211 (Contributor, Author)

I think pickle.loads(pickle.dumps(dp)) should be enough to see if the serialization works, but the safest bet is passing it through DataLoader with multiprocessing (for the reason that @Nayef211 mentioned) and see if the sample that comes out is as expected.

For the sake of simplicity I think I will go with the approach @parmeet suggested. Maybe we can follow up with a test on a single dataset that verifies the contents of the datapipes are correct.

@parmeet (Contributor) left a comment

LGTM! Thanks @Nayef211 for adding this test :)

@parmeet (Contributor) commented May 19, 2022

There seem to be some test failures for the MNLI dataset. cc: @VirgileHlav

@Nayef211 (Contributor, Author) commented May 19, 2022

There seem to be some test failures for the MNLI dataset. cc: @VirgileHlav

I had a small bug in the implementation that was introduced when making the functions in MNLI global. It should be fixed now.
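For context on why hoisting the functions to module scope matters (this illustrates the general pickle rule, not the specific MNLI code): pickle serializes functions by their module-qualified name, so module-level functions pickle fine, while lambdas and functions defined inside another function do not:

```python
import pickle

def module_level(x):
    # Pickled by reference to its module-qualified name.
    return x

def make_local():
    def local_fn(x):
        return x
    return local_fn

pickle.dumps(module_level)  # works

try:
    pickle.dumps(make_local())
except (AttributeError, pickle.PicklingError):
    # Local functions have a __qualname__ containing "<locals>" and
    # cannot be looked up by name, so pickling them fails.
    print("local functions cannot be pickled")
```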

@Nayef211 (Contributor, Author)

@NivekT @ejguan @VitalyFedyunin I'm seeing some test failures that look related to the recent changes made to on_disk_cache by pytorch/data#409. Could this be a bug on your end or do we need to make changes to how we're utilizing that datapipe?
