
upgrade prototype test CI from 3.7 to 3.10 #5801

Closed · wants to merge 20 commits

Conversation

@pmeier (Collaborator) commented Apr 11, 2022

@ejguan @NivekT starting with Python >= 3.8 we are seeing a lot of I/O warnings that are turned into errors by our test suite. I was under the impression that we don't need to handle file closing. Has this changed?

@datumbox (Contributor) commented

@pmeier Sorry for jumping on a draft PR. To clarify, is this PR just an investigation, or do you eventually propose to merge it? Typically we want to pin all tests to the earliest supported version, so if the problems do disappear, we probably don't want to merge this but rather investigate and fix the underlying issue. That's probably what you have in mind as well, but just confirming. Thanks!

@pmeier (Collaborator, Author) commented Apr 11, 2022

For now, this is just an investigation. @NicolasHug saw some failures locally that neither I nor CI could reproduce. It turns out the errors (or rather, warnings turned into errors) only pop up for Python >= 3.8. This PR confirms that this is indeed the case.

It should not be merged in its current state, but it may serve as the basis for further investigation or fixes. I'll either close it or mark it as "ready to be reviewed".

@pmeier (Collaborator, Author) commented Apr 12, 2022

Looking into this more, I can't find any pattern other than that it only happens for Python >= 3.8:

  • Different runs produce this error a different number of times.
  • The error happens whether the loaded resource is a .tar, a .zip, or a custom file.
  • The error happens in tests that only draw a single sample as well as in tests that exhaust the datapipe.

@ejguan @NivekT do you have bandwidth to look into this or should I?

@ejguan (Contributor) commented Apr 12, 2022

Checking it.

@ejguan (Contributor) commented Apr 12, 2022

I might have found where the warnings come from. The tar file is not wrapped with our StreamWrapper class to suppress the unclosed-file warning: https://github.com/pytorch/data/blob/fd942eec986db373f76f521528570bfef2f1d22f/torchdata/datapipes/iter/util/tararchiveloader.py#L61.
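For context, a minimal sketch of what the wrapping does (the file is a stand-in; torchdata's StreamWrapper closes the wrapped stream when the wrapper is garbage collected):

import tempfile

from torchdata.datapipes.utils import StreamWrapper

# Create a stand-in file; in the datapipe this would be the tar archive.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

stream = StreamWrapper(open(path, "rb"))
# When `stream` is garbage collected, StreamWrapper closes the underlying
# file object, so no "unclosed file" ResourceWarning is emitted.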

BTW, I was checking the cub200 example. IIUC, the resources are tar files.

if self._year == "2011":
    archive = GDriveResource(
        "1hbzc_P1FuxMkcabkgn9ZKinBwW683j45",
        file_name="CUB_200_2011.tgz",
        sha256="0c685df5597a8b24909f6a7c9db6d11e008733779a671760afef78feb49bf081",
        preprocess="decompress",
    )
    segmentations = GDriveResource(
        "1EamOKGLoTuZdtcVYbHMWNpkn3iAVj8TP",
        file_name="segmentations.tgz",
        sha256="dc77f6cffea0cbe2e41d4201115c8f29a6320ecb04fffd2444f51b8066e4b84f",
        preprocess="decompress",
    )

But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer?

Also, how about the custom file? What kind of files are you referring to?

@NivekT (Contributor) commented Apr 12, 2022

> I might have found where the warnings come from. The tar file is not wrapped with our StreamWrapper class [...] But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer? [...] What kind of files are you referring to?

Looks like there is a preprocess step:

  1. It goes through OnlineResource:
     preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
  2. then _decompress:
     def _decompress(from_path: str, to_path: Optional[str] = None, remove_finished: bool = False) -> str:
  3. and finally:
     _COMPRESSED_FILE_OPENERS: Dict[str, Callable[..., IO]] = {

@ejguan (Contributor) commented Apr 12, 2022

I will spend more time on this issue.

@ejguan (Contributor) commented Apr 13, 2022

I might have found the culprit.

Taking cub200 as an example: when we construct the data pipeline using _datapipe(resources), we load a few file handles via

image_files_map = dict(
    (image_id, rel_posix_path.rsplit("/", maxsplit=1)[1])
    for image_id, rel_posix_path in image_files_dp
)

Before the dict image_files_map is created, there is a Demux holding file handles. The Demux is iterated before the data-loading pipeline itself is iterated, so its buffer gets filled with unclosed file handles. But I am not sure why the file handles within the buffer are not closed automatically by StreamWrapper.

I assume this part is the root cause, because I am able to eliminate the error from smoke_test for cub200 by adding extra logic to clean up the buffer after creating image_files_map:

images_dp.main_datapipe.child_buffers.clear()

I haven't had enough time to verify whether changing image_files_map to an IterToMapConverter (to support lazy loading) would solve the issue:
https://github.com/pytorch/data/blob/fd942eec986db373f76f521528570bfef2f1d22f/torchdata/datapipes/map/util/utils.py#L20
Or, alternatively, adding a proper __del__ to Demux.
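For illustration, a minimal sketch of how a demux child buffer fills up (using the functional demux API; illustrative only, not the cub200 pipeline):

from torchdata.datapipes.iter import IterableWrapper

# Split into even/odd children; exhausting one child forces the other
# child's items into the Demultiplexer's internal buffer.
even_dp, odd_dp = IterableWrapper(range(10)).demux(2, lambda x: x % 2)
list(even_dp)  # the odd items are now sitting in odd_dp's child buffer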

cc: @NivekT

@pmeier (Collaborator, Author) commented Apr 13, 2022

> But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer?

To complement Kevin's comment in #5801 (comment): the loading happens in

def _loader(self, path: pathlib.Path) -> IterDataPipe[Tuple[str, IO]]:
    if path.is_dir():
        return FileOpener(FileLister(str(path), recursive=True), mode="rb")
    dp = FileOpener(IterableWrapper((str(path),)), mode="rb")
    archive_loader = self._guess_archive_loader(path)
    if archive_loader:
        dp = archive_loader(dp)
    return dp

  • Folders are loaded with FileOpener(FileLister(...)).
  • Files are loaded with FileOpener(IterableWrapper(...)).
  • Archives are additionally passed through the respective archive loader datapipe so that folders and archives behave the same.

> But, how about the custom file? What kind of files are you referring to?

We have a failing test for

and that uses a .data file (a CSV in disguise) and features no Demultiplexer. Upon closer inspection, it seems like the warning is raised at the wrong time:

failed on setup with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <_io.FileIO [closed]>
[...]
Traceback (most recent call last):
  File "/home/circleci/.pyenv/versions/3.10.4/lib/python3.10/unittest/mock.py", line 2153, in __init__
    self.name = name
ResourceWarning: unclosed file <_io.FileIO name='/tmp/pytest-of-circleci/pytest-0/test_num_samples_sbd_train_nov0/sbd/benchmark.tgz' mode='rb' closefd=True>

So SBD is actually the culprit here. I'll go through all of them again to find the actual datasets that are causing this.

@pmeier (Collaborator, Author) commented Apr 13, 2022

Rechecking our logs and torchdata's, it turns out these datasets are the culprits:

This eliminates point 1 from #5801 (comment). My guess is that the warning is emitted during garbage collection and thus happens at a non-deterministic time.

Edit: I've forced GC after each test in 7fe20b6 and this confirms my suspicion. Now we only see the errors for the tests above.
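For reference, forcing collection after every test can be done with an autouse fixture along these lines (a sketch; what 7fe20b6 actually does may differ):

import gc

import pytest

@pytest.fixture(autouse=True)
def force_gc():
    yield
    # Collect garbage after every test so that ResourceWarnings from
    # unclosed file handles fire at a deterministic point.
    gc.collect()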

@pmeier (Collaborator, Author) commented Apr 13, 2022

As for the other similarities: all culprits feature a Demultiplexer, but cub200 is the only dataset that starts iterating during the construction of the datapipe. Another pointer is that we have other datasets that feature a Demultiplexer

but don't seem to emit this warning. imagenet even starts iterating during datapipe construction as described in #5801 (comment):

meta_dp, label_dp = Demultiplexer(
    devkit_dp, 2, self._classifiy_devkit, drop_none=True, buffer_size=INFINITE_BUFFER_SIZE
)
meta_dp = Mapper(meta_dp, self._extract_categories_and_wnids)
_, wnids = zip(*next(iter(meta_dp)))

@ejguan (Contributor) commented Apr 13, 2022

> My guess is that the warning is emitted during garbage collection and thus happens at a non-deterministic time.

That was my initial thought too. But StreamWrapper should suppress all unclosed-file warnings by automatically closing the streams at GC time. And the error/warning is weird, because it refers to _io.FileIO, but I believe open would return a BufferedReader.

@ejguan (Contributor) commented Apr 13, 2022

> but cub200 is the only dataset that starts iterating during the construction of the datapipe

Yeah, that's the reason only cub200 fails in the test_smoke case.

@ejguan (Contributor) commented Apr 13, 2022

Adding this patch seems to solve all the issues, based on my local test results.

--- a/torch/utils/data/datapipes/iter/combining.py
+++ b/torch/utils/data/datapipes/iter/combining.py
@@ -371,6 +371,10 @@ class _DemultiplexerIterDataPipe(IterDataPipe):
         self.instance_started = [False] * self.num_instances
         self.main_datapipe_exhausted = False

+    def __del__(self):
+        for q in self.child_buffers:
+            q.clear()
+

@NivekT Do you mind taking the lead on finding out why StreamWrapper does not close the inner file handle automatically? And how can we add a test to prevent such a scenario?

@NivekT (Contributor) commented Apr 14, 2022

I had a look and tried to reproduce the error locally with torchvision. I played around with cub200 and demux.

This is the simplest test case that I have and it seems to work fine.

from torchdata.datapipes.iter import FileOpener, IterableWrapper

dp = IterableWrapper([f"{i}.tar.gz" for i in [1, 2, 3]])
dp = FileOpener(dp, mode="rb").load_from_tar()
dp1, dp2 = dp.demux(2, lambda _: 0)

When the demux DataPipes are no longer needed, the file handles inside the buffer get closed correctly by StreamWrapper.

cub200 seems to be working locally for me as well. Let me know what code snippet you are running that produces a ResourceWarning.

I have been trying to figure out the following:

  1. If _DemultiplexerIterDataPipe is not cleaning up properly, what is the specific situation that causes the buffer not to be cleaned up?
    • Are there references to the DataPipes inside the buffer elsewhere, such that they don't get deleted?
  2. Is StreamWrapper's __del__ logic causing the issue here?

I am leaning towards the first being the issue here. Therefore, I think @ejguan's suggestion to explicitly clear out the buffer is reasonable, and we should do that unless we can figure out what specifically causes the buffer not to clear.

@ejguan (Contributor) commented Apr 18, 2022

You can reproduce the error with pytest test/test_prototype_builtin_datasets.py -v -k <pattern of test> @NivekT

@pmeier (Collaborator, Author) commented Apr 19, 2022

I'll try to compile a minimal reproduction outside of our test environment.

@pmeier (Collaborator, Author) commented Apr 19, 2022

I was able to pinpoint it to cases where the buffer of the Demultiplexer is not empty. For the datasets in #5801 (comment) the buffer is empty, and thus we are not seeing a failure.

@pmeier (Collaborator, Author) commented Apr 20, 2022

I'm starting to believe this is a test issue. I've switched the runner from Python 3.10 to 3.8 in 42c5e5c and now all errors are gone. Combined with the fact that I was unable to reproduce the errors seen in ce922cc, I think this is flaky as hell. Not sure where to go from here.

@pmeier (Collaborator, Author) commented Apr 20, 2022

After devoting a significant amount of time to this, I'm fairly confident it is not worth sinking any more time into it. My hunch is that this is a pytest issue, but I'm not sure.

Minimal steps to reproduce:

  • Use Python >= 3.8
  • Test a dataset that internally uses a Demultiplexer
  • Have any data stored in the _Demultiplexer().child_buffers

Even with that, the error sometimes shows up and sometimes doesn't. If someone else wants to debug further:

  • If you try to reproduce outside of the test suite, make sure ResourceWarnings are visible. Python ignores them by default; pytest enables them (see the sketch after this list).
  • It seems that the error points to the original archive not being closed, rather than the file handles extracted from it.
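A minimal sketch for making ResourceWarnings visible (and fatal) outside of pytest:

import warnings

# Python suppresses ResourceWarning by default; promote it to an error so
# that unclosed file handles fail loudly, mirroring pytest's behavior.
warnings.simplefilter("error", ResourceWarning)

# Equivalently, from the command line:
#   python -W error::ResourceWarning your_script.py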

@pmeier (Collaborator, Author) commented Apr 20, 2022

After some offline discussion with @NicolasHug, we think it is best to just ignore the warnings. I tested on

and CI was green.
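One way to ignore them is a pytest filter, for example (a sketch; the exact mechanism used in the PR may differ):

import pytest

# Silence the ResourceWarning for a specific test; the same filter can
# also be applied globally via `filterwarnings` in the pytest config.
@pytest.mark.filterwarnings("ignore::ResourceWarning")
def test_dataset_with_demux():
    ...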

@pmeier pmeier requested a review from NicolasHug April 20, 2022 12:55
@pmeier pmeier marked this pull request as ready for review April 20, 2022 12:55
@ejguan (Contributor) commented Apr 27, 2022

After pytorch/pytorch#76345 landed in the latest PyTorch nightly, the error caused by ResourceWarning is eliminated in the TorchVision domain test for TorchData.
But an error pops up due to the lambda function from

collate_fn=lambda batch: batch,

See: https://github.com/pytorch/data/runs/6195902638?check_suite_focus=true
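For illustration, replacing the lambda with a module-level function makes collate_fn picklable, which spawn-based worker processes require (a sketch, not the actual test code):

from torch.utils.data import DataLoader

def identity_collate(batch):
    # Defined at module level, so it can be pickled, unlike a lambda.
    return batch

if __name__ == "__main__":  # required when workers use the spawn method
    loader = DataLoader(list(range(8)), batch_size=4, num_workers=2,
                        collate_fn=identity_collate)
    print(list(loader))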

@pmeier (Collaborator, Author) commented Apr 28, 2022

> But an error pops up due to the lambda function from

I can't reproduce that in our own CI: https://app.circleci.com/pipelines/github/pytorch/vision/17117/workflows/612d3491-b1ec-4b2f-9dee-860a078c65c3/jobs/1387558

@pmeier (Collaborator, Author) commented Apr 28, 2022

The failures in torchdata's CI happen for Python 3.8/3.9 on macOS and Python 3.7/3.8/3.9 on Windows. The failure is

AttributeError: Can't pickle local object 'TestCommon.test_data_loader.<locals>.<lambda>'

which is correct, but this leads to more questions:

  1. Why aren't we seeing this failure on Linux, and, even more peculiar, on macOS when using Python 3.7?
  2. Why aren't we getting a warning about the lambda usage?

@ejguan (Contributor) commented Apr 28, 2022

> 1. Why aren't we seeing this failure on Linux, and, even more peculiar, on macOS when using Python 3.7?

This is because fork is the default multiprocessing start method on Linux, and on macOS up to Python 3.7 (it switched to spawn in 3.8). If you change the start method to spawn, the lambda function won't work.
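A quick way to see the difference (spawn pickles the DataLoader state, including collate_fn, to send it to the workers, whereas fork inherits the parent's memory):

import pickle

try:
    pickle.dumps(lambda batch: batch)
except Exception as exc:
    # Lambdas cannot be pickled, so spawn-based workers fail here.
    print(type(exc).__name__, exc)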

> 2. Why aren't we getting a warning about the lambda usage?

collate_fn is an argument to DataLoader, not to a DataPipe, so we didn't get a chance to validate it. We can add a warning to DataLoader for when collate_fn is a lambda and num_workers > 0.
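A hypothetical sketch of such a check (not actual DataLoader code; the function name and hook point are assumptions):

import warnings

def _warn_on_lambda_collate(collate_fn, num_workers):
    # A lambda cannot be pickled, so it breaks spawn-based workers.
    if num_workers > 0 and getattr(collate_fn, "__name__", None) == "<lambda>":
        warnings.warn(
            "collate_fn is a lambda; it cannot be pickled and will fail "
            "when num_workers > 0 with the spawn start method."
        )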

@pmeier (Collaborator, Author) commented Apr 28, 2022

Superseded by #5914.
