
upgrade prototype test CI from 3.7 to 3.10 #5801

Closed · wants to merge 20 commits

Conversation

@pmeier (Collaborator) commented Apr 11, 2022

@ejguan @NivekT starting with Python >= 3.8 we are seeing a lot of I/O warnings that are turned into errors by our test suite. I was under the impression that we don't need to handle file closing. Has this changed?

@datumbox (Contributor) commented

@pmeier Sorry for jumping on a draft PR. To clarify, is this PR just an investigation, or do you eventually propose to merge it? Typically we want to pin all tests to the earliest supported version, so if the problems do disappear, we probably don't want to merge this but rather investigate and fix the underlying issue. That's probably what you have in mind as well, but just confirming. Thanks!

@pmeier (Collaborator, Author) commented Apr 11, 2022

For now, this is just an investigation. @NicolasHug saw some failures locally that neither I nor CI could reproduce. It turns out the errors (or rather, warnings turned into errors) only pop up for Python >= 3.8. This PR confirms that this is indeed the case.

It should not be merged in its current state, but it may serve as the basis for further investigation or fixes. I'll either close it or mark it as "ready to be reviewed".

@pmeier (Collaborator, Author) commented Apr 12, 2022

Looking into this more, I can't find any pattern other than that it only happens for Python >= 3.8:

  • Different runs produce this error a different number of times.
  • The error happens whether the loaded resource is a .tar, a .zip, or a custom file.
  • The error happens in tests that only draw a single sample as well as in tests that exhaust the datapipe.

@ejguan @NivekT do you have bandwidth to look into this or should I?

@ejguan (Contributor) commented Apr 12, 2022

Checking it.

@ejguan (Contributor) commented Apr 12, 2022

I might have found where the warnings come from. The tar file is not wrapped with our StreamWrapper class to suppress the unclosed-file warning: https://github.com/pytorch/data/blob/fd942eec986db373f76f521528570bfef2f1d22f/torchdata/datapipes/iter/util/tararchiveloader.py#L61.
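For context, a minimal sketch of what the wrapping does (the file is a stand-in; torchdata's StreamWrapper closes the wrapped stream when the wrapper is garbage collected):

import tempfile

from torchdata.datapipes.utils import StreamWrapper

# Create a stand-in file; in the datapipe this would be the tar archive.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

stream = StreamWrapper(open(path, "rb"))
# When `stream` is garbage collected, StreamWrapper closes the underlying
# file object, so no "unclosed file" ResourceWarning is emitted.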

BTW, I was checking the cub200 example. IIUC, the resources are tar files.

if self._year == "2011":
    archive = GDriveResource(
        "1hbzc_P1FuxMkcabkgn9ZKinBwW683j45",
        file_name="CUB_200_2011.tgz",
        sha256="0c685df5597a8b24909f6a7c9db6d11e008733779a671760afef78feb49bf081",
        preprocess="decompress",
    )
    segmentations = GDriveResource(
        "1EamOKGLoTuZdtcVYbHMWNpkn3iAVj8TP",
        file_name="segmentations.tgz",
        sha256="dc77f6cffea0cbe2e41d4201115c8f29a6320ecb04fffd2444f51b8066e4b84f",
        preprocess="decompress",
    )

But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer?

Also, how about the custom file? What kind of files are you referring to?

@NivekT (Contributor) commented Apr 12, 2022

> I might have found where the warnings come from. The tar file is not wrapped with our StreamWrapper class [...] But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer? [...] What kind of files are you referring to?

Looks like there is a preprocess step:

  1. It goes through OnlineResource:
     preprocess: Optional[Union[Literal["decompress", "extract"], Callable[[pathlib.Path], pathlib.Path]]] = None,
  2. then _decompress:
     def _decompress(from_path: str, to_path: Optional[str] = None, remove_finished: bool = False) -> str:
  3. and finally:
     _COMPRESSED_FILE_OPENERS: Dict[str, Callable[..., IO]] = {

@ejguan (Contributor) commented Apr 12, 2022

I will spend more time on this issue.

@ejguan (Contributor) commented Apr 13, 2022

I might have found the culprit.

Taking cub200 as an example: when we construct the data pipeline using _datapipe(resources), we load a few file handles via

image_files_map = dict(
    (image_id, rel_posix_path.rsplit("/", maxsplit=1)[1])
    for image_id, rel_posix_path in image_files_dp
)

Before the dict image_files_map is created, there is a Demux holding file handles. The Demux is iterated before the data-loading pipeline itself is iterated, so its buffer gets filled with unclosed file handles. But I am not sure why the file handles within the buffer are not closed automatically by StreamWrapper.

I assume this part is the root cause, because I am able to eliminate the error from smoke_test for cub200 by adding extra logic to clean up the buffer after creating image_files_map:

images_dp.main_datapipe.child_buffers.clear()

I haven't had enough time to verify whether changing image_files_map to an IterToMapConverter (to support lazy loading) would solve the issue:
https://github.com/pytorch/data/blob/fd942eec986db373f76f521528570bfef2f1d22f/torchdata/datapipes/map/util/utils.py#L20
Or, alternatively, adding a proper __del__ to Demux.
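For illustration, a minimal sketch of how a demux child buffer fills up (using the functional demux API; illustrative only, not the cub200 pipeline):

from torchdata.datapipes.iter import IterableWrapper

# Split into even/odd children; exhausting one child forces the other
# child's items into the Demultiplexer's internal buffer.
even_dp, odd_dp = IterableWrapper(range(10)).demux(2, lambda x: x % 2)
list(even_dp)  # the odd items are now sitting in odd_dp's child buffer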

cc: @NivekT

@pmeier (Collaborator, Author) commented Apr 13, 2022

> But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer?

To complement Kevin's comment in #5801 (comment): the loading happens in

def _loader(self, path: pathlib.Path) -> IterDataPipe[Tuple[str, IO]]:
    if path.is_dir():
        return FileOpener(FileLister(str(path), recursive=True), mode="rb")
    dp = FileOpener(IterableWrapper((str(path),)), mode="rb")
    archive_loader = self._guess_archive_loader(path)
    if archive_loader:
        dp = archive_loader(dp)
    return dp

  • Folders are loaded with FileOpener(FileLister(...)).
  • Files are loaded with FileOpener(IterableWrapper(...)).
  • Archives are additionally passed through the respective archive loader datapipe so that folders and archives behave the same.

> But, how about the custom file? What kind of files are you referring to?

We have a failing test for

and that uses a .data file (a CSV in disguise) and features no Demultiplexer. Upon closer inspection, it seems like the warning is raised at the wrong time:

failed on setup with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <_io.FileIO [closed]>
[...]
Traceback (most recent call last):
  File "/home/circleci/.pyenv/versions/3.10.4/lib/python3.10/unittest/mock.py", line 2153, in __init__
    self.name = name
ResourceWarning: unclosed file <_io.FileIO name='/tmp/pytest-of-circleci/pytest-0/test_num_samples_sbd_train_nov0/sbd/benchmark.tgz' mode='rb' closefd=True>

So SBD is actually the culprit here. I'll go through all of them again to find the actual datasets that are causing this.

@pmeier (Collaborator, Author) commented Apr 13, 2022

Rechecking our logs and torchdata's, it turns out these datasets are the culprits:

This eliminates point 1 from #5801 (comment). My guess is that the warning is emitted during garbage collection and thus happens at a non-deterministic time.

Edit: I've forced GC after each test in 7fe20b6 and this confirms my suspicion. Now we only see the errors for the tests above.
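For reference, forcing collection after every test can be done with an autouse fixture along these lines (a sketch; what 7fe20b6 actually does may differ):

import gc

import pytest

@pytest.fixture(autouse=True)
def force_gc():
    yield
    # Collect garbage after every test so that ResourceWarnings from
    # unclosed file handles fire at a deterministic point.
    gc.collect()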

@pmeier (Collaborator, Author) commented Apr 13, 2022

As for the other similarities: all culprits feature a Demultiplexer, but cub200 is the only dataset that starts iterating during the construction of the datapipe. Another pointer is that we have other datasets that feature a Demultiplexer

but don't seem to emit this warning. imagenet even starts iterating during datapipe construction as described in #5801 (comment):

meta_dp, label_dp = Demultiplexer(
    devkit_dp, 2, self._classifiy_devkit, drop_none=True, buffer_size=INFINITE_BUFFER_SIZE
)
meta_dp = Mapper(meta_dp, self._extract_categories_and_wnids)
_, wnids = zip(*next(iter(meta_dp)))

@ejguan (Contributor) commented Apr 13, 2022

> My guess is that the warning is emitted during garbage collection and thus happens at a non-deterministic time.

That was my initial thought too. But StreamWrapper should suppress all unclosed-file warnings by automatically closing the streams at GC time. And the error/warning is weird, because it refers to _io.FileIO, but I believe open would return a BufferedReader.

@ejguan (Contributor) commented Apr 13, 2022

> but cub200 is the only dataset that starts iterating during the construction of the datapipe

Yeah, that's the reason only cub200 fails in the test_smoke case.

@ejguan (Contributor) commented Apr 13, 2022

Adding this patch seems to solve all the issues, based on my local test results.

--- a/torch/utils/data/datapipes/iter/combining.py
+++ b/torch/utils/data/datapipes/iter/combining.py
@@ -371,6 +371,10 @@ class _DemultiplexerIterDataPipe(IterDataPipe):
         self.instance_started = [False] * self.num_instances
         self.main_datapipe_exhausted = False

+    def __del__(self):
+        for q in self.child_buffers:
+            q.clear()
+

@NivekT Do you mind taking the lead on finding out why StreamWrapper does not close the inner file handle automatically? And how can we add a test to prevent such a scenario?

@NivekT (Contributor) commented Apr 14, 2022

I had a look and tried to reproduce the error locally with torchvision. I played around with cub200 and demux.

This is the simplest test case that I have and it seems to work fine.

from torchdata.datapipes.iter import FileOpener, IterableWrapper

dp = IterableWrapper([f"{i}.tar.gz" for i in [1, 2, 3]])
dp = FileOpener(dp, mode="rb").load_from_tar()
dp1, dp2 = dp.demux(2, lambda _: 0)

When the demux DataPipes are no longer needed, the file handles inside the buffer get closed correctly by StreamWrapper.

cub200 seems to be working locally for me as well. Let me know what code snippet you are running that produces a ResourceWarning.

I have been trying to figure out the following:

  1. If _DemultiplexerIterDataPipe is not cleaning up properly, what is the specific situation that causes the buffer not to be cleaned up?
    • Are there references to the DataPipes inside the buffer elsewhere, such that they don't get deleted?
  2. Is StreamWrapper's __del__ logic causing the issue here?

I am leaning towards the first being the issue here. Therefore, I think @ejguan's suggestion to explicitly clear out the buffer is reasonable, and we should do that unless we can figure out what specifically causes the buffer not to clear.

@ejguan (Contributor) commented Apr 18, 2022

You can reproduce the error with pytest test/test_prototype_builtin_datasets.py -v -k <pattern of test> @NivekT

@pmeier (Collaborator, Author) commented Apr 19, 2022

I'll try to compile a minimal reproduction outside of our test environment.

@pmeier (Collaborator, Author) commented Apr 19, 2022

I was able to pinpoint it to cases where the buffer of the Demultiplexer is not empty. For the datasets in #5801 (comment) the buffer is empty, and thus we are not seeing a failure.

@pmeier (Collaborator, Author) commented Apr 20, 2022

I'm starting to believe this is a test issue. I've switched the runner from Python 3.10 to 3.8 in 42c5e5c and now all errors are gone. Combined with the fact that I was unable to reproduce the errors seen in ce922cc, I think this is flaky as hell. Not sure where to go from here.

@pmeier (Collaborator, Author) commented Apr 20, 2022

After devoting a significant amount of time to this, I'm fairly confident it is not worth sinking any more time into it. My hunch is that this is a pytest issue, but I'm not sure.

Minimal steps to reproduce:

  • Use Python >= 3.8
  • Test a dataset that internally uses a Demultiplexer
  • Have any data stored in the _Demultiplexer().child_buffers

Even with that, the error sometimes shows up and sometimes doesn't. If someone else wants to debug further:

  • If you try to reproduce outside of the test suite, make sure ResourceWarnings are visible. Python ignores them by default; pytest enables them (see the sketch after this list).
  • It seems that the error points to the original archive not being closed, rather than the file handles extracted from it.
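A minimal sketch for making ResourceWarnings visible (and fatal) outside of pytest:

import warnings

# Python suppresses ResourceWarning by default; promote it to an error so
# that unclosed file handles fail loudly, mirroring pytest's behavior.
warnings.simplefilter("error", ResourceWarning)

# Equivalently, from the command line:
#   python -W error::ResourceWarning your_script.py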

@pmeier (Collaborator, Author) commented Apr 20, 2022

After some offline discussion with @NicolasHug, we think it is best to just ignore the warnings. I tested on

and CI was green.
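One way to ignore them is a pytest filter, for example (a sketch; the exact mechanism used in the PR may differ):

import pytest

# Silence the ResourceWarning for a specific test; the same filter can
# also be applied globally via `filterwarnings` in the pytest config.
@pytest.mark.filterwarnings("ignore::ResourceWarning")
def test_dataset_with_demux():
    ...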

@pmeier pmeier requested a review from NicolasHug April 20, 2022 12:55
@pmeier pmeier marked this pull request as ready for review April 20, 2022 12:55
@ejguan (Contributor) commented Apr 27, 2022

After pytorch/pytorch#76345 landed in the latest PyTorch nightly, the error caused by ResourceWarning is eliminated in the TorchVision domain test for TorchData.
But an error pops up due to the lambda function from

collate_fn=lambda batch: batch,

See: https://github.com/pytorch/data/runs/6195902638?check_suite_focus=true
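For illustration, replacing the lambda with a module-level function makes collate_fn picklable, which spawn-based worker processes require (a sketch, not the actual test code):

from torch.utils.data import DataLoader

def identity_collate(batch):
    # Defined at module level, so it can be pickled, unlike a lambda.
    return batch

if __name__ == "__main__":  # required when workers use the spawn method
    loader = DataLoader(list(range(8)), batch_size=4, num_workers=2,
                        collate_fn=identity_collate)
    print(list(loader))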

@pmeier (Collaborator, Author) commented Apr 28, 2022

> But an error pops up due to the lambda function from

I can't reproduce that in our own CI: https://app.circleci.com/pipelines/github/pytorch/vision/17117/workflows/612d3491-b1ec-4b2f-9dee-860a078c65c3/jobs/1387558

@pmeier (Collaborator, Author) commented Apr 28, 2022

The failures in torchdata's CI happen for Python 3.8/3.9 on macOS and Python 3.7/3.8/3.9 on Windows. The failure is

AttributeError: Can't pickle local object 'TestCommon.test_data_loader.<locals>.<lambda>'

which is correct, but this leads to more questions:

  1. Why aren't we seeing this failure on Linux, and, even more peculiar, on macOS when using Python 3.7?
  2. Why aren't we getting a warning about the lambda usage?

@ejguan (Contributor) commented Apr 28, 2022

> 1. Why aren't we seeing this failure on Linux, and, even more peculiar, on macOS when using Python 3.7?

This is because fork is the default multiprocessing start method on Linux, and on macOS up to Python 3.7 (it switched to spawn in 3.8). If you change the start method to spawn, the lambda function won't work.
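A quick way to see the difference (spawn pickles the DataLoader state, including collate_fn, to send it to the workers, whereas fork inherits the parent's memory):

import pickle

try:
    pickle.dumps(lambda batch: batch)
except Exception as exc:
    # Lambdas cannot be pickled, so spawn-based workers fail here.
    print(type(exc).__name__, exc)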

> 2. Why aren't we getting a warning about the lambda usage?

collate_fn is an argument to DataLoader, not to a DataPipe, so we didn't get a chance to validate it. We can add a warning to DataLoader for when collate_fn is a lambda and num_workers > 0.
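A hypothetical sketch of such a check (not actual DataLoader code; the function name and hook point are assumptions):

import warnings

def _warn_on_lambda_collate(collate_fn, num_workers):
    # A lambda cannot be pickled, so it breaks spawn-based workers.
    if num_workers > 0 and getattr(collate_fn, "__name__", None) == "<lambda>":
        warnings.warn(
            "collate_fn is a lambda; it cannot be pickled and will fail "
            "when num_workers > 0 with the spawn start method."
        )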

@pmeier (Collaborator, Author) commented Apr 28, 2022

Superseded by #5914.
