upgrade prototype test CI from 3.7 to 3.10 #5801
Conversation
@pmeier Sorry for jumping on a draft PR. To clarify, is this PR an investigation, or do you eventually propose to merge it? Typically we want to pin all tests to the earliest supported version, so if the problems indeed disappear, we probably don't want to merge this but rather investigate and fix the underlying issue. That's probably what you have in mind as well, but just confirming. Thanks!
For now, this is just an investigation. @NicolasHug saw some failures locally that neither I nor CI could reproduce. It turns out the errors (or rather, warnings turned into errors) only pop up for Python >= 3.8. This PR confirms that this is indeed the case. It should not be merged in its current state, but it may serve as the basis for further investigation or fixes. I'll either close it or mark it as "ready for review".
Looking into this more, I can't find any pattern other than that it only happens for Python >= 3.8:
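For context, here is a minimal, self-contained sketch (not the actual test code; all names are illustrative) of how an unclosed-file `ResourceWarning` surfaces at finalization time, which is the kind of warning being turned into an error here:

```python
# Illustrative sketch: leak a file handle and observe the ResourceWarning
# that CPython emits when the file object is finalized.
import gc
import warnings

def leak_file(path):
    f = open(path, "rb")
    f.read(1)
    # `f` is never closed; CPython emits a ResourceWarning when the
    # file object is garbage collected.

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ResourceWarning)
    leak_file(__file__)
    gc.collect()  # force finalization so the warning fires deterministically
```

A test suite configured with an `error` warnings filter would raise this as an exception instead of merely recording it.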
@ejguan @NivekT do you have bandwidth to look into this, or should I?
Checking it
I might have found where the warnings come from. BTW, I was checking the vision/torchvision/prototype/datasets/_builtin/cub200.py Lines 69 to 81 in 467b841
But I couldn't find the code that extracts files from the tar file. Could you please provide a code pointer? Also, how about the
Looks like there is a
I will spend more time on this issue.
I might have found the culprit. For the cub200 example, when we construct the data pipeline using vision/torchvision/prototype/datasets/_builtin/cub200.py Lines 188 to 190 in 467b841
Before creating the dict, I assume this part is the root cause, because I am able to eliminate the Error from it. I haven't had enough time to verify whether or not we should change this. cc: @NivekT
To complement Kevin's comment in #5801 (comment): the loading happens in vision/torchvision/prototype/datasets/utils/_resource.py Lines 62 to 72 in 467b841
We have a failing test for
and that uses a
So actually SBD is the culprit here. I'll go through all of them again to find the actual datasets that are causing this.
Rechecking our and torchdata's logs, it turns out these datasets are the culprits: This eliminates point 1 from #5801 (comment). My guess is that the warning is emitted during garbage collection and thus happens at a non-deterministic time. Edit: I've forced GC after each test in 7fe20b6, and this confirms my suspicion. Now we only see the errors for the tests above.
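The forced-GC approach could look roughly like this (a hypothetical sketch, not the contents of 7fe20b6; in a `conftest.py` the generator would be decorated with `@pytest.fixture(autouse=True)`):

```python
# Sketch: run a full garbage collection after every test so that
# finalizer-time warnings fire inside the test that leaked the object,
# rather than at some non-deterministic later point.
import gc

def force_gc_after_each_test():
    yield          # the test body runs here
    gc.collect()   # finalizers, and their ResourceWarnings, fire now
```

This attributes each GC-time warning to the test that created the leaked stream.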
As for the other similarities: all culprits feature a but don't seem to emit this warning. vision/torchvision/prototype/datasets/_builtin/imagenet.py Lines 175 to 180 in 467b841
That's my initial thought. But the StreamWrapper should suppress all unclosed warnings by automatically closing the streams at GC time. And the Error/Warning is weird because it refers to
Yeah. It's the reason that only
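For readers unfamiliar with it, the auto-close behavior being discussed can be sketched in pure Python (this is an illustration of the idea, not the real StreamWrapper implementation from PyTorch/torchdata):

```python
# Illustrative sketch of what StreamWrapper provides: proxy a stream and
# close it at finalization, so no "unclosed file" warning should be
# emitted for the wrapped stream.
import io

class StreamWrapperSketch:
    def __init__(self, stream):
        self.stream = stream

    def __getattr__(self, name):
        # Delegate read(), seek(), etc. to the underlying stream.
        return getattr(self.stream, name)

    def close(self):
        self.stream.close()

    def __del__(self):
        try:
            self.close()
        except Exception:
            pass  # never raise from a finalizer

# Demo: the wrapper delegates reads and can be closed explicitly.
wrapped = StreamWrapperSketch(io.BytesIO(b"payload"))
data = wrapped.read()
wrapped.close()
```

Given this, streams held by a StreamWrapper should not produce unclosed warnings, which is why the warning's origin is puzzling.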
Adding this patch seems to solve all the issues, based on my local test results.
@NivekT Do you mind taking the lead to see why
I had a look and tried to reproduce the error locally with This is the simplest test case that I have and it seems to work fine.

```python
dp = IterableWrapper([f'{i}.tar.gz' for i in [1, 2, 3]])
dp = FileOpener(dp, mode='rb').load_from_tar()
dp1, dp2 = dp.demux(2, lambda _: 0)
```

When the
I have been trying to figure out the following:
I am leaning towards the first thing being the issue here. Therefore, I think @ejguan's suggestion to explicitly clear out the buffer is reasonable, and we should do that unless we can figure out what specifically causes the buffer not to clear.
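To illustrate why a demux buffer can hold on to items (and hence open streams), here is a pure-Python sketch of a two-way demux with per-branch buffers. This is not torchdata's implementation; the names are illustrative:

```python
# Sketch: items classified for the branch that isn't currently being read
# sit in that branch's buffer. In the real pipelines those items are open
# streams, so an unconsumed buffer keeps streams alive.
from collections import deque

def demux2(iterable, classifier):
    it = iter(iterable)
    buffers = [deque(), deque()]

    def branch(idx):
        while True:
            if buffers[idx]:
                yield buffers[idx].popleft()
                continue
            try:
                item = next(it)
            except StopIteration:
                return
            if classifier(item) == idx:
                yield item
            else:
                buffers[1 - idx].append(item)  # lingers until the other branch reads

    return branch(0), branch(1)

evens, odds = demux2(range(6), lambda x: x % 2)
even_items = list(evens)  # pulls from the source; odd items pile up in the buffer
odd_items = list(odds)    # served entirely from the buffer
```

If a pipeline is never fully exhausted, anything still sitting in such a buffer is only released at garbage collection, which matches the non-deterministic warnings seen here.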
You can reproduce the Error from
I'll try to compile a minimal reproduction outside of our test environment.
I was able to pinpoint it to cases where the buffer of the
I'm starting to believe this is a test issue. I've switched the runner from Python 3.10 to 3.8 in 42c5e5c, and now all errors are gone. Combined with the fact that I was unable to reproduce the errors seen in ce922cc, I think this is flaky as hell. Not sure where to go from here.
After devoting quite a significant amount of time, I'm fairly confident it is not worth sinking any more time into this. My hunch is that this is

Minimal steps to reproduce:
Even with that, sometimes the error shows up and sometimes it doesn't. If someone else wants to debug further:
After some offline discussion with @NicolasHug, we think it is best to just ignore the warnings. I tested on and CI was green.
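Assuming the suite is driven by pytest and escalates warnings via an `error` filter, ignoring just these warnings could look like the following (a sketch; the exact warning message and existing filter setup are assumptions):

```ini
[pytest]
filterwarnings =
    error
    ignore:unclosed:ResourceWarning
```

The message prefix `unclosed` matches the "unclosed file"/"unclosed stream" ResourceWarnings while leaving all other warnings escalated to errors.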
After pytorch/pytorch#76345 landed in the latest PyTorch nightly, the Error caused by
See: https://github.com/pytorch/data/runs/6195902638?check_suite_focus=true
I can't reproduce that in our own CI: https://app.circleci.com/pipelines/github/pytorch/vision/17117/workflows/612d3491-b1ec-4b2f-9dee-860a078c65c3/jobs/1387558
The failures in
which is correct, but this leads to more questions:
This is because
Superseded by #5914. |
@ejguan @NivekT starting from Python >= 3.8, we are seeing a lot of I/O warnings that are turned into errors by our test suite. I was under the impression that we don't need to handle file closing. Has this changed?