-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skipping page https://pypi.org/simple/idna/ because of Content-Type: unknown #5345
Comments
FWIW we're seeing the same thing on Debian Linux, only with PyPy. If this has something to do with the state of the cache, the OS/interpreter may be a red herring and some file only used on that platform is corrupt in the cache. |
@mithrandi is yours also intermittent and affecting different packages each time? Seems like a weird pattern for a corrupt state file, but could be... |
@mithrandi also... I notice that I'm getting this on our jenkins builder – if you're hitting it in twisted (?), then maybe it's in buildbot? I'm wondering if it might have something to do with multiple pips accessing the same cache simultaneously, in CI systems that have less isolation between builds than like, Travis-CI does. |
@njsmith This is my day job's CI; it's a Docker container started by Drone.io. We run two concurrent builds (in separate containers, but sharing the same cache directory), one on CPython 2.7 (which I haven't seen fail this way) and one on PyPy (which seems to be failing quite frequently as of late), so your theory about concurrent access seems to hold up. EDIT: Also, it seems to be totally different packages each time, so the concurrency seems more likely to be at fault than a permanently corrupt something in the cache. |
We've occasionally seen this in CI for pyca/cryptography, but on our Windows builders. Our windows builders are the only ones which share the cache between multiple concurrent builds, so that seems related 🤔 |
Pip has some nice options for controlling cache access, like |
Apparently this problem started right after CDN migration last month and I would really be surprised not to be linked by it. On our CI system I identified at leat 130 failures and errors look like https://gist.github.com/ssbarnea/78df2a80137cca04ca72872b1d14498f#file-pypi-pip-failure-log-L34 The package name is irrelevant, and running again does not reproduce the same error. Disabling the local cache would be a disaster as we had a huge number of builds and we would endup putting a very high load on the CDN. Also I doubt platform has anything to do with this as I can confirm that it does happen on RHEL/Centos/MacOS. |
We identified the problem occurs when multiple jobs run pip install at the same time at the same node. As a workaround, we use |
I think I found a temporary workaround for avoiding this bug on Jenkins slaves with multiple workers: define This means that each executor will have a different cache directory so they would avoid clashing each other. As the number of executors is limited this would limit bit the increase in disk space needed. I hope that my comment would not be considered a reason for not fixing this issue as soon as possible as this bug does really break CI/CD systems. |
I'm not sure how meaningful this is but I noticed in @ssbarnea's log, pip doesn't actually seem to be initiating a new connection to PyPI to get the 304. Here's a debug log I reproduced when the cache is stale and pip reaches back out to PyPI and gets a 304:
Here's the log from https://gist.github.com/ssbarnea/78df2a80137cca04ca72872b1d14498f#file-pypi-pip-failure-log-L34:
Note the lack of a |
I'd be curious if this is reproducible with 9.0.3, looks like two of the related vendored libraries ( |
@di Most of our jobs are first upgrading pip and doing the rest after. This means that this problem reproduced with latest pip 10.0.1. |
If someone could chech if pip 9.0.3 also faces this issue, that would be nice. That'll bring us down to something in pip 10 as the cause of this issue. |
The CI system I work with sees this bug with pip 9.0.1. I also tested with 9.0.3 and confirmed it is affected too. |
I'm also seeing this issue but with a couple differences. First off here's the part of our job that will init the virtualenv and then do pip install: if [[ ! -d .venv ]]; then
virtualenv .venv
fi
source .venv/bin/activate
set -x
rm -f pip-install.log
pip install --log pip-install.log --upgrade pip setuptools
pip install --log pip-install.log --build .venv/tmp -r requirements.txt the jobs are setup to clean the working directory before each run (so essentially the (edit) |
Here's the line where this is happening, FYI: pip/src/pip/_internal/index.py Lines 819 to 831 in 80b9524
|
I don't claim to be an expert on this, but after searching, it seems like it's actually the recommended / expected behavior for 304 responses not to include a "Content-Type" header (which is triggering the "unknown"):
(from https://tools.ietf.org/html/rfc7232#section-4.1 ) If this is the case, then does it mean the CDN serving the 304 is behaving differently from before? |
Presumably any What seems to be happening is that either:
This is speculation on my part however, I haven't been able to reproduce this. We could rule out the first case by inspecting pip's http cache to see if any 304s are in there. Perhaps the second case has something to do with connection pooling? |
If I had a reliable way to reproduce this, I'd modify my pip so that when it happened, it dumped a lot more information about what exactly pip thinks it's looking at (like, all the headers and body). I suspect that seeing the bad data would help us narrow down where it was coming from. |
The problem is here and it seems that my workaround does work, or doesn't always work as I still see failures after implementing it.
But here is the surprise: the place where this happened was a job that was running tox targets in parallel! I bet I will be able to recreate the issue easily now. |
Well, I did few test and it seems that I am unable to replicate the error with parallel execution of tox. I managed to find the cached response, here is its content:
|
That looks like the cache for |
@ssbarnea FWIW you probably want to pip that to |
@alex @dstufft I updated dump of the cached entry. I hope you will be able to come up with a hotfix for it a these issues do cause lots of problems with our builds. My |
Disables parallel tox execution with standard sequential execution. Keeps comment pointing to pip bug that made us disable parallel execution. Removes bash code that was duplicating functionality parallel execution support, a feature which is part of tox, module named detox (shares same issue tracker as tox). Change-Id: I835d0b87fa74bfe0a00a6ee0deb985912129e029 See: pypa/pip#5345
@ssbarnea That still looks like the cache for |
Disables parallel tox execution with standard sequential execution. Keeps comment pointing to pip bug that made us disable parallel execution. Removes bash code that was duplicating functionality parallel execution support, a feature which is part of tox, module named detox (shares same issue tracker as tox). Change-Id: I835d0b87fa74bfe0a00a6ee0deb985912129e029 See: pypa/pip#5345 (cherry picked from commit 94ed037)
BTW, in case anyone else is foolish like me and tries to use XDG_CACHE_HOME on MacOS: it doesn't work, pip doesn't respect that on MacOS. I think setting |
Any chance of getting this fixed because even implementing the workaround in Jenkins seems to be a real struggle because of newly found bug I ended up doing a very ugly workaround in a shared library which overrides sh and injects this fix:
This is because on pipeline job types Jenikins is no longer interpolating environment variables. See https://issues.jenkins-ci.org/browse/JENKINS-51260 ... we can all do |
If you compare the log messages leading up to the problem:
with the code: pip/src/pip/_vendor/cachecontrol/controller.py Lines 199 to 225 in d67d98d
it appears that the response causing the problem isn't from the cache. Otherwise, I'd expect to see a message like the following in the logs:
Also, the docstring for this method says, "Return a cached response if it exists in the cache, otherwise return False." And in this case, from the logs, we can see it's returning False. |
Well a 304 is kind of both. It loads most of the data from the cache and just updates the headers.
…Sent from my iPhone
On May 12, 2018, at 6:22 AM, Chris Jerdonek ***@***.***> wrote:
If you compare the log messages leading up to the problem:
Freshness lifetime from max-age: 600
Freshness lifetime from request max-age: 600
https://pypi.org:443 "GET /simple/idna/ HTTP/1.1" 304 0
with the code: https://github.com/pypa/pip/blob/d67d98dd914e2ce80ece43594554f0a226558db0/src/pip/_vendor/cachecontrol/controller.py#L199-L225
it appears that the response causing the problem isn't from the cache.
Otherwise, I'd expect to see a message like the following in the logs:
The response is "fresh", returning cached response.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
There is something that I so not really understand: clearly the problem appears when pip gets a 404 (not found) message which is clearly not something that should ever be cached. |
It's a 304, not a 404. Pip is looking at the cache, and finding an entry, but the entry is old so pip doesn't trust that it's still accurate. So it asks pypi "hey, have there been any changes to this page since (date)?" and pypi says "304" which means "nope, it's still the same, so you can go ahead and use the version you already have, no need to download it again". |
I also see this issue intermittently on djangoci.com. |
Any update on this issue? I am finding similar issues too |
Use sepatarate pip cache dir per worker and parallel step see pypa/pip#5345
This looks the same as #6970, based on:
We determined that the likely cause is that the file-based CacheControl cache is not safe for concurrent read/write. See my comment here for the specific sequence of events that can lead to this. This should be a lot less likely in 19.3+ because we implemented our own (safer) mechanism for updating the file-based cache, however it is still possible if an external process deletes files from the cache. I will close this eagerly in favor of tracking the "complete" solution in #6970, but if anyone has evidence that this is different, please speak up! |
Description:
We've been hitting an intermittent error in downloading packages from PyPI using pip 10.0.1. So far it seems to only hit our MacOS, python 3.5 CI system (not python 3.6 on MacOS, and not any versions on other OSes). The actual package varies, but we always get an error like
or
We finally managed to trigger the error with pip's
-vvv
flag enabled, and it said:(Full log here)
So to my ignorant eye, it looks like:
https://pypi.org/simple/idna/
pageI don't know how to debug further, so... help?
(our bug: python-trio/trio#508)
The text was updated successfully, but these errors were encountered: