Cache warming exceeds GCR rate limiting #778
This appears to be the result of Google deprecating v1:
Mitigation: tune the daemon arguments. The errors in question are:
That last bit is the clue: GCR appears to no longer be serving v1 schema manifests. We assume v1 manifests are available for all images (which seems to be true elsewhere). On the other hand, Docker Hub doesn't seem to serve schema2 manifests for all images, only relatively recent ones. Sigh. We might have to look at the Content-Type to see which schema we got.

https://gist.github.com/squaremo/b3b832d6a3b9e0cc07fb35b854a66d47 has a program I used to see what the schema2 manifests look like, using the same docker registry client we use (https://github.com/heroku/docker-registry-client).

If we can't use schema1 manifests, we can use schema2. It takes an extra step to fetch the Config blob, as shown in the gist, but that has the information we need. It looks like this:
NB the created datetime (though we should check it matches what we get from the schema1 manifests).
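As a sketch of that extra step: with schema2 you read the config blob's digest out of the manifest, fetch that blob, and take the created datetime from it. The JSON documents below are abbreviated stand-ins (the digest and timestamp are made up); real ones come from the registry API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestV2 models the one field of a schema2 manifest we need here:
// the digest of the config blob, which must be fetched separately.
type manifestV2 struct {
	Config struct {
		MediaType string `json:"mediaType"`
		Digest    string `json:"digest"`
	} `json:"config"`
}

// imageConfig models the config blob; unlike a schema1 manifest, the
// created timestamp lives here rather than in the manifest itself.
type imageConfig struct {
	Created string `json:"created"`
}

// configDigest extracts the config blob digest from a schema2 manifest.
func configDigest(manifest []byte) (string, error) {
	var m manifestV2
	if err := json.Unmarshal(manifest, &m); err != nil {
		return "", err
	}
	return m.Config.Digest, nil
}

// createdTime extracts the created datetime from a config blob.
func createdTime(blob []byte) (string, error) {
	var c imageConfig
	if err := json.Unmarshal(blob, &c); err != nil {
		return "", err
	}
	return c.Created, nil
}

func main() {
	manifest := []byte(`{"config":{"mediaType":"application/vnd.docker.container.image.v1+json","digest":"sha256:deadbeef"}}`)
	blob := []byte(`{"created":"2017-09-01T12:00:00Z"}`)

	digest, _ := configDigest(manifest)
	created, _ := createdTime(blob)
	fmt.Println(digest, created)
}
```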
Also, I happened to look at the code that gets the auth token and learned there is a client library we can use, which might have proper means of refreshing the token based on its TTL. This is how the client library can be used:
I've copied that code from another thread I had with some folks at Google. Just FYI, thought this might be of interest.
flux:1.0.2-pre appears to work as expected in my dev, test, and prod clusters. Thanks!
I'm going to close this one, with a coda: the underlying problem was that all requests to GCR were trivially failing (fixed in large part by a back-ported #780, and latterly #801).

We have rate limiting on the requests we make: 200 a second (you are limited to 200 requests over a 1s window), with 125 burst (i.e., you can have 125 requests in play at once). If all requests fail quickly, that means you can get 125 or so errors in the log in a very short period, which is certainly alarming.

It's not necessarily the case that GCR will throttle or otherwise reject requests out of hand, though you may want to tune the rps and burst down for other reasons (https://cloud.google.com/container-registry/pricing, for example).
Reported by a customer.

Logs for the weave-flux-agent show that the agent is making way too many requests in a short period of time. Caller shows as warming.go: https://github.com/weaveworks/flux/blob/master/registry/warming.go

This seems to have started happening since the upgrade to 1.0.1.