
Set number of http and memcache connections to be equal #687

Merged
merged 2 commits into master from memcache-io-errors on Aug 25, 2017

Conversation

philwinder
Contributor

@philwinder philwinder commented Aug 4, 2017

  • Use the MaxIdleConns parameter in the memcache library. This ensures that this many connections are kept open and available for use.
  • Set the number of HTTP connections and memcache connections to be equal in the warmer. This is required because the number of HTTP fetches cannot exceed the number of memcache connections; if it did, memcache would return connection errors. A sketch of the wiring follows below.
  • Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675
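
A minimal sketch of the intended wiring, assuming the bradfitz/gomemcache client (the `newWarmerClient` helper, hostname, and burst value are illustrative, not the code in this PR):

```go
package main

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newWarmerClient builds a memcache client whose idle-connection pool is
// sized to match the warmer's fetch concurrency, plus a semaphore that caps
// concurrent registry fetches at the same number.
func newWarmerClient(host string, burst int) (*memcache.Client, chan struct{}) {
	c := memcache.New(host)
	c.Timeout = time.Second
	// Keep up to `burst` connections open for reuse rather than dialling
	// memcached for every GET/SET.
	c.MaxIdleConns = burst

	// Each in-flight fetch needs a memcache connection, so limit fetches
	// to the same number as pooled connections.
	fetchSlots := make(chan struct{}, burst)
	return c, fetchSlots
}

func main() {
	warmerCache, fetchSlots := newWarmerClient("memcached:11211", 125)
	_, _ = warmerCache, fetchSlots // wiring into the warmer omitted
}
```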

@philwinder philwinder requested a review from squaremo August 4, 2017 13:35
@squaremo
Member

We mitigated the problem we were having ourselves by giving memcache operations a generous timeout. The rationale is that we care less about doing these quickly than we care about succeeding at operations. If memcached is under load, it's better to treat that as backpressure, and slow down, than to repeatedly fail (though quickly).

However: there is merit in this PR I think, since we can still try to make things run quickly when they can. It seems reasonable to me that letting the client keep a larger number of idle connections around, as the only user of the memcached in question, is an improvement.

I'm less convinced that we understand how the fetching and storing perform (i.e., the argument that the request burst number and the max idle connections should be the same), or that this is the end of the story. So I think some more digging is warranted here.

@philwinder
Contributor Author

philwinder commented Aug 24, 2017

There are a few improvements in this PR.

  1. We update the memcache client library to a version that has the MaxIdleConns setting. This setting initially creates that many connections to memcache and holds them open.

    This is better than requesting new connections because memcache is slow to open new connections (it handles connection requests on the same thread).

  2. We create separate memcache clients for the warmer and the registry. The idea behind this is that the warmer needs lots of connections to do lots of reading and writing, while the registry client only needs a small number because a user or the web UI is unlikely to make concurrent registry requests.

  3. We set the number of HTTP goroutines to match the number of warmer memcache client connections.

    This is because if there are more HTTP goroutines, there will not be enough memcache connections to service each remote GET. Remember that the code first does a memcache GET to see whether the item is already in the cache; if not, it fetches the item and then does a memcache SET to store it.

    Hence, if we're fetching lots of manifests at the same time in the HTTP goroutines, we need a corresponding number of memcache connections to check whether each manifest is already cached and to store it if it is not. A rough sketch of this flow follows this list.
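
A rough sketch of that check-then-fetch-then-store flow, assuming the bradfitz/gomemcache client (`fetchManifest` and `fetchFromRegistry` are illustrative names, not the code in this repository):

```go
package cache

import (
	"fmt"

	"github.com/bradfitz/gomemcache/memcache"
)

// fetchManifest checks memcache first and only goes to the registry on a
// miss, writing the result back so later lookups hit the cache. Each
// concurrent call needs a memcache connection for the Get and, on a miss,
// again for the Set.
func fetchManifest(mc *memcache.Client, key string, fetchFromRegistry func(string) ([]byte, error)) ([]byte, error) {
	item, err := mc.Get(key)
	if err == nil {
		return item.Value, nil // already cached
	}
	if err != memcache.ErrCacheMiss {
		return nil, fmt.Errorf("memcache get: %v", err)
	}

	// Cache miss: fetch the manifest from the registry over HTTP...
	value, err := fetchFromRegistry(key)
	if err != nil {
		return nil, err
	}

	// ...and store it for subsequent lookups.
	if err := mc.Set(&memcache.Item{Key: key, Value: value}); err != nil {
		return nil, fmt.Errorf("memcache set: %v", err)
	}
	return value, nil
}
```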

So I think the issues you were talking about are orthogonal to this PR: they refer to the total timeout for getting the set of manifests for a single image. Currently we're at a 10s timeout; if it takes longer than that, all the other requests get cancelled. Also see #713.

@squaremo
Member

There are a few improvements in this PR.

We update the memcache client library to a version that has the MaxIdleConns setting. This setting initially creates that many connections to memcache and holds them open.

I thought at first MaxIdleConns might put a cap on the number of open connections, and/or create them ahead of time; but it doesn't. The client just closes any connections in excess of MaxIdleConns (per memcached address) after using them, and it doesn't open any ahead of time.
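
To illustrate: the pooling behaviour amounts to something like the sketch below (a paraphrase, not the library's actual source). It never limits how many connections can be opened; it only limits how many are kept idle for reuse.

```go
package pool

import "net"

// putFreeConn sketches MaxIdleConns-style behaviour: after an operation the
// connection is either kept idle for reuse or closed, depending on how many
// idle connections are already held for that address.
func putFreeConn(idle []net.Conn, c net.Conn, maxIdle int) []net.Conn {
	if len(idle) >= maxIdle {
		c.Close() // over the idle limit: discard rather than pool
		return idle
	}
	return append(idle, c) // under the limit: keep it for reuse
}
```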

This is better than requesting new connections because memcache is slow to open new connections (it handles connection requests on the same thread).

Did we get evidence that that was the specific problem we're having? I remember seeing both connection timeouts and I/O timeouts in logs (more so the latter, when stressing it locally).

We create separate memcache clients for the warmer and the registry. The idea behind this is that the warmer needs lots of connections to do lots of reading and writing, while the registry client only needs a small number because a user or the web UI is unlikely to make concurrent registry requests.

There's also the automator checking for new versions, though that's only once every minute (unless people decide they need it to be more often). I think it's a reasonable change to separate the memcache clients.

We set the number of HTTP goroutines to match the number of warmer memcache client connections.

This is because if there are more HTTP goroutines, there will not be enough memcache connections to service each remote GET. Remember that the code first does a memcache GET to see whether the item is already in the cache; if not, it fetches the item and then does a memcache SET to store it.

Hence, if we're fetching lots of manifests at the same time in the HTTP goroutines, we need a corresponding number of memcache connections to check whether each manifest is already cached and to store it if it is not.

I don't think things are quite as absolute as you've described here: there are always memcached connections available, since the client will just create them. But I suppose we could expect less churn of connections if we usually have as many connections idle as there are operations to do.

So I think the issues you were talking about are orthogonal to this PR: they refer to the total timeout for getting the set of manifests for a single image. Currently we're at a 10s timeout; if it takes longer than that, all the other requests get cancelled. Also see #713.

Nope, I was definitely talking about memcache operations timing out.

@philwinder
Contributor Author

Did we get evidence that that was the specific problem we're having? I remember seeing both connection timeouts and I/O timeouts in logs (more so the latter, when stressing it locally).

The evidence was that without the MaxIdleConns setting, we were getting memcache timeouts; with it, we are not.

there are always memcached connections available, since the client will just create them.

The client will create them, but memcached accepts new connections too slowly. Imagine you make 1000 HTTP requests: that causes 1000 new memcache connections, and memcached is very slow to accept new connections. See the documentation here: https://github.com/memcached/memcached/wiki/Performance#max-clients

And there are other complaints around the internet. Their recommendation is:

Use persistent connections ...

This is what MaxIdleConns does.

Nope, I was definitely talking about memcache operations timing out.

My misunderstanding. The current default is 1s. This can be altered at the command line.
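
For example, the timeout can be raised via the flag shown in the diff below (the `fluxd` invocation here is illustrative; the flag name and its 1s default come from the changeset):

```sh
fluxd --memcached-hostname=memcached --memcached-timeout=5s
```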

Member

@squaremo squaremo left a comment


Some requests for renaming/redescribing things in light of what MaxIdleConns does. (Please also rewrite the commit message.)

defaultMemcacheConnections = 10
// The number of connections chosen for memcache and remote GETs must match.
// If they do not match then the remote either creates too many memcache connections
// or not enough to be efficient.


@@ -78,11 +79,11 @@ func main() {
 	memcachedHostname = fs.String("memcached-hostname", "", "Hostname for memcached service to use when caching chunks. If empty, no memcached will be used.")
 	memcachedTimeout = fs.Duration("memcached-timeout", time.Second, "Maximum time to wait before giving up on memcached requests.")
 	memcachedService = fs.String("memcached-service", "memcached", "SRV service used to discover memcache servers.")
-	memcachedConnections = fs.Int("memcached-connections", defaultMemcacheConnections, "maximum number of connections to memcache")
+	memcachedConnections = fs.Int("memcached-connections", defaultMemcacheConnections, "maximum number of registry connections to memcache")


Host: *memcachedHostname,
Service: *memcachedService,
Timeout: *memcachedTimeout,
UpdateInterval: 1 * time.Minute,
Logger: log.NewContext(logger).With("component", "memcached"),
MaxConnections: *memcachedConnections,


Timeout: *memcachedTimeout,
UpdateInterval: 1 * time.Minute,
Logger: log.NewContext(logger).With("component", "memcached"),
MaxConnections: *registryBurst,


@squaremo
Member

The client will create them, but memcached accepts new connections too slowly. Imagine you make 1000 HTTP requests: that causes 1000 new memcache connections, and memcached is very slow to accept new connections. See the documentation here: https://github.com/memcached/memcached/wiki/Performance#max-clients

Sure, but we're not doing 1000 HTTP requests; we're globally rate-limiting them to about a hundred at a time.

I do take your point about connection churn -- having more idle connections around is going to help with that. I just don't want to oversell it. Or rather, I don't want to be claiming that the memcache client was being limited to a number of connections -- that's not the case, and therefore not something that's being alleviated here.

@philwinder
Contributor Author

philwinder commented Aug 24, 2017

I don't want to be claiming that the memcache client was being limited to a number of connections

Sure thing. But the point stands: we were getting timeout errors because memcached couldn't accept connections quickly enough.

I'll fix the review comments.

- Remove `memcached-connections` parameter
- Reworded comments
- Renamed variables
@philwinder philwinder merged commit 31896dd into master Aug 25, 2017
@squaremo squaremo deleted the memcache-io-errors branch June 22, 2018 10:39