
Memcache I/O and connection timeouts #675

Closed
squaremo opened this issue Jul 25, 2017 · 10 comments

@squaremo
Member

I see strings of log entries from the memcache client:

630626-ts=2017-07-24T14:03:17Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
630770-ts=2017-07-24T14:03:17Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"
630912-ts=2017-07-24T14:03:17Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
631056:ts=2017-07-24T14:03:18Z caller=memcached.go:179 component=memcached err="storing in memcache: read tcp 172.17.0.21:47202->172.17.0.24:11211: i/o timeout"
631210:ts=2017-07-24T14:03:18Z caller=warming.go:161 component=warmer err="storing manifests in cache: read tcp 172.17.0.21:47202->172.17.0.24:11211: i/o timeout"
631366-ts=2017-07-24T14:03:19Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"
631508-ts=2017-07-24T14:03:19Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
631652-ts=2017-07-24T14:03:19Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"

This also affects us in dev. It leads to automated services not being updated, and to the UI showing no entries for services (because it either doesn't have a full complement of images, or cannot complete a query for all the images).

@squaremo
Member Author

In addition, we could add these metrics to help detect problems:

  • image registry GET failures
  • when we fail to write what we've fetched to memcache
  • total time to run through all the images used and fetch anything new
  • memcache read failures
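
A rough sketch of how those might be exposed as Prometheus counters using client_golang; the metric names here are made up for illustration, not flux's actual metric names:

```go
package registry

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counters for the failure modes listed above; the names are
// illustrative only, not metrics that flux actually exports.
var (
	registryGetFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "registry_get_failures_total",
		Help: "Failed GETs against the image registry.",
	})
	memcacheWriteFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcache_write_failures_total",
		Help: "Failures writing fetched manifests to memcache.",
	})
	memcacheReadFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcache_read_failures_total",
		Help: "Failures reading from memcache.",
	})
)

func init() {
	prometheus.MustRegister(registryGetFailures, memcacheWriteFailures, memcacheReadFailures)
}
```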

@squaremo
Member Author

One possibly related thing is that our dev flux is constantly thinking that entries have expired:

ts=2017-07-28T13:02:31Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:02:31Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:02:31Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...
ts=2017-07-28T13:26:50Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:26:50Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:26:50Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...
ts=2017-07-28T13:30:03Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:30:03Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:30:03Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...

@squaremo
Member Author

squaremo commented Aug 1, 2017

This also affects us in dev, and leads to automated services not being updated

The problem here is that we abandon an automation run as soon as there's an error. So, if there's an error reading from memcache, you will not be going to space today.
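
To illustrate the alternative, a sketch of an automation run that records per-service errors and carries on instead of bailing on the first one; the types and helpers here are stand-ins, not flux's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sketch: rather than abandoning the automation run on the first
// memcache read error, record it and carry on with the remaining services.
// Service, imagesFor and maybeRelease are illustrative stand-ins only.
type Service struct{ Name string }

func updateAutomatedServices(services []Service) error {
	var errs []error
	for _, svc := range services {
		images, err := imagesFor(svc) // may fail with a memcache timeout
		if err != nil {
			errs = append(errs, fmt.Errorf("fetching images for %s: %w", svc.Name, err))
			continue // one flaky read shouldn't stop the whole run
		}
		if err := maybeRelease(svc, images); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func imagesFor(svc Service) ([]string, error)         { return nil, nil } // stub
func maybeRelease(svc Service, images []string) error { return nil }      // stub
```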

@philwinder
Contributor

philwinder commented Aug 1, 2017

I think something is grossly wrong here. Timeouts don't really have anything to do with flux itself. Maybe memcache is maxed out? Or maybe it's not running?

One thing I have seen is that when memcache or nats disconnect/crash, flux struggles to reconnect to the new instances. I'm not sure why yet.

Regarding the metrics, most of those should be available. Definitely:

  • Get failures (flux_client_fetch_duration_seconds{success=false})
  • when we fail to write what we've fetched to memcache (flux_registry_fetch_duration_seconds)
  • memcache read failures (flux_registry_fetch_duration_seconds{success=false})

Although this isn't covered:

  • total time to run through all the images used and fetch anything new

You've only got the time per single fetch.
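
For the whole-cycle number, a sketch of wrapping the warming loop in a single histogram observation (the metric and function names are hypothetical):

```go
package warming

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// warmCycleDuration would measure the total time to walk every image in use
// and fetch anything new, as opposed to the per-fetch durations we already
// have. The name is illustrative, not an actual flux metric.
var warmCycleDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "flux_warmer_cycle_duration_seconds",
	Help: "Total time to run through all images in use and fetch anything new.",
})

func init() {
	prometheus.MustRegister(warmCycleDuration)
}

func warmAll(repos []string, warmOne func(string) error) {
	start := time.Now()
	defer func() {
		warmCycleDuration.Observe(time.Since(start).Seconds())
	}()
	for _, repo := range repos {
		_ = warmOne(repo) // per-fetch metrics already capture individual failures
	}
}
```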

@squaremo
Member Author

squaremo commented Aug 1, 2017

One thing I have seen is that when memcache or nats disconnect/crash, flux struggles to reconnect to the new instances. I'm not sure why yet.

Me neither, but I have noticed that kube-dns crashes an awful lot, and sometimes outbound connections fail because they can't resolve a hostname.

@philwinder
Contributor

philwinder commented Aug 1, 2017

kube-dns crashes an awful lot, and sometimes outbound connections fail because they can't resolve a hostname.

Yes, this would cause problems. We should definitely investigate this.

@philwinder
Contributor

Regarding the I/O timeouts, there are a couple of things that could be at fault. The issues fall into two main camps:

  • Resources
  • Concurrent read/writes

The timeout is actually raised on our end. We have a setting --memcache-timeout that defaults to 1s. If the request takes longer than that, it will time out.
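
For context, with bradfitz/gomemcache that timeout sits on the client itself; presumably this is roughly what --memcache-timeout feeds into, so anything slower surfaces as the connect/read timeouts in the logs above. A minimal sketch:

```go
package registry

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newMemcacheClient is a sketch, not flux's actual constructor: the Timeout
// field covers connects as well as reads/writes, so requests slower than it
// show up as "connect timeout" or "i/o timeout" errors.
func newMemcacheClient(hosts []string, timeout time.Duration) *memcache.Client {
	c := memcache.New(hosts...)
	c.Timeout = timeout // e.g. 1 * time.Second, the flag's default
	return c
}
```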

First, resources. This is less likely, but if memcache is maxing out its CPU/RAM then it will struggle to service the request. Some people complain about memcache being swapped out onto disk, even when there is free RAM. When any of this happens you will get elongated response times. (https://groups.google.com/forum/#!topic/memcached/H9g48R8-AfQ)

Second, and more likely, concurrent requests. Memcached's default limit on concurrent connections is 1024, and the --registry-burst parameter, which defines the number of concurrent connections on our side, defaults to 125. BUT, I have seen delays when you hit memcache with many requests all at the same time. I.e. it can handle 1024 connections without dropping any, but when you issue 1024 requests all at once you do get some latency.
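
As a sketch of the kind of bound --registry-burst implies, a buffered channel works as a semaphore so at most that many fetches (and the memcache writes that follow) are in flight at once; the helper names here are made up:

```go
package warming

import "sync"

// fetchAll is an illustrative sketch: cap the number of concurrent registry
// fetches (and the memcache writes that follow them) so we never open more
// connections to memcached than it can comfortably service at once.
// burst would come from --registry-burst (default 125).
func fetchAll(repos []string, burst int, fetchAndStore func(string) error) {
	sem := make(chan struct{}, burst)
	var wg sync.WaitGroup
	for _, repo := range repos {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once burst fetches are in flight
		go func(repo string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			_ = fetchAndStore(repo)
		}(repo)
	}
	wg.Wait()
}
```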

I'll do some more debugging locally to see if I can try to reproduce errors like these (although I don't see them in day-to-day development).

@squaremo
Member Author

squaremo commented Aug 2, 2017

I'll do some more debugging locally to see if I can try to reproduce errors like these (although I don't see them in day-to-day development).

I am able to provoke timeouts by running images with lots of versions (probably some sock shop images would be good for this purpose), and restarting memcached so that the warmer has to repopulate it.

philwinder added a commit that referenced this issue Aug 4, 2017
- Use the MaxIdleConns parameter in the memcache library. Ensures that this many connections are available to use.
- Set the number of HTTP connections and memcache connections to be equal in the warmer. This is required because the number of HTTP fetches cannot exceed the number of memcache connections. If it did, memcache would return a connection error.
- Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675
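
The shape of that fix, roughly, is to size the memcache client's connection pool to match the warmer's HTTP concurrency; a sketch using bradfitz/gomemcache (the surrounding function is illustrative, not the actual flux code):

```go
package registry

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newWarmerMemcacheClient sketches the idea in the commit above: give the
// memcache client at least as many idle connections as the warmer has
// concurrent HTTP fetches, so a burst of fetches can't exhaust the memcache
// connections and fail with connection errors. A separate client would be
// created for the read path (users querying images) so warming can't starve it.
func newWarmerMemcacheClient(hosts []string, maxHTTPFetches int, timeout time.Duration) *memcache.Client {
	c := memcache.New(hosts...)
	c.Timeout = timeout
	c.MaxIdleConns = maxHTTPFetches // match connections to --registry-burst
	return c
}
```
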
@squaremo
Member Author

squaremo commented Aug 7, 2017

flux is constantly thinking that entries have expired

This could happen if there were new images in quick succession; each will have a staggered expiry time. But I'm not sure that is the case here. (On the other hand, I struggle to explain it another way.)

@squaremo
Member Author

We have a setting --memcache-timeout that defaults to 1s. If the request takes longer than that, it will timeout.

We were explicitly setting this to 100ms in our dev environment, which seemed to catch a lot of the tail of the latency distribution whenever memcached was under any kind of load. It's probably a reasonable setting if you are only checking for the presence of cache entries (and that's probably why it was set to that originally). But now that we're using memcached as a kind of local database, it's much better to leave it at the default and simply slow down. (We removed the argument to let it default to 1s.)

philwinder added a commit that referenced this issue Aug 25, 2017
* Set number of http and memcache connections to be equal

- Use the MaxIdleConns parameter in the memcache library. Ensures that this many connections are available to use.
- Set the number of HTTP connections and memcache connections to be equal in the warmer. This is required because the number of HTTP fetches cannot exceed the number of memcache connections. If it did, memcache would return a connection error.
- Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675

* #687 review comments

- Remove `memcached-connections` parameter
- Reworded comments
- Renamed variables