
Memcache I/O and connection timeouts #675

Closed
squaremo opened this issue Jul 25, 2017 · 10 comments

@squaremo
Member

I see strings of log entries from the memcache client:

630626-ts=2017-07-24T14:03:17Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
630770-ts=2017-07-24T14:03:17Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"
630912-ts=2017-07-24T14:03:17Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
631056:ts=2017-07-24T14:03:18Z caller=memcached.go:179 component=memcached err="storing in memcache: read tcp 172.17.0.21:47202->172.17.0.24:11211: i/o timeout"
631210:ts=2017-07-24T14:03:18Z caller=warming.go:161 component=warmer err="storing manifests in cache: read tcp 172.17.0.21:47202->172.17.0.24:11211: i/o timeout"
631366-ts=2017-07-24T14:03:19Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"
631508-ts=2017-07-24T14:03:19Z caller=warming.go:161 component=warmer err="storing manifests in cache: memcache: connect timeout to 172.17.0.24:11211"
631652-ts=2017-07-24T14:03:19Z caller=memcached.go:179 component=memcached err="storing in memcache: memcache: connect timeout to 172.17.0.24:11211"

This also affects us in dev. It leads to automated services not being updated, and to the UI showing no entries for services (because it either doesn't have a full complement of images, or cannot complete a query for all the images).

@squaremo
Member Author

In addition, we could add these metrics to help detect problems:

  • image registry GET failures
  • when we fail to write what we've fetched to memcache
  • total time to run through all the images used and fetch anything new
  • memcache read failures
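
A rough sketch of how those might be exposed as Prometheus counters using client_golang; the metric names here are made up for illustration, not flux's actual metric names:

```go
package registry

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counters for the failure modes listed above; the names are
// illustrative only, not metrics that flux actually exports.
var (
	registryGetFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "registry_get_failures_total",
		Help: "Failed GETs against the image registry.",
	})
	memcacheWriteFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcache_write_failures_total",
		Help: "Failures writing fetched manifests to memcache.",
	})
	memcacheReadFailures = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "memcache_read_failures_total",
		Help: "Failures reading from memcache.",
	})
)

func init() {
	prometheus.MustRegister(registryGetFailures, memcacheWriteFailures, memcacheReadFailures)
}
```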

@squaremo
Member Author

One possibly related thing is that our dev flux is constantly thinking that entries have expired:

ts=2017-07-28T13:02:31Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:02:31Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:02:31Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...
ts=2017-07-28T13:26:50Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:26:50Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:26:50Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...
ts=2017-07-28T13:30:03Z caller=warming.go:124 component=warmer fetching=quay.io/weaveworks/billing-ingester:master-6c369ea to-update=1
ts=2017-07-28T13:30:03Z caller=warming.go:127 component=warmer expiring=quay.io/weaveworks/billing-ingester
ts=2017-07-28T13:30:03Z caller=warming.go:167 component=warmer updated=quay.io/weaveworks/billing-ingester
...

@squaremo
Member Author

squaremo commented Aug 1, 2017

This also affects us in dev, and leads to automated services not being updated

The problem here is that we abandon an automation run as soon as there's an error. So, if there's an error reading from memcache, you will not be going to space today.
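
To illustrate the alternative, a sketch of an automation run that records per-service errors and carries on instead of bailing on the first one; the types and helpers here are stand-ins, not flux's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sketch: rather than abandoning the automation run on the first
// memcache read error, record it and carry on with the remaining services.
// Service, imagesFor and maybeRelease are illustrative stand-ins only.
type Service struct{ Name string }

func updateAutomatedServices(services []Service) error {
	var errs []error
	for _, svc := range services {
		images, err := imagesFor(svc) // may fail with a memcache timeout
		if err != nil {
			errs = append(errs, fmt.Errorf("fetching images for %s: %w", svc.Name, err))
			continue // one flaky read shouldn't stop the whole run
		}
		if err := maybeRelease(svc, images); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

func imagesFor(svc Service) ([]string, error)         { return nil, nil } // stub
func maybeRelease(svc Service, images []string) error { return nil }      // stub
```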

@philwinder
Contributor

philwinder commented Aug 1, 2017

I think something is grossly wrong here. Timeouts don't really have anything to do with flux itself. Maybe memcache is maxed out? Or maybe it's not running?

One thing I have seen is that when memcache or nats disconnect/crash, flux struggles to reconnect to the new instances. I'm not sure why yet.

Regarding the metrics, most of those should be available. Definitely:

  • Get failures (flux_client_fetch_duration_seconds{success=false})
  • when we fail to write what we've fetched to memcache (flux_registry_fetch_duration_seconds)
  • memcache read failures (flux_registry_fetch_duration_seconds{success=false})

Although this isn't covered:

  • total time to run through all the images used and fetch anything new

You've only got the time per single fetch.
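
For the whole-cycle number, a sketch of wrapping the warming loop in a single histogram observation (the metric and function names are hypothetical):

```go
package warming

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// warmCycleDuration would measure the total time to walk every image in use
// and fetch anything new, as opposed to the per-fetch durations we already
// have. The name is illustrative, not an actual flux metric.
var warmCycleDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "flux_warmer_cycle_duration_seconds",
	Help: "Total time to run through all images in use and fetch anything new.",
})

func init() {
	prometheus.MustRegister(warmCycleDuration)
}

func warmAll(repos []string, warmOne func(string) error) {
	start := time.Now()
	defer func() {
		warmCycleDuration.Observe(time.Since(start).Seconds())
	}()
	for _, repo := range repos {
		_ = warmOne(repo) // per-fetch metrics already capture individual failures
	}
}
```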

@squaremo
Member Author

squaremo commented Aug 1, 2017

One thing I have seen is that when memcache or nats disconnect/crash, flux struggles to reconnect to the new instances. I'm not sure why yet.

Me neither, but I have noticed that kube-dns crashes an awful lot, and sometimes outbound connections fail because they can't resolve a hostname.

@philwinder
Contributor

philwinder commented Aug 1, 2017

kube-dns crashes an awful lot, and sometimes outbound connections fail because they can't resolve a hostname.

Yes, this would cause problems. We should definitely investigate this.

@philwinder
Contributor

Regarding the I/O timeouts, there are a couple of things that could be at fault. The issues fall into two main camps:

  • Resources
  • Concurrent read/writes

The timeout is actually raised on our end. We have a setting --memcache-timeout that defaults to 1s. If the request takes longer than that, it will time out.
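
For context, with bradfitz/gomemcache that timeout sits on the client itself; presumably this is roughly what --memcache-timeout feeds into, so anything slower surfaces as the connect/read timeouts in the logs above. A minimal sketch:

```go
package registry

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newMemcacheClient is a sketch, not flux's actual constructor: the Timeout
// field covers connects as well as reads/writes, so requests slower than it
// show up as "connect timeout" or "i/o timeout" errors.
func newMemcacheClient(hosts []string, timeout time.Duration) *memcache.Client {
	c := memcache.New(hosts...)
	c.Timeout = timeout // e.g. 1 * time.Second, the flag's default
	return c
}
```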

First, resources. This is less likely, but if memcache is maxing out its CPU/RAM then it will struggle to service the request. Some people complain about memcache being swapped out onto disk, even when there is free RAM. When any of this happens you will get elongated response times. (https://groups.google.com/forum/#!topic/memcached/H9g48R8-AfQ)

Second, and more likely, concurrent requests. Memcached's default limit on concurrent connections is 1024, and the --registry-burst parameter, which defines the number of concurrent connections on our side, defaults to 125. BUT, I have seen delays when you hit memcache with many requests all at the same time. I.e. it can handle 1024 connections without dropping any, but when you issue 1024 requests all at once you do get some latency.
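
As a sketch of the kind of bound --registry-burst implies, a buffered channel works as a semaphore so at most that many fetches (and the memcache writes that follow) are in flight at once; the helper names here are made up:

```go
package warming

import "sync"

// fetchAll is an illustrative sketch: cap the number of concurrent registry
// fetches (and the memcache writes that follow them) so we never open more
// connections to memcached than it can comfortably service at once.
// burst would come from --registry-burst (default 125).
func fetchAll(repos []string, burst int, fetchAndStore func(string) error) {
	sem := make(chan struct{}, burst)
	var wg sync.WaitGroup
	for _, repo := range repos {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot; blocks once burst fetches are in flight
		go func(repo string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			_ = fetchAndStore(repo)
		}(repo)
	}
	wg.Wait()
}
```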

I'll do some more debugging locally to see if I can try to reproduce errors like these (although I don't see them in day-to-day development).

@squaremo
Member Author

squaremo commented Aug 2, 2017

I'll do some more debugging locally to see if I can try to reproduce errors like these (although I don't see them in day-to-day development).

I am able to provoke timeouts by running images with lots of versions (probably some sock shop images would be good for this purpose), and restarting memcached so that the warmer has to repopulate it.

philwinder added a commit that referenced this issue Aug 4, 2017
- Use the MaxIdleConns parameter in the memcache library. Ensures that this many connections are available to use.
- Set the number of HTTP connections and memcache connections to be equal in the warmer. This is required because the number of HTTP fetches cannot exceed the number of memcache connections. If it did, memcache would return a connection error.
- Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675
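
The shape of that fix, roughly, is to size the memcache client's connection pool to match the warmer's HTTP concurrency; a sketch using bradfitz/gomemcache (the surrounding function is illustrative, not the actual flux code):

```go
package registry

import (
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

// newWarmerMemcacheClient sketches the idea in the commit above: give the
// memcache client at least as many idle connections as the warmer has
// concurrent HTTP fetches, so a burst of fetches can't exhaust the memcache
// connections and fail with connection errors. A separate client would be
// created for the read path (users querying images) so warming can't starve it.
func newWarmerMemcacheClient(hosts []string, maxHTTPFetches int, timeout time.Duration) *memcache.Client {
	c := memcache.New(hosts...)
	c.Timeout = timeout
	c.MaxIdleConns = maxHTTPFetches // match connections to --registry-burst
	return c
}
```
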
@squaremo
Member Author

squaremo commented Aug 7, 2017

flux is constantly thinking that entries have expired

This could happen if there were new images in quick succession; each will have a staggered expiry time. But I'm not sure that is the case here. (On the other hand, I struggle to explain it another way.)

@squaremo
Member Author

We have a setting --memcache-timeout that defaults to 1s. If the request takes longer than that, it will timeout.

We were explicitly setting this to 100ms in our dev environment, which seemed to catch a lot of the tail of the latency distribution whenever memcached was under any kind of load. It's probably a reasonable setting if you are only checking for the presence of cache entries (and that's probably why it was set to that originally). But now that we're using memcached as a kind of local database, it's much better to leave it at the default and simply slow down. (We removed the argument to let it default to 1s.)

philwinder added a commit that referenced this issue Aug 25, 2017
* Set number of http and memcache connections to be equal

- Use the MaxIdleConns parameter in the memcache library. Ensures that this many connections are available to use.
- Set the number of HTTP connections and memcache connections to be equal in the warmer. This is required because the number of HTTP fetches cannot exceed the number of memcache connections. If it did, memcache would return a connection error.
- Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675

* #687 review comments

- Remove `memcached-connections` parameter
- Reworded comments
- Renamed variables