Memcache I/O and connection timeouts #675
In addition, we could add these metrics, to help detect problems:
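The concrete metric list isn't preserved here, but as a rough illustration of the kind of instrumentation being suggested, here is a hedged sketch using Prometheus's Go client. The metric name, labels, and helper are placeholders, not what flux actually exports:

```go
package cachemetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric: duration of each memcache operation, labelled by
// method and whether it succeeded, so timeouts and errors show up in the tail.
var memcacheRequestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "memcache_request_duration_seconds",
		Help: "Time taken for a single memcache operation.",
	},
	[]string{"method", "success"},
)

func init() {
	prometheus.MustRegister(memcacheRequestDuration)
}

// ObserveRequest records one memcache call, e.g.
//   defer cachemetrics.ObserveRequest("GetKey", time.Now(), &err)
func ObserveRequest(method string, start time.Time, err *error) {
	success := "true"
	if *err != nil {
		success = "false"
	}
	memcacheRequestDuration.WithLabelValues(method, success).Observe(time.Since(start).Seconds())
}
```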
One possibly related thing is that our dev flux is constantly thinking that entries have expired:
The problem here is that we abandon an automation run as soon as there's an error. So, if there's an error reading from memcache, you will not be going to space today.
I think something is grossly wrong here. Timeouts don't really have anything to do with flux itself. Maybe memcache is maxed out? Or maybe it's not running? One thing I have seen is that when memcache or nats disconnect/crash, flux struggles to reconnect to the new instances. I'm not sure why yet. Regarding the metrics, most of those should be available. Definitely:
Although this isn't covered:
You've only got the time per single fetch.
Me neither, but I have noticed that kube-dns crashes an awful lot, and sometimes outbound connections fail because they can't resolve a hostname.
Yes, this would cause problems. We should definitely investigate this.
Regarding the I/O timeouts, there are a couple of things that could be at fault. The timeout itself is actually raised on our end (we have a setting for it), but the causes fall into two main camps.

First, resources. This is less likely, but if memcache is maxing out its cpu/ram then it will struggle to service requests. Some people complain about memcache being swapped out onto disk, even when there is free ram. When any of this happens you will get elongated response times. (https://groups.google.com/forum/#!topic/memcached/H9g48R8-AfQ)

Second, and more likely, concurrent requests. The default number of allowed concurrent connections in memcached is 1024.

I'll do some more debugging locally to see if I can try to reproduce errors like these (although I don't see them in day-to-day development).
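For context, here is a minimal sketch of the two knobs in play, using the bradfitz/gomemcache client directly. The values and server address are illustrative, not flux's actual configuration:

```go
package main

import (
	"fmt"
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

func main() {
	mc := memcache.New("memcached:11211")

	// Client-side deadline: an "i/o timeout" error is returned if a single
	// operation takes longer than this (the library's own default is 100ms).
	mc.Timeout = 1 * time.Second

	// Idle connections kept open per server; bursts beyond this open new
	// connections, which count towards memcached's server-side limit
	// (1024 by default, configurable with `memcached -c`).
	mc.MaxIdleConns = 10

	if err := mc.Set(&memcache.Item{Key: "ping", Value: []byte("pong")}); err != nil {
		fmt.Println("memcache set failed:", err)
	}
}
```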
I am able to provoke timeouts by running images with lots of versions (probably some sock shop images would be good for this purpose), and restarting memcached so that the warmer has to repopulate it. |
- Use the MaxIdleConns parameter in the memcache library, to ensure that this many connections are available to use.
- Set the number of http connections and memcache connections to be equal in the warmer. This is required because the number of http fetches cannot exceed the number of memcache connections; if it did, requests would fail with a memcache connection error.
- Create a separate memcache client for warming and for a user reading images from the registry.

Fixes #675
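A rough sketch of the "equal connection counts" idea described above: the warmer only allows as many concurrent registry fetches as there are memcache connections, so a burst of fetches can never exhaust the cache's pool. The `Warmer` type and `fetchAndStore` helper here are hypothetical, not the actual flux code:

```go
package warmer

import (
	"github.com/bradfitz/gomemcache/memcache"
)

type Warmer struct {
	cache *memcache.Client
	sem   chan struct{} // capacity == number of memcache connections
}

func NewWarmer(servers []string, conns int) *Warmer {
	mc := memcache.New(servers...)
	mc.MaxIdleConns = conns
	return &Warmer{
		cache: mc,
		sem:   make(chan struct{}, conns),
	}
}

// fetchAndStore acquires a slot before the registry fetch, so the number of
// in-flight fetches (and hence cache writes) never exceeds conns.
func (w *Warmer) fetchAndStore(key string, fetch func() ([]byte, error)) error {
	w.sem <- struct{}{}
	defer func() { <-w.sem }()

	val, err := fetch()
	if err != nil {
		return err
	}
	return w.cache.Set(&memcache.Item{Key: key, Value: val})
}
```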
This could happen if there were new images in quick succession; each will have a staggered expiry time. But I'm not sure that is the case here. (On the other hand, I struggle to explain it another way.)
We were explicitly setting this to 100ms in our dev environment, which seemed to catch a lot of the tail of the latency distribution if memcached is under any kind of load. It's probably a reasonable setting if you are checking for the presence of cache entries (and that's probably why it was set to that originally). But now that we're using memcached as a kind of local database, it's much better to leave it at the default and simply slow down. (We removed the argument to let it default to 1s.)
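If you want to see that latency tail for yourself, here is a small, self-contained sketch that times a batch of Gets against memcached and prints a few percentiles; the server address and key are placeholders:

```go
package main

import (
	"fmt"
	"sort"
	"time"

	"github.com/bradfitz/gomemcache/memcache"
)

func main() {
	mc := memcache.New("memcached:11211")
	mc.Timeout = 5 * time.Second // generous, so slow requests are measured rather than dropped

	var durations []time.Duration
	for i := 0; i < 1000; i++ {
		start := time.Now()
		_, _ = mc.Get("some-key") // ignore miss/error; only the latency matters here
		durations = append(durations, time.Since(start))
	}

	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })
	for _, p := range []int{50, 95, 99} {
		fmt.Printf("p%d: %v\n", p, durations[p*len(durations)/100])
	}
}
```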
* Set number of http and memcache connections to be equal
  - Use the MaxIdleConns parameter in the memcache library, to ensure that this many connections are available to use.
  - Set the number of http connections and memcache connections to be equal in the warmer. This is required because the number of http fetches cannot exceed the number of memcache connections; if it did, requests would fail with a memcache connection error.
  - Create a separate memcache client for warming and for a user reading images from the registry.
  - Fixes #675
* #687 review comments
  - Remove `memcached-connections` parameter
  - Reworded comments
  - Renamed variables
I see strings of log entries from the memcache client:
This also affects us in dev, and leads to automated services not being updated, and the UI showing no entries for services (because it either doesn't have a full complement of images, or cannot complete a query for all the images).