etcd3 - slow watchers and watcher count blow up #8387
Comments
Is there a custom application using etcd? It could be leaking watches. Client-side metrics may help with debugging; see https://godoc.org/github.com/coreos/etcd/clientv3 "Example (metrics)"
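For reference, a minimal sketch of wiring up client-side metrics along the lines of that "Example (metrics)" doc. The endpoint and metrics port are placeholders, and the exact clientv3.Config fields may differ between client versions:

```go
package main

import (
	"log"
	"net/http"

	"github.com/coreos/etcd/clientv3"
	grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Attach the gRPC Prometheus interceptors so per-RPC client metrics
	// (including Watch stream activity) are recorded.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"localhost:2379"}, // placeholder endpoint
		DialOptions: []grpc.DialOption{
			grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
			grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Expose the collected client metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil)) // placeholder port
}
```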
I think I got it, thanks!
BTW do I understand correctly that
Yes, of course. Our apps watch etcd for service discovery. So here is the main question -
Yep, thank you very much for your help! I will try to integrate the metrics, they will be useful I believe. Sorry for so many questions :)
The client keys streams by the ctx string serialization in 3.1; same ctx, same stream. It will only use grpc metadata from the ctx in 3.3. Separate Watcher objects will create separate streams.
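To illustrate that keying behaviour, a small sketch with a 3.1 client (the key names are made up and `cli` is an existing client): watches opened from the same Watcher with the same ctx share one gRPC stream, while a second Watcher opens its own stream.

```go
package watchstreams

import (
	"context"

	"github.com/coreos/etcd/clientv3"
)

// streamSharing shows how watches map onto gRPC watch streams in a 3.1
// client: watches created from the same Watcher with equivalent contexts are
// multiplexed onto one stream; a separate Watcher opens a separate stream.
func streamSharing(cli *clientv3.Client) {
	ctx := context.Background()

	w1 := clientv3.NewWatcher(cli)
	chA := w1.Watch(ctx, "foo") // opens one gRPC watch stream
	chB := w1.Watch(ctx, "bar") // reuses the same stream (same ctx, same Watcher)

	w2 := clientv3.NewWatcher(cli)
	chC := w2.Watch(ctx, "baz") // separate Watcher => separate stream

	_, _, _ = chA, chB, chC
}
```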
No. watch_stream_total is approximately the total number of watch grpc streams on the server. watcher_total is the sum of all watches across all watch grpc streams.
It's related to whether the watch has posted all events to its channel. If the server can't drain events to the client fast enough, the watch is classified as 'slow' and the handling reverts to a batching mode. See https://github.com/coreos/etcd/blob/master/mvcc/watchable_store.go#L184.
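Conceptually, the classification boils down to a non-blocking send on the watch channel. This is a simplified sketch with made-up types, not etcd's actual code; see the linked watchable_store.go for the real logic:

```go
package slowwatch

// watcher is a stand-in for etcd's internal watcher type.
type watcher struct {
	ch     chan []string // pending event batches for one watch
	synced bool          // false once the watcher falls behind
}

// notify sketches how a watcher becomes "slow": events are offered to the
// watch channel without blocking; if the channel is full, the watcher drops
// out of the synced group, and a background loop later resends the backlog
// in batches.
func notify(w *watcher, events []string) {
	select {
	case w.ch <- events:
		// the client is keeping up; the watcher stays synced
	default:
		// channel full: mark the watcher as slow/unsynced
		w.synced = false
	}
}
```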
Yes. See store/metrics.go and mvcc/metrics.go
Total events that have been posted to the watch channel but have not been sent over grpc. See mvcc/watchable_store.go and etcdserver/api/v3rpc/watch.go
No, the server resources should be released automatically on a compaction. If not, there's a bug.
Thanks for metrics explanation 👍
It seems they really were released on compaction in our case - we compact every 24 hours, and it runs sometime during the night. So OK, let's clarify. The server resources are released:
No other options? Why aren't they released when the server responds with some error (a compaction error in my example) and the watch channel is going to be closed? Is this a bug or a kind of undocumented feature?
I meant on a compaction error. If the clientv3 watch channel closes, then the resources should be freed automatically by the clientv3 watcher implementation.
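On the client side, a minimal sketch of that pattern (the key name and restart policy are illustrative): consume the watch channel, and when it closes - e.g. after a compaction error - simply open a new watch; the closed watch's client-side resources are cleaned up by the watcher implementation itself.

```go
package watchrestart

import (
	"context"
	"log"

	"github.com/coreos/etcd/clientv3"
)

// watchLoop keeps a watch on key alive across channel closes. When the server
// cancels the watch (e.g. the requested revision was compacted), the channel
// closes and a new watch is started from a usable revision.
func watchLoop(cli *clientv3.Client, key string) {
	rev := int64(0) // 0 means "start from the current revision"
	for {
		ctx, cancel := context.WithCancel(context.Background())
		for resp := range cli.Watch(ctx, key, clientv3.WithRev(rev)) {
			if resp.CompactRevision != 0 {
				// Our revision was compacted away; resume from the compact revision.
				rev = resp.CompactRevision
				break
			}
			if err := resp.Err(); err != nil {
				log.Printf("watch error: %v", err)
				break
			}
			for _, ev := range resp.Events {
				rev = ev.Kv.ModRevision + 1
				log.Printf("%s %q", ev.Type, ev.Kv.Key)
			}
		}
		cancel() // release this watch's client-side resources before retrying
	}
}
```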
OK, I'll check this case and tell you the results.
@mwf ping?
@heyitsanthony I'm deeply sorry, I had some other important stuff to do all these days :(
@heyitsanthony sorry for taking so long - I had some personal problems during vacation, but that doesn't matter. I played with the server and client code and couldn't reproduce it in the same way we hit it :( So it looks like someone is spamming the server in a for-loop with new Watch requests at tremendous speed, without canceling the old ones :( It doesn't look like a server bug, so I'm closing the issue; if I manage to find anything else, I will let you know.
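For anyone hitting the same symptom, a hedged sketch of the suspected anti-pattern versus the fix (the key name and handler are made up): opening new watches in a loop without canceling them grows watcher_total without bound while the stream count stays flat.

```go
package watchleak

import (
	"context"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// leakyWatches reproduces the suspected anti-pattern: a new watch on every
// iteration, never canceled. Each call registers another server-side watcher,
// so etcd_debugging_mvcc_watcher_total keeps climbing even though the number
// of gRPC watch streams stays flat.
func leakyWatches(cli *clientv3.Client) {
	for {
		_ = cli.Watch(context.Background(), "service/foo") // leaked watch
		time.Sleep(10 * time.Millisecond)
	}
}

// singleWatch is the non-leaking pattern: one long-lived watch bound to a
// cancellable context that is released when the caller is done with it.
func singleWatch(cli *clientv3.Client, handle func(clientv3.WatchResponse)) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	for resp := range cli.Watch(ctx, "service/foo") {
		handle(resp)
	}
}
```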
etcd server version 3.1.5
Hi guys!
We hit a strange issue yesterday, please take a look at these screenshots from Grafana:
Sorry, I had to clean up the host names, otherwise our security team would kill me :)
So, for some unknown reason the number of watchers started to grow and CPU usage went up with it. The count of watch streams remained stable, but some "slow watchers" appeared.
Could you please explain what a "watcher" (the etcd_debugging_mvcc_watcher_total metric) is in terms of the MVCC storage, how it relates to watch streams (I assume the watch stream count equals the number of client.Watch() calls?), and what could lead to such behaviour? The numbers are scary - 32 million watchers O___O
Everything stabilized by itself. There were several etcd restarts, the last one at 23:20. As you can see in the screenshot below, the number of watchers still grew after the restart, but then dropped several times and returned to normal o___O How did it resolve itself? What was the core problem?
Also, some services hit the Watch freeze issue - they couldn't receive key updates, just like in #7247 and other related issues you have.