Explore concurrency schemes to compensate for blocking ES calls #1161
Related issue #1016 suggested decoupling the main reconciliation loop from any ES interactions; it outlines abstracting the notion of an Elasticsearch cluster behind a potentially asynchronous API (draft PR #1116). In that context we discussed, as a potential low-level implementation of that idea, decoupling the main reconciliation loop from Elasticsearch requests using a task worker concept. Instead of requesting Elasticsearch directly, you would run a task, something like:

```go
type LogicalTime int

// Run triggers execution of the given task at logical time now with params ps.
func (ts Tasks) Run(task Task, now LogicalTime, ps ...Param)

// ConsumeResult reads the result generated at or after asOf with params ps;
// it returns nil if the result is not ready yet.
func (ts Tasks) ConsumeResult(task Task, asOf LogicalTime, ps ...Param) interface{} // not sure if we can do better?

// Peek is a non-destructive read of the generated result.
func (ts Tasks) Peek(task Task, asOf LogicalTime) interface{}
```
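A minimal runnable sketch of how such a task runner could work. It keys tasks by a string instead of a `Task` value for simplicity; the general shape follows the signatures above, but the implementation details (result map, one goroutine per in-flight task, deduplication of running tasks) are assumptions, not operator code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LogicalTime is a monotonically increasing counter, bumped once per
// reconciliation, that lets callers ask for "a result no older than t".
type LogicalTime int

// Task is any function whose result we want computed off the hot path,
// e.g. a blocking Elasticsearch request.
type Task func() interface{}

type result struct {
	at    LogicalTime
	value interface{}
}

// Tasks runs each task in its own goroutine and caches the latest result.
type Tasks struct {
	mu      sync.Mutex
	results map[string]result
	running map[string]bool
}

func NewTasks() *Tasks {
	return &Tasks{results: map[string]result{}, running: map[string]bool{}}
}

// Run triggers execution of task under key at logical time now, unless a run
// for that key is already in flight. It returns immediately.
func (ts *Tasks) Run(key string, now LogicalTime, task Task) {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.running[key] {
		return
	}
	ts.running[key] = true
	go func() {
		v := task() // the potentially slow, blocking call
		ts.mu.Lock()
		ts.results[key] = result{at: now, value: v}
		ts.running[key] = false
		ts.mu.Unlock()
	}()
}

// Peek is a non-destructive read: it returns the cached result if it was
// produced at or after asOf, and nil otherwise.
func (ts *Tasks) Peek(key string, asOf LogicalTime) interface{} {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if r, ok := ts.results[key]; ok && r.at >= asOf {
		return r.value
	}
	return nil
}

func main() {
	ts := NewTasks()
	ts.Run("health", 1, func() interface{} { return "green" })
	var v interface{}
	for v == nil { // the reconciliation loop would requeue instead of spinning
		v = ts.Peek("health", 1)
		time.Sleep(time.Millisecond)
	}
	fmt.Println(v)                    // green
	fmt.Println(ts.Peek("health", 2)) // <nil>: result predates asOf
}
```

Note that, like the observers, this spawns one goroutine per in-flight task, so capacity adjusts dynamically to the number of managed clusters.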
In a recent discussion we arrived at the conclusion that removing the blocking calls and introducing an async task API as suggested above are not necessary after all, for two reasons:
1. The resulting (synchronous) code should be conceptually much simpler than any asynchronous indirection of HTTP requests as suggested above.
2. The overall capacity of the controller should still be acceptable.

That being said, it is worth pointing out that the current (and in many ways problematic) observer implementation, as an example of asynchronous ES request handling, has one advantage over this approach: it dynamically adjusts to the number of ES clusters being handled. We start a goroutine for every cluster we manage in order to observe its state/health. In a similar fashion, the suggested task API would also adjust dynamically to the number of clusters being managed by spawning a goroutine for every asynchronous task. Compared to that, improving overall throughput by setting
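A minimal stdlib-only sketch of the throughput argument for parallel reconciliation: when each reconciliation contains a blocking call (simulated here by `time.Sleep`), a fixed pool of workers drains the work queue several times faster than a single worker. The function name, timings, and dummy clusters are all illustrative assumptions, not operator code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcileAll drains a queue of clusters with the given number of workers
// and reports how long the whole pass took. The Sleep stands in for a
// blocking ES call inside a single reconciliation.
func reconcileAll(clusters []string, workers int) time.Duration {
	start := time.Now()
	queue := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue {
				time.Sleep(10 * time.Millisecond) // blocking ES call stand-in
			}
		}()
	}
	for _, c := range clusters {
		queue <- c
	}
	close(queue)
	wg.Wait()
	return time.Since(start)
}

func main() {
	clusters := make([]string, 8) // 8 dummy clusters
	fmt.Println("1 worker: ", reconcileAll(clusters, 1))
	fmt.Println("4 workers:", reconcileAll(clusters, 4))
}
```

With 8 clusters and a 10 ms blocking call each, one worker needs roughly 80 ms per pass while four workers need roughly 20 ms; this is the effect that raising the controller's worker count would buy, without any asynchronous indirection in the reconciliation code itself.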
The observers seem to be the main bottleneck when managing a large number of Elasticsearch clusters. TLS negotiation and blocking network calls compound the effect as more clusters are added (see #357 (comment)). An interesting avenue to explore might be delegating observation work to sidecars, so that the operator only needs a single connection to each managed cluster (pull model) or a handful of connections to an event queue (push model).
Closing in favour of an issue to use parallel reconciliation.
There are several places where we make synchronous calls in the reconciliation loop. Sometimes these calls may not be necessary, but we need to make sure we correctly handle a stale state cache.
An example: https://github.com/elastic/cloud-on-k8s/pull/1160/files#r298192964
This is an optimization, not something absolutely necessary to fix.
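One way to "correctly handle a stale state cache" is the expectations pattern used by the Kubernetes controller-manager: after a write, record the generation you expect the cache to reach, and avoid acting on the cached object (or re-issuing the call) until the cache has caught up. The sketch below is a minimal, hypothetical illustration of that pattern; none of these names come from the operator:

```go
package main

import (
	"fmt"
	"sync"
)

// expectations records, per object key, the minimum generation the
// controller expects its cache to eventually reflect.
type expectations struct {
	mu  sync.Mutex
	min map[string]int64
}

func newExpectations() *expectations {
	return &expectations{min: map[string]int64{}}
}

// Expect is called after a write: the cache is considered stale for key
// until it observes at least generation gen.
func (e *expectations) Expect(key string, gen int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if gen > e.min[key] {
		e.min[key] = gen
	}
}

// Fresh reports whether the generation observed in the cache for key has
// caught up, i.e. whether it is safe to act on cached state instead of
// making a live (blocking) call.
func (e *expectations) Fresh(key string, observed int64) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return observed >= e.min[key]
}

func main() {
	e := newExpectations()
	e.Expect("ns/es-cluster", 7)             // we just wrote generation 7
	fmt.Println(e.Fresh("ns/es-cluster", 6)) // false: cache is stale, requeue
	fmt.Println(e.Fresh("ns/es-cluster", 7)) // true: safe to proceed
}
```

When `Fresh` returns false, the reconciliation would requeue rather than fall back to a synchronous call, which is the optimization suggested above.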