Explore concurrency schemes to compensate for blocking ES calls #1161
Related issue #1016 suggested decoupling the main reconciliation loop from any ES interactions; it outlines abstracting the notion of an Elasticsearch cluster behind a potentially asynchronous API (draft PR #1116). In that context we discussed, as a potential low-level implementation of that idea, decoupling the main reconciliation loop from Elasticsearch requests using a task worker concept. Instead of requesting Elasticsearch directly, you would run a task, something like:

```go
type LogicalTime int

// Run triggers execution of the given task at logical time now with params ps.
func (ts Tasks) Run(task Task, now LogicalTime, ps ...Param)

// ConsumeResult reads the result generated at or after asOf with params ps;
// it returns nil if the result is not ready yet.
func (ts Tasks) ConsumeResult(task Task, asOf LogicalTime, ps ...Param) interface{} // not sure if we can do better?

// Peek is a non-destructive read of the generated result.
func (ts Tasks) Peek(task Task, asOf LogicalTime) interface{}
```
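A minimal runnable sketch of how such a task runner could work. It keys tasks by a string instead of a `Task` value for simplicity; the general shape follows the signatures above, but the implementation details (result map, one goroutine per in-flight task, deduplication of running tasks) are assumptions, not operator code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// LogicalTime is a monotonically increasing counter, bumped once per
// reconciliation, that lets callers ask for "a result no older than t".
type LogicalTime int

// Task is any function whose result we want computed off the hot path,
// e.g. a blocking Elasticsearch request.
type Task func() interface{}

type result struct {
	at    LogicalTime
	value interface{}
}

// Tasks runs each task in its own goroutine and caches the latest result.
type Tasks struct {
	mu      sync.Mutex
	results map[string]result
	running map[string]bool
}

func NewTasks() *Tasks {
	return &Tasks{results: map[string]result{}, running: map[string]bool{}}
}

// Run triggers execution of task under key at logical time now, unless a run
// for that key is already in flight. It returns immediately.
func (ts *Tasks) Run(key string, now LogicalTime, task Task) {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if ts.running[key] {
		return
	}
	ts.running[key] = true
	go func() {
		v := task() // the potentially slow, blocking call
		ts.mu.Lock()
		ts.results[key] = result{at: now, value: v}
		ts.running[key] = false
		ts.mu.Unlock()
	}()
}

// Peek is a non-destructive read: it returns the cached result if it was
// produced at or after asOf, and nil otherwise.
func (ts *Tasks) Peek(key string, asOf LogicalTime) interface{} {
	ts.mu.Lock()
	defer ts.mu.Unlock()
	if r, ok := ts.results[key]; ok && r.at >= asOf {
		return r.value
	}
	return nil
}

func main() {
	ts := NewTasks()
	ts.Run("health", 1, func() interface{} { return "green" })
	var v interface{}
	for v == nil { // the reconciliation loop would requeue instead of spinning
		v = ts.Peek("health", 1)
		time.Sleep(time.Millisecond)
	}
	fmt.Println(v)                    // green
	fmt.Println(ts.Peek("health", 2)) // <nil>: result predates asOf
}
```

Note that, like the observers, this spawns one goroutine per in-flight task, so capacity adjusts dynamically to the number of managed clusters.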
In a recent discussion we arrived at the conclusion that removing the blocking calls and introducing an async task API as suggested above are not necessary after all, for two reasons:
1. The resulting (synchronous) code should be conceptually much simpler than any asynchronous indirection of HTTP requests as suggested above.
2. The overall capacity of the controller should still be acceptable.

That being said, it is worth pointing out that the current (and in many ways problematic) observer implementation, as an example of asynchronous ES request handling, has one advantage over this approach: it dynamically adjusts to the number of ES clusters being handled. We start a goroutine for every cluster we manage in order to observe its state/health. In a similar fashion, the suggested task API would also adjust dynamically to the number of clusters being managed by spawning a goroutine for every asynchronous task. Compared to that, improving overall throughput by setting
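A minimal stdlib-only sketch of the throughput argument for parallel reconciliation: when each reconciliation contains a blocking call (simulated here by `time.Sleep`), a fixed pool of workers drains the work queue several times faster than a single worker. The function name, timings, and dummy clusters are all illustrative assumptions, not operator code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// reconcileAll drains a queue of clusters with the given number of workers
// and reports how long the whole pass took. The Sleep stands in for a
// blocking ES call inside a single reconciliation.
func reconcileAll(clusters []string, workers int) time.Duration {
	start := time.Now()
	queue := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range queue {
				time.Sleep(10 * time.Millisecond) // blocking ES call stand-in
			}
		}()
	}
	for _, c := range clusters {
		queue <- c
	}
	close(queue)
	wg.Wait()
	return time.Since(start)
}

func main() {
	clusters := make([]string, 8) // 8 dummy clusters
	fmt.Println("1 worker: ", reconcileAll(clusters, 1))
	fmt.Println("4 workers:", reconcileAll(clusters, 4))
}
```

With 8 clusters and a 10 ms blocking call each, one worker needs roughly 80 ms per pass while four workers need roughly 20 ms; this is the effect that raising the controller's worker count would buy, without any asynchronous indirection in the reconciliation code itself.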
The observers seem to be the main bottleneck when managing a large number of Elasticsearch clusters. TLS negotiation and blocking network calls compound the effect as more clusters are added (see #357 (comment)). An interesting avenue to explore might be delegating observation work to sidecars, so that the operator only needs a single connection to each managed cluster (pull model) or a handful of connections to an event queue (push model).
Closing in favour of an issue to use parallel reconciliation.
There are several places where we make synchronous calls in the reconciliation loop. Sometimes these calls may not be necessary, but we need to make sure we correctly handle a stale state cache.
An example: https://github.com/elastic/cloud-on-k8s/pull/1160/files#r298192964
This is an optimization, not something absolutely necessary to fix.
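One way to "correctly handle a stale state cache" is the expectations pattern used by the Kubernetes controller-manager: after a write, record the generation you expect the cache to reach, and avoid acting on the cached object (or re-issuing the call) until the cache has caught up. The sketch below is a minimal, hypothetical illustration of that pattern; none of these names come from the operator:

```go
package main

import (
	"fmt"
	"sync"
)

// expectations records, per object key, the minimum generation the
// controller expects its cache to eventually reflect.
type expectations struct {
	mu  sync.Mutex
	min map[string]int64
}

func newExpectations() *expectations {
	return &expectations{min: map[string]int64{}}
}

// Expect is called after a write: the cache is considered stale for key
// until it observes at least generation gen.
func (e *expectations) Expect(key string, gen int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if gen > e.min[key] {
		e.min[key] = gen
	}
}

// Fresh reports whether the generation observed in the cache for key has
// caught up, i.e. whether it is safe to act on cached state instead of
// making a live (blocking) call.
func (e *expectations) Fresh(key string, observed int64) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return observed >= e.min[key]
}

func main() {
	e := newExpectations()
	e.Expect("ns/es-cluster", 7)             // we just wrote generation 7
	fmt.Println(e.Fresh("ns/es-cluster", 6)) // false: cache is stale, requeue
	fmt.Println(e.Fresh("ns/es-cluster", 7)) // true: safe to proceed
}
```

When `Fresh` returns false, the reconciliation would requeue rather than fall back to a synchronous call, which is the optimization suggested above.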