
Metas -> workers configuration propagation #91

Closed
slinkydeveloper opened this issue Feb 15, 2023 · 5 comments

slinkydeveloper commented Feb 15, 2023

We need to refine the mechanism for metas to propagate service configurations to workers.

slinkydeveloper added the needs-refinement label on Feb 15, 2023
slinkydeveloper commented:

I'm gonna expand this issue with the details about the two proposals:

  • eagerly pull configuration when service not known
  • lazily push configuration to workers

slinkydeveloper commented Mar 2, 2023

01/03/23 discussion

There are essentially two sub-problems where the metas -> workers configuration propagation mechanism has user-visible effects:

  • A partition processor (PP) communicating with a PP living in another worker process
  • Ingresses

How the propagation physically happens, be it pull or push, over TCP/HTTP/gRPC, is just an implementation detail, hidden behind some abstraction such as LocalMeta, NodeMeta, or something similar.
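
As a rough illustration of what such an abstraction could expose (the trait name, methods, and types below are hypothetical, not the actual Restate interface):

```rust
// Hypothetical sketch of the LocalMeta/NodeMeta abstraction: callers only see
// the locally available configuration, never how it was propagated.

/// Monotonically increasing revision of the global service configuration.
pub type ConfigurationRevision = u64;

/// Placeholder for whatever per-service configuration the components need
/// (key extraction instructions, Protobuf descriptors, etc.).
pub struct ServiceConfiguration;

/// Local view of the metadata, independent of the propagation mechanism
/// (pull or push, TCP/HTTP/gRPC).
pub trait LocalMeta {
    /// Revision of the configuration currently available in this process.
    fn current_revision(&self) -> ConfigurationRevision;

    /// Configuration of a service, if known at the current revision.
    fn service_configuration(&self, service_name: &str) -> Option<ServiceConfiguration>;
}
```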

Worker-to-worker

  • Every cross-worker communication is tagged with a "global configuration revision" number.
  • Every time a worker receives a message from a partition processor located in another worker, it checks this global revision number. If the local revision number is lower, it waits until an updated configuration is available (see the sketch after this list). This specific "wait" point can be implemented in a number of places, be it the network layer or the partition processor. At the moment we think the partition processor would be the right place, but it is also conceivable to have a third new component for that. In any case, it makes sense that we don't try to continue processing any additional messages from this worker with a newer global configuration revision than ours, given that most of those messages will likely require the new global configuration revision.
  • From this point onward, every component that needs the configuration can be synchronous and assume the configuration is available in the local process, accessible through some ad-hoc interfaces provided by the LocalMeta/NodeMeta.
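
A minimal sketch of that wait point, assuming the local revision is published through a tokio watch channel updated by LocalMeta/NodeMeta; all names are illustrative, not the actual implementation:

```rust
// Minimal sketch of the "wait for a newer configuration" point. Assumes the
// local revision is published over a tokio watch channel; names are illustrative.
use tokio::sync::watch;

/// Revision number tagged on every cross-worker message.
pub type ConfigurationRevision = u64;

/// A message coming from a partition processor in another worker process.
pub struct CrossWorkerMessage {
    /// Global configuration revision the sending worker was at.
    pub configuration_revision: ConfigurationRevision,
    pub payload: Vec<u8>,
}

/// Suspend until the local configuration has caught up with the revision
/// carried by the message; afterwards the partition processor can assume the
/// configuration is available in the local process.
pub async fn await_configuration_alignment(
    local_revision: &mut watch::Receiver<ConfigurationRevision>,
    msg: &CrossWorkerMessage,
) {
    loop {
        // borrow_and_update marks the current value as seen, so changed() below
        // waits for the next revision bump instead of returning immediately.
        if *local_revision.borrow_and_update() >= msg.configuration_revision {
            return;
        }
        local_revision
            .changed()
            .await
            .expect("configuration revision publisher dropped");
    }
}
```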

Ingress

The ingress needs the service configuration for:

  • Extracting the key
  • Transcode JSON to Protobuf
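
For illustration, these two lookups could sit behind an interface along the following lines; the names and error variants are made up for the example:

```rust
// Illustrative only: the two schema lookups the ingress needs. Names, types and
// error variants are hypothetical, not the actual Restate interface.

/// Opaque key identifying the target service instance.
pub struct ServiceKey(pub Vec<u8>);

/// Failures the ingress has to map to an HTTP response.
pub enum SchemaError {
    /// Service or method unknown at the current configuration revision (-> 404).
    NotFound,
    /// The JSON payload does not match the expected Protobuf schema (-> 400).
    InvalidPayload(String),
}

/// What the ingress asks of the locally available service configuration.
pub trait IngressSchemas {
    /// Extract the key of the target service instance from the request payload.
    fn extract_key(
        &self,
        service: &str,
        method: &str,
        payload: &[u8],
    ) -> Result<ServiceKey, SchemaError>;

    /// Transcode a JSON body into the Protobuf request expected by the method.
    fn transcode_json_to_protobuf(
        &self,
        service: &str,
        method: &str,
        json: &[u8],
    ) -> Result<Vec<u8>, SchemaError>;
}
```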

There are essentially two solutions here, each with its own pros and cons (the second bullet below is a variant of the first):

  • When the service cannot be found, or when the transcoding json to protobuf fails, wait to receive the next global configuration revision.
    • Pro: better user experience, as the user will never get a 404 or a transcoding failure after registering a service.
    • Cons: The blocking behaviour makes us susceptible to denial of service attacks, as one can easily fill the ingress global concurrency semaphore by sending empty requests with random service names, or alternatively correct service and method names, but wrong payloads.
  • (A slightly different variant of the above case) When the service cannot be found, or when the transcoding json to protobuf fails, ask the meta if there is a new service revision for that particular service.
    • Pro: Same as above, but simpler to implement (?)
    • Cons: More susceptible to denial of service attacks, and requires some implementation complexity (e.g. batching) to avoid overloading metas as well
  • Never wait on a configuration update, simply fail with the appropriate error (404, 400, etc.). When registering a service with the meta, optimistically try to respond to the registration only after every known worker has received the propagated configuration.
    • Pro: Fail fast, doesn't suffer from denial of service attacks
    • Cons: There are some cases, such as when workers are suffering a network partition, where the user might get a spurious failure (e.g. a 404 for a recently registered service)

It is conceivable that we support both solutions, and make this behaviour configurable.
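
If we do end up supporting both, the knob could be as simple as an enum in the ingress options; the option name and variants below are purely illustrative:

```rust
// Hypothetical configuration knob for the ingress behaviour described above;
// the name and variants are illustrative, not actual Restate options.
use serde::Deserialize;

/// How the ingress reacts when a service/method is unknown or transcoding fails.
#[derive(Debug, Clone, Copy, Deserialize)]
#[serde(rename_all = "kebab-case")]
pub enum UnknownServicePolicy {
    /// Block the request until a newer configuration revision arrives
    /// (better UX, but exposes the concurrency semaphore to abuse).
    WaitForNewerConfiguration,
    /// Fail fast with 404/400 and rely on registration-time propagation.
    FailFast,
}
```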

slinkydeveloper commented:

With the recent updates on retries, including #495, I think this discussion becomes much simpler: now we can let the resolution fail when there is a configuration misalignment between workers, and that error will simply be retried.
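
In other words (hypothetical names, just to illustrate the idea), a configuration misalignment surfaces as a plain error that the existing retry machinery treats as transient:

```rust
// Illustration only: configuration misalignment becomes an ordinary error that
// the existing retry logic handles; names are hypothetical.
#[derive(Debug)]
pub enum ResolutionError {
    /// The local configuration revision is older than the one the peer referenced;
    /// retrying after the configuration catches up is expected to succeed.
    ConfigurationMisaligned { required: u64, local: u64 },
}

impl ResolutionError {
    /// Whether the retry layer should retry the operation.
    pub fn is_retryable(&self) -> bool {
        matches!(self, ResolutionError::ConfigurationMisaligned { .. })
    }
}
```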

slinkydeveloper commented:

@tillrohrmann seems I can close this now.

tillrohrmann added a commit to tillrohrmann/restate that referenced this issue May 16, 2024
…361ef..1898426f

1898426f Update min/max of protocol version fields in endpoint_manifest_schema
08871e0b Introduce service.Version and discovery.Version enums
2f3e461f Workflow api changes (restatedev#91)
1fa71a5b Add versioning info (restatedev#90)

git-subtree-dir: crates/service-protocol/service-protocol
git-subtree-split: 1898426fc98c16d704068594cd54394912845ff7