Invoker concurrency quota #548
Conversation
…oal of this commit is to make testing the state machine logic more manageable. Also make sure we unreserve the quota when we abort an invocation task.
…plify the structure. Move quota to a separate file.
Removed Codec from Invoker (this was unused)
force-pushed from 4ca6b4f to a0df62a
Thanks for creating this PR @slinkydeveloper. The changes make sense and look good to me. I had some minor comments that we can resolve before merging this PR.
self.quota.unreserve_slot();
self.status_store.on_end(&partition, &service_invocation_id);
It seems that we unreserve whenever we end an invocation. Currently, we reserve a slot and only later call InvocationStatusStore.on_start. Maybe there is some symmetry here and we could combine them so that future changes don't run the risk of forgetting to call one of them (e.g. whenever we start an invocation we reserve a slot, and whenever we end an invocation we unreserve it).
As a side effect, if we always start when reserving a slot, then InvocationStatusStore.on_failure wouldn't be called without on_start having been called first.
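A minimal sketch of the symmetry being suggested, using stub types in place of the real invoker components; reserve_slot, unreserve_slot, on_start, and on_end are the names from the diff and discussion above, while start_invocation and end_invocation are hypothetical wrappers invented for this sketch.

```rust
// Stub types standing in for the real invoker components (sketch only).
struct Quota {
    available: usize,
}

impl Quota {
    /// Try to take a slot; returns `false` when the concurrency limit is hit.
    fn reserve_slot(&mut self) -> bool {
        if self.available == 0 {
            return false;
        }
        self.available -= 1;
        true
    }

    fn unreserve_slot(&mut self) {
        self.available += 1;
    }
}

struct StatusStore;

impl StatusStore {
    fn on_start(&mut self, _partition: u64, _sid: &str) { /* mark in_flight = true */ }
    fn on_end(&mut self, _partition: u64, _sid: &str) { /* drop the status entry */ }
}

struct Invoker {
    quota: Quota,
    status_store: StatusStore,
}

impl Invoker {
    // Pairing the quota call with the status-store call in one place means
    // future changes cannot forget one of the two.
    fn start_invocation(&mut self, partition: u64, sid: &str) -> bool {
        if !self.quota.reserve_slot() {
            return false;
        }
        self.status_store.on_start(partition, sid);
        true
    }

    fn end_invocation(&mut self, partition: u64, sid: &str) {
        self.quota.unreserve_slot();
        self.status_store.on_end(partition, sid);
    }
}
```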
I think it is not easy to refactor as you propose:
- We reserve a slot when we get the InvokeCommand, and we unreserve it whenever this invoker will "forget" about this sid, meaning it removes the sid from the InvocationStateMachines.
- InvocationStatusStore.on_start signals when we actually start the invocation task, setting in_flight to true and the last executed time to now. In particular, when coming from a retry, we need to "flip" the status from in_flight = false to in_flight = true and reset the last executed time to now. Meaning that after reserving the quota once, we may invoke InvocationStatusStore.on_start several times in case of retries (see the sketch below).

The only case where it may make sense to unify is on_end and unreserve_slot, as they're both triggered when the invoker is "removing" the given sid from its internal state. But TBH I prefer to keep them separate, as it makes more explicit what's going on. We can revisit this later if we want to.
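To make the retry behaviour described above concrete, here is a minimal sketch of the per-invocation status; in_flight and the last executed time come from the comment, while the struct itself is illustrative and not the actual InvocationStatusStore internals.

```rust
use std::time::SystemTime;

// Sketch of the per-invocation status: `on_start` may run several times for
// the same sid (once per retry attempt), flipping `in_flight` back to true and
// resetting the last start time, while the quota slot was reserved only once,
// when the InvokeCommand arrived.
#[derive(Default)]
struct InvocationStatus {
    in_flight: bool,
    last_start: Option<SystemTime>,
}

impl InvocationStatus {
    fn on_start(&mut self) {
        self.in_flight = true;
        self.last_start = Some(SystemTime::now());
    }

    fn on_failure(&mut self) {
        // A retry timer is now pending; the invocation is no longer in flight,
        // but the quota slot stays reserved until the sid is removed.
        self.in_flight = false;
    }
}
```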
Alright. I guess to make it work the status handling needs to change slightly: we create the InvocationStateMachine when allocating the quota and release the quota when releasing the state machine. That would bind the InvocationStateMachine to the lifecycle of its slot/quota reservation. Nothing we have to do right now.
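A minimal sketch of what binding the reservation to the state machine's lifecycle could look like, assuming a hypothetical RAII-style SlotReservation guard (not an existing type in the codebase) that returns the slot when dropped:

```rust
use std::sync::{Arc, Mutex};

struct Quota {
    available: Mutex<usize>,
}

struct SlotReservation {
    quota: Arc<Quota>,
}

impl SlotReservation {
    /// Try to take a slot; `None` when the concurrency limit is reached.
    fn try_new(quota: &Arc<Quota>) -> Option<SlotReservation> {
        let mut available = quota.available.lock().unwrap();
        if *available == 0 {
            return None;
        }
        *available -= 1;
        Some(SlotReservation { quota: Arc::clone(quota) })
    }
}

impl Drop for SlotReservation {
    fn drop(&mut self) {
        // The slot is returned automatically when the owner is dropped.
        *self.quota.available.lock().unwrap() += 1;
    }
}

// The state machine owns its reservation, so removing it from the invoker's
// map of InvocationStateMachines releases the slot without an explicit call.
struct InvocationStateMachine {
    _slot: SlotReservation,
    // ... invocation-specific state ...
}
```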
This could potentially resolve another asymmetry, which is that if we fail in start_invocation_task, then we need to create an InvocationStateMachine just to call handle_error_event. Moving the concern of creating this state machine out of start_invocation_task could simplify the control flow a tad bit.
The problem is that to create the state machine I need the endpoint metadata first.
Maybe this could become part of the state variants that have a resolved endpoint. I could see that the endpoint metadata might change across restarts if there is a newer endpoint version and no committed journal entries yet.
I could see that the endpoint metadata might change across restarts if there is a newer endpoint version and no committed journal entries yet.
I think we should keep this "relaxed" and continue to do what we do today: we don't "fix" the endpoint after resolving it, but re-resolve on every restart, to allow the user to "fix it" even if there are committed journal entries. This is especially useful if we allow some "configuration options" of the endpoint to be mutable. A case for this could be, for example, that you have made some changes to the gateway between restate and the service endpoint, perhaps requiring a new header, but you have not yet configured that header in the restate "endpoint". By adding the header to the endpoint configuration, the runtime will be able to access the service endpoint again.
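Purely for illustration, a sketch of why re-resolving on every restart lets such a change take effect; all type and field names here are made up for the example and are not the actual Restate types.

```rust
use std::collections::HashMap;

// Illustrative only: the invocation task looks up the endpoint metadata again
// on every (re)start instead of "fixing" it at first resolution, so a header
// added to the endpoint configuration after a failure is picked up on the
// next attempt.
#[derive(Clone)]
struct EndpointMetadata {
    address: String,
    additional_headers: HashMap<String, String>,
}

struct EndpointRegistry {
    endpoints: HashMap<String, EndpointMetadata>,
}

impl EndpointRegistry {
    // Called on every restart of the invocation task, returning the current
    // (possibly updated) metadata rather than a snapshot taken at first start.
    fn resolve(&self, service: &str) -> Option<EndpointMetadata> {
        self.endpoints.get(service).cloned()
    }
}
```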
force-pushed from bb8d8f3 to 32b6f58
force-pushed from 32b6f58 to a8222b5
LGTM. +1 for merging.
Fix #540, introducing a configurable concurrency limit in the invoker. In order to be able to unit test this feature, I had to make quite a few changes to allow mocking. Hopefully this will be useful in the future anyway, to test the invoker state machine transitions in an isolated manner.