-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Properly support task batches in ReservedStateUpdateTaskExecutor #116353
Conversation
assertThat("State should be the final state", newState, sameInstance(state2)); | ||
// Only process the final task; the intermediate ones can be skipped | ||
verify(task1, times(0)).execute(any()); | ||
verify(task2, times(1)).execute(any()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I verified that this fails without the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the basic idea yeah, I left a handful of suggestions tho.
return initState; | ||
} | ||
// Only the last update is relevant; the others can be skipped | ||
var taskContext = taskContexts.get(taskContexts.size() - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure the tasks are submitted in order, such that the last one in the list is always the best one to apply? Deserves a comment and an assertion I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... is there some other way to define the "order" of tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure, but I understood that file-based settings have some notion of version (carried in the file) to ensure we're not overriding fresh settings with stale ones. I believe it relates to task.stateChunk.metadata().version()
, although I haven't tracked down the details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, task.stateChunk.metadata().version()
will give you the version of the file-settings file this update task was created for.
This is then compared against the last-processed version stored in metadata.
We should pick the update task with the highest version as @DaveCTurner suggests, unless someone from core infra confirms that the tasks are already guaranteed to be "well-ordered".
try (var ignored = taskContext.captureResponseHeaders()) { | ||
var task = taskContext.getTask(); | ||
clusterState = task.execute(clusterState); | ||
taskContext.success(() -> task.listener().onResponse(ActionResponse.Empty.INSTANCE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to mark all the other tasks as having succeeded too. There's an assertion to check this in the MasterService
, so I guess we're missing a test that runs this executor in a sufficiently realistic setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Indeed. I'm not a huge fan of mocks TBH because it's so hard to tell what you're actually testing.</rant>)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you could perhaps test this with an ESSingleNodeTestCase
to get something more realistic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it'd be good to have a test that tries to create the same repository twice in a batch in order to demonstrate the NPE is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly RepositoriesFileSettingsIT
is a good place or FileSettingsServiceIT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking out loud on how this might look in an IT:
In RepositoriesFileSettingsIT
, we could get a FileSettingsService
instance via internalCluster().getInstance(FileSettingsService.class, masterNode)
and manually invoke processFileOnServiceStart()
multiple times concurrently. That would result in multiple concurrent and identical state update tasks, though I'm not sure it would suffice to guarantee that they'd all be part of the same batch.
Based on the MasterService#createTaskQueue
Javadoc, "Tasks submitted to the same queue (while the master service is otherwise busy) will be batched together into a single cluster state update." -- I'm not sure if there's a simpler way in tests to force batching, aside from ensuring the master is busy with another task when we call processFileOnServiceStart
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In an ESIntegTestCase
or an ESSingleNodeTestCase
you can force batching by first blocking the master service and then doing whatever is needed to submit the tasks you want to be batched, then unblocking the master service again:
masterNodeClusterService.createTaskQueue("block", Priority.NORMAL, batchExecutionContext -> {
safeAwait(barrier);
safeAwait(barrier);
batchExecutionContext.taskContexts().forEach(c -> c.success(() -> {}));
return batchExecutionContext.initialState();
}).submitTask("block", ESTestCase::fail, null);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update - I think I have confirmed everything below now. No need to read it in detail.
Hi @DaveCTurner - I've finally come back to this. I'm trying to understand your advice here. Looks like the code above is similar to this code:
masterNodeClusterService.createTaskQueue("block", Priority.NORMAL, batchExecutionContext -> { safeAwait(barrier); safeAwait(barrier); batchExecutionContext.taskContexts().forEach(c -> c.success(() -> {})); return batchExecutionContext.initialState(); }).submitTask("block", ESTestCase::fail, null);
This seems to be a trick to block
masterNodeClusterService
until the other test code encounters the barrier twice. You put whatever code you want to run during the block between the two barriers and it's guaranteed to run after this lamba begins but before it does theforEach
.As far as I can tell, the rest is just "legalese" required to make this trick work, and I can pretty much ignore it.
Am I close?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok new question... I'm not sure how to trigger this condition in the integration test. I tried bad JSON but that fails in a different way: it never seems to make it to ReservedStateUpdateTaskExecutor
because it dies earlier, parsing the JSON.
I wonder how I can feed this valid JSON that parses, but then somehow the ReservedStateUpdateTask
still fails? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Welp I guess I can at least test the happy path. Submit N changes and verify that the last one takes effect.
clusterState = task.execute(clusterState); | ||
taskContext.success(() -> task.listener().onResponse(ActionResponse.Empty.INSTANCE)); | ||
} catch (Exception e) { | ||
taskContext.onFailure(e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the last task fails, and there are earlier tasks in the list, should we try them instead of completely giving up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to try multiple tasks? I thought the motivation for this PR was to avoid doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the task failed then the cluster state remains unchanged, so yes in that case it's safe to try a different one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we're effectively trying them in reverse until one succeeds.
Makes sense I guess, though it seems a little odd. I wonder if fixing it so they can apply in order would be more going "with the grain".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's also a possibility but to achieve that you'd have to go through all the ClusterStateTaskExecutor
implementations to which the reserved-state subsystem delegates checking for batchability problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every failed task will also kick off a separate task to update error state in reserved cluster state metadata, and these error state updates are also subject to the same version checks. So if we have three tasks with versions (v3
highest, v1
lowest):
(t3, v3), (t2, v2), (t1, v1)
We'd apply t3
first. Suppose it fails. Then error state is marked with version v3
. When we process t2
, if it succeeds, we clear the error regardless of version, so we're fine. But if it fails, I think we will skip updating the error state because v3
> v2
.
The code for error handling is in several places, but see here for example
I wonder if we should simply try to apply the highest version task for now, then follow up with potential improvements (like trying older tasks, or instead trying the queued tasks in reverse order).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not my understanding. If we fail to update state t3
then we spawn a separate task (or batch thereof) that will run after the current batch is complete, so when we process t2
there's no error state to clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are very likely right, but let me just write out what the flow looks like:
- we submit the state update task here
- the update task takes an error-handler (
ReservedClusterStateService.this::updateErrorState
) - when invoked, that error handlers submits an error state update task (to a separate executor, with a separate queue)
- Inside the cluster-state update task
execute
method, if we run into errors, we trigger the error handler and throw an exception
Given that error update tasks go to a separate executor and task queue, is it still true that they're guaranteed to run after we've processed tasks in the non-error state update queue?
That is, can't it happen that we finish task t3
, which kicks off error task t3_error
, which then completes before we get to task t2
? I'm probably just misunderstanding what order guarantees we have for cluster-state update tasks stored in separate queues and handled by separate executors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We're already processing a fixed batch of tasks here, there's no way to add more tasks to it. Even if the error-update tasks went to the same executor, they'd be added to a new batch of tasks, allowing us to complete the processing of the current batch of tasks undisturbed.
@DaveCTurner thanks for clarifying! The piece I was missing conceptually is that the execution of batches is also ordered so while we're processing one batch, we won't start processing another batch in parallel.
@prdoyle all good to ignore my comments 👍
Yes. We're already processing a fixed batch of tasks here, there's no way to add more tasks to it. Even if the error-update tasks went to the same executor, they'd be added to a new batch of tasks, allowing us to complete the processing of the current batch of tasks undisturbed. |
Alright I think I can do all of the above, though I'd gratefully accept some help with the tests. 😅 |
writeJSONFile(masterNode, testJSON, versionCounter, logger); // Valid but skipped | ||
writeJSONFile(masterNode, testJSON43mb, versionCounter, logger); // The last valid setting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we sometimes try 3, I have a hunch that might cover some logic that the 2 case doesn't.
Could we give them different version
values? And not always apply them in ascending order?
Could we wait for the master to pick up one file before writing the next?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good ideas!
Waiting for the master to pick up one file before writing the next would exercise existing functionality, rather than my changes, but it's obviously an important case to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the expected behaviour when the versions are not ascending?
Should ReservedStateUpdateTaskExecutor
be attempting them in descending version order rather than just iterating backward through the batch list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guiding principle is that the expected outcome should be the same whether the tasks are executed in a single batch or one-at-a-time. So yes AIUI I think we should attempt them in descending version order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah of course, that makes perfect sense. I'll make that change (on the assumption that updates with lower version numbers are ignored if processed one-per-batch).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I found this javadoc for TaskContext
:
A task to be executed, along with callbacks for the executor to record the outcome of this task's execution. The executor must call exactly one of these methods for every task in its batch.
It seems we can't simply skip the intervening tasks after all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have to call success
(or onFailure
) to record the outcome of the task, but that doesn't mean you have to actually do anything else with the task. According to the guiding principle, skipping a task because we processed a newer one counts as success right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright makes sense. So perhaps:
- Scan the tasks to identify the one that would have run last, had the tasks all been run individually instead of in a batch, taking into consideration task ordering, version numbers, and the
ReservedStateVersionCheck
mode. - Try that task. If it succeeds, call
success
on that one and all prior ones, and possibly on all following ones? - If it fails, call
onFailure
, remove it from the candidates list, and loop back to step 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds right although maybe it'd be simpler just to sort the list of tasks in the right order ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was afraid I couldn't determine the right order before I know whether the first fails. But now that I think about it, if the first task succeeds, the rest are irrelevant, so I could sort them on the assumption that each one fails.
Latest problem: if I overwrite the file multiple times in succession, there's no guarantee that the watcher registered by |
That's why I suggested waiting for the master to pick up each change and enqueue the relevant task before overwriting the file again. |
Oh I see what you meant. I thought you meant they'd end up in different batches. |
This will be O(n log n) for long lists, while the original way will be O(mn) where m is the number of failing tasks. None of this matters much for performance because n is expected to be small; but it does simplify the code and make it smell less imperative.
Ok, I have implemented the precedence logic we discussed, used sorting over the tasks, and added unit tests to ensure all the tasks are marked completed. All that remains, I think, is to change the IT so it forces multiple file changes to appear in the same batch, in order to prove the original NPE (which motivated this PR) is gone. @DaveCTurner - do you happen to have any tips for how to achieve that? I assume I can use another barrier in a strategic spot to make the file watcher and unit test wait for each other? |
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Assuming there's no easy way to wait for the file watcher to run, I think I'd busy-wait, checking See e.g. |
I might need a way to wait for the watcher anyway. The way I've written the IT tests, there's no guarantee all my changes have taken effect before the assertions run. It sometimes results in this:
|
Trying to wait for the watcher ended up deadlocking due to this line, where the watcher waits for the submitted task to complete before processing the next file. This makes me wonder how it's even possible to have a batch containing more than one file update. It seems like |
...but then if that is true, how do I explain my previous comment... 🤔 Update: I think that's a separate issue. That failure is because the test case does not wait for the updates to be processed, which is different from the updates being processed one-per-batch. |
@DaveCTurner - can I get a sanity check here? I think this whole PR might be unnecessary, and the problem could be deeper. File changes are processed on the
This makes it impossible for a second file update task to be created until the first is finished, which implies that it's impossible for two update tasks to be in the same batch. This would imply that what we have observed is impossible, unless more than one thread submits update tasks. I could see this happening in two scenarios:
It seems to be that these are both undesirable, and if we fix those, we no longer need to worry about multiple updates per batch. The first could be fixed with appropriate synchronization / coordination. The second could be fixed by using a singleton executor instead of directly using What do you think? |
Ah interesting, I hadn't looked in much detail at the calling code, I didn't realise it was doing stuff with bare So yes another possibility would be to ensure that there's only ever at most one of these tasks in the master queue at once. Note that today's implementation doesn't guarantee this, because the wait for the task to complete may be interrupted if the node stops being master. It's then possible for the node to become master again straight away, start up a new watcher thread and enqueue another task all while the old task is still in the queue. Instead we must always wait for the task's listener to complete. |
Thanks @DaveCTurner. What would be the Elasticsearch Way to implement this? Is it a good idea to change it? On other projects, I've used |
Normally we would use a threadpool. But we don't want that for file settings because we don't want to be queued up behind other system operations. Essentially this is like having a dedicated fixed size threadpool of size 1. |
Ok I think I'd rather actually use a bina fide dedicated thread pool of size 1 then, so it doesn't accidentally end up with two threads causing otherwise impossible race conditions. |
I'm still not sure we should use ThreadPool. That comes with baggage (eg, size configurable by user through elasticsearch.yml). This is an implementation detail. We should make it robust, but IMO we should do so without ThreadPool. |
Ok makes sense. I really meant lowercase "thread pool" as a general concept. So if you prefer to avoid |
Yes, we do create a dedicated executor in other places - see e.g. |
@DaveCTurner - what if I change |
New PR #118112 that aims to prevent batching rather than cope with it. FYI @DaveCTurner @rjernst |
Superseded by #118112 |
ReservedStateUpdateTaskExecutor
had been extendingSimpleBatchedExecutor
and looping through every state update. Not only was this unnecessary, since only the final state matters; it was also leading toNullPointerException
in certain cases.Here, we simplify the processing to execute only the last task.