-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Introduce a "starting" datafeed state for lazy jobs #53918
[ML] Introduce a "starting" datafeed state for lazy jobs #53918
Conversation
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Relates elastic#53763
Pinging @elastic/ml-core (:ml) |
* `stopping`: The {dfeed} has been requested to stop gracefully and is | ||
completing its final action. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stopping
state is not introduced by this PR - it has existed for a couple of years. But I thought I'd add it while I was modifying the list.
(This PR exposes the starting
state externally for the first time, so it's understandable it wasn't previously documented.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just left a question.
@@ -554,6 +554,7 @@ public boolean test(PersistentTasksCustomMetaData.PersistentTask<?> persistentTa | |||
|
|||
// This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation | |||
if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) { | |||
opened = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to return "opened": true
even though the job is not in the opened
state yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I changed my mind on - look at the difference between the two commits on this PR to see.
If we return "opened" : false"
for lazy jobs then the UI as it stands won't start the datafeed - see https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L84-L97 and https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L57-L59
You could say that the UI should be changed to accept "opened" : false
as an acceptable response to start the datafeed if the job is a lazy job, but that's not great becasue:
- Currently that part of the UI source code does not have access to the job configs
- There are two ways of specifying laziness (the global lazy nodes setting and the per-job laziness setting), and even if the UI were changed to check the per-job setting it couldn't make a sensible decision on the global lazy nodes setting
So, the response from the open job endpoint has to be sufficient for the UI to answer the question "is it appropriate to start the datafeed?"
I think there are two options:
- Leave the response with a single
opened
field that basically answers that question and change the docs to say that - Revert the second commit on this PR and instead add a second field,
"awaiting_lazy_assignment": true
, to all three responses (i.e. open anomaly detection job, start datafeed, start data frame analytics job) and change the UI to||
the two booleans for the result of itsopenJob()
function
Interestingly start data frame analytics currently does return true
for lazily assigned jobs - see
Line 404 in d2740c2
listener.onResponse(new AcknowledgedResponse(true)); |
Which do you prefer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest a 3rd option?
We can stick to interpreting "opened" and "started" as "it is now opened|started".
The UI could simply check the status code for 200
to interpret "the request to open|start was successful" as it doesn't care about whether it is already "opened|started" by the time it receives the response.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sure, that is another option. I think that would be a trivial change in the UI code because already >= 400 status codes go through a different control path so it would basically be removing a single if
.
I noticed that docs currently don't mention the possibility that opened
could be false
; it's currently documented that opened
is always true
, like an acknowledged
response but with the field name changed. So doing it this way would mean updating the docs (which is not a problem if we go this way).
The final thing to consider is that starting a data frame analytics job does currently return an acknowledged response with a value hardcoded to true
. So if it's important for the response to say whether the job was lazily opened then maybe in a followup PR we should change data frame analytics to return started
with a true
or false
value depending on whether it was lazily assigned for consistency?
None of these options are particularly difficult to implement so I will wait until Monday morning once others have had a chance to comment before making any more code changes to avoid unnecessary churn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. Note that I don't feel strongly about whether opened
is true when the job started lazily. I just wonder if it could be misleading. If we conclude that we're not concerned with this, then the simplest thing is to do what the PR is doing already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parity with analytics and anomaly is useful, as is treatment for the UI.
So opened: true
plus a docs change seems acceptable.
Is there a possibility to add a second response status:opening
?
We document that opened: true|false
means that the instruction to open the job has been accepted (I suspect we would use acknoweldged:true
if we were designing today) . And we can use the status
to explain if this is happening immediately or slowly. Just a thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we would use
acknoweldged:true
if we were designing today
Yes, I think the least BWC-breaking way we can move towards that is to always have opened: true
and that is what is documented (the documentation currently says the value is always true
).
Is there a possibility to add a second response
status:opening
?
This is quite similar to the awaiting_lazy_assignment: true
of my second option (although in my second option I was proposing to switch back to opened:false
as well, which I think we are now agreeing is not the way to go).
I prefer awaiting_lazy_assignment: true
over status:opening
because if I was an end user and saw:
{
"opened": true,
"status": "opening"
}
then I would probably have to read the docs to work out how to interpret that. Also:
{
"opened": true,
"status": "opened"
}
is what most users would see, and that looks like it contains redundancy.
Whereas:
{
"opened": true,
"awaiting_lazy_assignment": true
}
and:
{
"opened": true,
"awaiting_lazy_assignment": false
}
seem easier to interpret without reading the docs.
So unless anyone strongly objects I will eventually implement that. But I'd rather get this PR merged ASAP since it fixes a nasty bug and then add awaiting_lazy_assignment
across both job types and datafeeds in a different PR. It will touch quite a few files (since it means HLRC and docs changes for 3 endpoints) and will obscure the main fix in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, let's merge PR to fix the bug.
And let's treat the second field as a separate discussion for a future release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #54067 to discuss what to do in 7.8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Backport of elastic#53918
It is possible for ML jobs to open lazily if the "allow_lazy_open" option in the job config is set to true. Such jobs wait in the "opening" state until a node has sufficient capacity to run them. This commit fixes the bug that prevented datafeeds for jobs lazily waiting assignment from being started. The state of such datafeeds is "starting", and they can be stopped by the stop datafeed API while in this state with or without force. Backport of #53918
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true. Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.
This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started. The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.
Fixes #53763