Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Introduce a "starting" datafeed state for lazy jobs #53918

Merged
merged 5 commits into from
Mar 24, 2020

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Mar 21, 2020

It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true. Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started. The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Fixes #53763

It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true.  Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started.  The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Relates elastic#53763
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

Comment on lines +1393 to +1394
* `stopping`: The {dfeed} has been requested to stop gracefully and is
completing its final action.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stopping state is not introduced by this PR - it has existed for a couple of years. But I thought I'd add it while I was modifying the list.

(This PR exposes the starting state externally for the first time, so it's understandable it wasn't previously documented.)

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just left a question.

@@ -554,6 +554,7 @@ public boolean test(PersistentTasksCustomMetaData.PersistentTask<?> persistentTa

// This means we are awaiting a new node to be spun up, ok to return back to the user to await node creation
if (assignment != null && assignment.equals(JobNodeSelector.AWAITING_LAZY_ASSIGNMENT)) {
opened = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to return "opened": true even though the job is not in the opened state yet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I changed my mind on - look at the difference between the two commits on this PR to see.

If we return "opened" : false" for lazy jobs then the UI as it stands won't start the datafeed - see https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L84-L97 and https://github.com/elastic/kibana/blob/33af1c154beeef8a80bb2f2b3279ab62632ab664/x-pack/plugins/ml/server/models/job_service/datafeeds.ts#L57-L59

You could say that the UI should be changed to accept "opened" : false as an acceptable response to start the datafeed if the job is a lazy job, but that's not great becasue:

  1. Currently that part of the UI source code does not have access to the job configs
  2. There are two ways of specifying laziness (the global lazy nodes setting and the per-job laziness setting), and even if the UI were changed to check the per-job setting it couldn't make a sensible decision on the global lazy nodes setting

So, the response from the open job endpoint has to be sufficient for the UI to answer the question "is it appropriate to start the datafeed?"

I think there are two options:

  1. Leave the response with a single opened field that basically answers that question and change the docs to say that
  2. Revert the second commit on this PR and instead add a second field, "awaiting_lazy_assignment": true, to all three responses (i.e. open anomaly detection job, start datafeed, start data frame analytics job) and change the UI to || the two booleans for the result of its openJob() function

Interestingly start data frame analytics currently does return true for lazily assigned jobs - see

- so there was an inconsistency there before.

Which do you prefer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest a 3rd option?

We can stick to interpreting "opened" and "started" as "it is now opened|started".

The UI could simply check the status code for 200 to interpret "the request to open|start was successful" as it doesn't care about whether it is already "opened|started" by the time it receives the response.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sure, that is another option. I think that would be a trivial change in the UI code because already >= 400 status codes go through a different control path so it would basically be removing a single if.

I noticed that docs currently don't mention the possibility that opened could be false; it's currently documented that opened is always true, like an acknowledged response but with the field name changed. So doing it this way would mean updating the docs (which is not a problem if we go this way).

The final thing to consider is that starting a data frame analytics job does currently return an acknowledged response with a value hardcoded to true. So if it's important for the response to say whether the job was lazily opened then maybe in a followup PR we should change data frame analytics to return started with a true or false value depending on whether it was lazily assigned for consistency?

None of these options are particularly difficult to implement so I will wait until Monday morning once others have had a chance to comment before making any more code changes to avoid unnecessary churn.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Note that I don't feel strongly about whether opened is true when the job started lazily. I just wonder if it could be misleading. If we conclude that we're not concerned with this, then the simplest thing is to do what the PR is doing already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parity with analytics and anomaly is useful, as is treatment for the UI.
So opened: true plus a docs change seems acceptable.

Is there a possibility to add a second response status:opening?
We document that opened: true|false means that the instruction to open the job has been accepted (I suspect we would use acknoweldged:true if we were designing today) . And we can use the status to explain if this is happening immediately or slowly. Just a thought.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we would use acknoweldged:true if we were designing today

Yes, I think the least BWC-breaking way we can move towards that is to always have opened: true and that is what is documented (the documentation currently says the value is always true).

Is there a possibility to add a second response status:opening?

This is quite similar to the awaiting_lazy_assignment: true of my second option (although in my second option I was proposing to switch back to opened:false as well, which I think we are now agreeing is not the way to go).

I prefer awaiting_lazy_assignment: true over status:opening because if I was an end user and saw:

{
  "opened": true,
  "status": "opening"
}

then I would probably have to read the docs to work out how to interpret that. Also:

{
  "opened": true,
  "status": "opened"
}

is what most users would see, and that looks like it contains redundancy.

Whereas:

{
  "opened": true,
  "awaiting_lazy_assignment": true
}

and:

{
  "opened": true,
  "awaiting_lazy_assignment": false
}

seem easier to interpret without reading the docs.

So unless anyone strongly objects I will eventually implement that. But I'd rather get this PR merged ASAP since it fixes a nasty bug and then add awaiting_lazy_assignment across both job types and datafeeds in a different PR. It will touch quite a few files (since it means HLRC and docs changes for 3 endpoints) and will obscure the main fix in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, let's merge PR to fix the bug.

And let's treat the second field as a separate discussion for a future release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #54067 to discuss what to do in 7.8.

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@droberts195 droberts195 merged commit cbe063a into elastic:master Mar 24, 2020
@droberts195 droberts195 deleted the fix_lazy_start_datafeeds branch March 24, 2020 10:01
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Mar 24, 2020
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true.  Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started.  The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Backport of elastic#53918
droberts195 added a commit that referenced this pull request Mar 24, 2020
It is possible for ML jobs to open lazily if the "allow_lazy_open"
option in the job config is set to true.  Such jobs wait in the
"opening" state until a node has sufficient capacity to run them.

This commit fixes the bug that prevented datafeeds for jobs lazily
waiting assignment from being started.  The state of such datafeeds
is "starting", and they can be stopped by the stop datafeed API
while in this state with or without force.

Backport of #53918
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Datafeed does not start when allow_lazy_open is enabled
6 participants