[Logs UI] Enhance the Logs UI Analysis setup flow #44407

Closed
Kerry350 opened this issue Aug 29, 2019 · 9 comments · Fixed by #44990
Assignees: Kerry350
Labels: Feature:Logs UI, Team:Infra Monitoring UI, v7.5.0, v8.0.0

Comments


Kerry350 commented Aug 29, 2019

Summary

The analysis setup page should be enhanced to use a wizard / steps style flow. This will utilise a new API to perform cleanup / reset if needed in a failure scenario.

Rationale

In #43774 some cleanup functionality was introduced to try to handle failure scenarios during setup. However, it isn't 100% robust and isn't actually visible to the user.

Acceptance criteria

  • The Analysis setup page should be enhanced to introduce a wizard / steps style flow. The UI will revolve mainly around the EUI Steps component (a rough sketch follows this list).

  • Step 1 will contain explanatory text and configurable options (as it stands, these are the start and end times), with a setup button. When the setup button is clicked, step 2 stops being disabled.

  • Step 2, once enabled, should display the setup progress / loading state.

    • If this succeeds, the step transitions to a complete state and displays a "View results" button. When clicked, this goes to the results screen.
    • If this fails, the step transitions to a danger state, displays an error message, and offers a "Try again" button.
      • If the "Try again" button is clicked, the step will display the progress of the retry operation, then loop back to the aforementioned complete or danger state depending on the result.
  • A new API endpoint will be provided to perform the cleanup / reset. The useLogAnalysisCleanup hook should be altered to make use of this new endpoint. In future work this reset API will also be extended to a dedicated reset page (which users can invoke themselves at any time post setup).
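
Purely for illustration, here is a minimal sketch of how the two steps could map onto the EUI Steps component. The `setupStatus` union, the handler props, and the exact status mapping are assumptions for the sketch, not the final implementation:

```tsx
import React from 'react';
import { EuiButton, EuiLoadingSpinner, EuiSteps } from '@elastic/eui';

// Hypothetical status union for illustration only.
type SetupStatus = 'notStarted' | 'pending' | 'failed' | 'succeeded';

interface AnalysisSetupStepsProps {
  setupStatus: SetupStatus;
  onSetup: () => void; // kicks off the ML module setup
  onRetry: () => void; // cleanup via the new endpoint, then set up again
  onViewResults: () => void;
}

export const AnalysisSetupSteps: React.FC<AnalysisSetupStepsProps> = ({
  setupStatus,
  onSetup,
  onRetry,
  onViewResults,
}) => (
  <EuiSteps
    steps={[
      {
        title: 'Configure the analysis',
        children: (
          // Explanatory text and the optional start / end time pickers
          // would live here, alongside the setup button.
          <EuiButton onClick={onSetup} isDisabled={setupStatus === 'pending'}>
            Create ML jobs
          </EuiButton>
        ),
      },
      {
        title: 'Set up the ML jobs',
        // Step 2 stays disabled until setup is started, then reflects
        // progress, failure (danger), or success (complete).
        status:
          setupStatus === 'notStarted'
            ? 'disabled'
            : setupStatus === 'failed'
            ? 'danger'
            : setupStatus === 'succeeded'
            ? 'complete'
            : 'incomplete',
        children:
          setupStatus === 'pending' ? (
            <EuiLoadingSpinner size="m" />
          ) : setupStatus === 'failed' ? (
            <EuiButton color="danger" onClick={onRetry}>
              Try again
            </EuiButton>
          ) : setupStatus === 'succeeded' ? (
            <EuiButton fill onClick={onViewResults}>
              View results
            </EuiButton>
          ) : null,
      },
    ]}
  />
);
```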

Kerry350 added the v8.0.0, Feature:Logs UI, Team:Infra Monitoring UI, v7.5.0, :Logs UI labels on Aug 29, 2019
Kerry350 self-assigned this on Aug 29, 2019
elasticmachine commented

Pinging @elastic/infra-logs-ui


Kerry350 commented Sep 3, 2019

Okay, I've got a first pass at this implemented on a branch. Everything can be refined, of course 😄 @weltenwort @jasonrhodes @hbharding @Zacqary - wondering if you have any direct feedback here before I open a PR, especially around the caveats.

There are two steps, utilising the EUI Steps component.

Visuals

Step 1:

By default: [screenshot]

When changing the start and / or end times: [screenshot]

Interaction with this step is entirely optional for now. In the future this would include extra parameters we need.

Step 2:

By default: [screenshot]

Contains the "Create resources" button. Interaction with this button then transitions into 3 states, they are:

Setting up or retrying after a failure: [screenshot]

Failure: [screenshot]

Clicking "Try again" transitions into the above state in a retry.

Success: [screenshot]

Shows the "View results" button which transitions through to the results page.

GIFs:

Success flow: [gif]

Failure flow: [gif]

Behind the scenes

When the "Create resources" (I changed this language from "jobs" as I don't think we need the user to know or understand ML terminology) button is clicked a call is made to setup the analysis ML module.

If we receive any errors in the response payload - for the jobs or datafeeds - we transition into the failure state. The important thing to note here is that the jobs and datafeeds may have been created successfully, but there might just be an error starting the datafeed.

The user may now request to "Try again"; this makes a call to our cleanup hook, which currently (totally subject to enhancement) deletes the jobs and the datafeeds. Once cleaned up, another request is made to set up the ML module.

If everything is successful, we simply allow transitioning over to the results.
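
Roughly sketched, the retry sequence looks like this. The endpoint paths, payload, and response shape below are placeholders for illustration, not the actual API; the real calls would go through the Kibana HTTP service and the useLogAnalysisCleanup hook:

```ts
// Illustrative only: paths and shapes are assumptions, not the real API.
async function cleanupAndRetrySetup(
  sourceId: string,
  start?: number,
  end?: number
): Promise<'succeeded' | 'failed'> {
  // 1. Tear down any partially created jobs and datafeeds first.
  await fetch(`/api/infra/log_analysis/cleanup/${sourceId}`, { method: 'POST' });

  // 2. Ask for the ML module to be set up again.
  const response = await fetch(`/api/infra/log_analysis/setup/${sourceId}`, {
    method: 'POST',
    body: JSON.stringify({ start, end }),
  });
  const { jobs, datafeeds } = await response.json();

  // 3. Any error on a job or a datafeed - including a datafeed that was
  // created but failed to start - counts as a setup failure.
  const hasErrors = [...jobs, ...datafeeds].some(
    (resource: { error?: unknown }) => resource.error !== undefined
  );
  return hasErrors ? 'failed' : 'succeeded';
}
```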

Caveats

There is a caveat in that, as mentioned above, the jobs and datafeeds may have been created successfully, but something went wrong starting the datafeed. During the explicit setup flow this is immediately flagged as a failure (if there was an error in the payload) and we show an error message and allow retrying.

However, after that failure, if the user were to refresh Kibana they would actually land on the results page. This is because the jobs and datafeeds were in fact created successfully; the datafeed just didn't start. Our check for "should we show setup or results" allows a merely created state to transition to results; the datafeed doesn't have to be started (sketched just below).
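
Reduced to a sketch, with illustrative state names:

```ts
// "Should we show setup or results?" - a created job is enough; the
// datafeed does not also have to be started. Names are illustrative.
type JobState = 'missing' | 'created' | 'started';

const shouldShowResults = (jobState: JobState): boolean =>
  jobState === 'created' || jobState === 'started';
```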

Now, we could of course extend the check to require everything to be both created and started, but I don't think that's the way to go. There are legitimate reasons for stopping the datafeed (saving resources etc.).

With the introduction of the new health status feature in #44414 I think this is a non-issue as the results page would flag up the fact that the datafeed isn't started.

The reason I flag this is purely because it might be disconcerting that you can go from a failure state on the setup page to a results page after the fact. Maybe we skip flagging start failures as a failure during setup? I'm not sure.

Also, with retrying moving to an explicit "Try again" action, if a user has a failed setup, then comes back at some later date and tries to set up again, it will fail at first ("you have a job with this ID" and so on). I tried to mitigate this before by having the cleanup happen directly after a failure, but I'm not sure that's right either.

Number of steps

@weltenwort I wanted to briefly touch on why I circled back to 2 steps after we discussed having 3. With 3 steps we actually end up with three problems:

  1. We end up with a "dead state" in step 2 after everything is successful. Step 3 would hold the success state with "View results", but that leaves nothing to comfortably sit in step 2 at that point, short of duplicating content.

  2. We end up disabling step 3 during the setup process in step 2.

  3. The EUI docs say step states are "used mostly as a final step when you need to make some sort of final check"; with 3 steps we really end up with two steps with a state.

And whilst I have absolutely no data to back it up, I think having 2 steps over 3 with nothing disabled will just feel like the process was "easier".


weltenwort commented Sep 3, 2019

I think that looks awesome, well done!

"Create resources" (I changed this language from "jobs" as I don't think we need the user to know or understand ML terminology)

While I agree that we don't want to burden the user with too many implementation details, "Create resources" sounds almost too abstract. By mentioning ML jobs we have a few advantages, I think:

  • the user notices that this is something that is based on the ML features she potentially purchased
  • the user has an indication of where to check if something goes wrong and "retry" doesn't fix the problem

However, after that failure if the user were to refresh Kibana they would actually land on the results page.

As you said, I don't think that will be an issue for long.

And whilst I have absolutely no data to back it up, I think having 2 steps over 3 with nothing disabled will just feel like the process was "easier".

The two steps look just fine to me. 👍

While this is not a PR review, one thing that popped out to me was how uninformative the pending state looks with just the spinner. Maybe we can add some "Creating ML jobs..." text beside it?

Overall I think this will help us scale better with more complex parameter choices and failure conditions. 🚀

weltenwort commented

ℹ️ Talking about failure conditions, I documented an edge case in #44652.


Kerry350 commented Sep 4, 2019

By mentioning ML jobs we have a few advantages, I think:

Fair points, I'll change this back.

one thing that popped out to me was how uninformative the pending state looks with just the spinner. Maybe we can add a "Creating ML jobs..." text besides that?

Thanks for the suggestion. I'll make that state a bit more informative. 👍

With regards to #44652, I'll add mitigation 1 (stopping the datafeed first) to this work, since it touches that area anyway (unless there are any objections).

jasonrhodes commented

Looks great to me! I think the success state could stand to have a bit of text too, maybe just "Your jobs have all been set up!" or "All jobs have been successfully set up." or something similar to that?


Zacqary commented Sep 6, 2019

Setup flow looks great! I'm actually just noticing a discrepancy in Step 1's wording: "by default, we'll analyze..." vs "by default, Machine Learning analyzes." Let's also use this next iteration to get those two messages consistent.

sophiec20 commented

+1 to reverting back to using familiar ML terminology, i.e. "jobs". Some users will already be familiar with ML, and it also makes it easier to relate to the ML docs if required.

From the UI mockups above, it does not seem that the ML errors are being passed back to the user. If the jobs are being started using the module setup (startDatafeed: true) Kibana API, then I believe an error object is returned containing the Elasticsearch error reasons.
Are the full errors being displayed? Perhaps they will be visible as a toast... This would greatly help the user to self-diagnose.
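
For instance, something along these lines could collect the reasons for display (e.g. in a toast); the response item shape here is an assumption based on the description above, not the documented API:

```ts
// Hypothetical shape of one job / datafeed entry in the module setup
// response; field names are assumptions for illustration.
interface SetupResponseItem {
  id: string;
  success?: boolean;
  error?: { msg?: string; body?: { error?: { reason?: string } } };
}

// Collect human-readable reasons so they can be surfaced to the user.
function collectErrorReasons(items: SetupResponseItem[]): string[] {
  return items
    .filter((item) => item.error !== undefined)
    .map((item) => {
      const reason =
        item.error?.body?.error?.reason ?? item.error?.msg ?? 'unknown error';
      return `${item.id}: ${reason}`;
    });
}
```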

I'd also like to ask if the concept of "Try again" could perhaps be revisited. In reading through some of the connected tickets I see a few reasons why a retry was proposed:

  • If the index did not exist - rather than wait for the ML setup to return failure, perhaps this could be a pre-req check, i.e. if no documents exist in the index then the button should be disabled so the user never tries to create the jobs.
  • If the ML setup fails because the job already exists - is this really a "permanent broken state"? If the job already exists, it may be running and healthy. Alternatively, the job may have been created but the job failed to open or the datafeed failed to start. In which case, jobs could just be started (subject to the underlying error) rather than deleted then started.

I know I do not have the full history here, so sorry to go over old ground, but this retry path has not been taken by other integrations and it seems to presume failures. If this is the case, then it's important for the ML team to understand why.


weltenwort commented Sep 10, 2019

If the index did not exist - rather than wait for the ML setup to return failure, perhaps this could be a pre-req check. i.e. if no documents exist in the index then the button should be disabled so the user never tries to create the jobs.

That's a good point, we should do that. 👍
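
A minimal sketch of such a pre-requisite check on the server side, assuming a helper with access to an Elasticsearch client (the client wiring and index pattern are placeholders):

```ts
import { Client } from '@elastic/elasticsearch';

// Returns true when the log indices contain at least one document, so
// the UI can disable the setup button otherwise. Illustrative only.
async function logIndicesHaveDocuments(
  client: Client,
  indexPattern: string
): Promise<boolean> {
  const response = await client.count({ index: indexPattern });
  return response.body.count > 0;
}
```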

If the ML setup fails because the job already exists - is this really a "permanent broken state"? If the job already exists, it may be running and healthy. Alternatively, the job may have been created but the job failed to open or the datafeed failed to start. In which case, jobs could just be started (subject to the underlying error) rather than deleted then started.

If the job was running and healthy the setup screen wouldn't be shown. The idea of the retry is to (a) clean up inconsistent leftovers ourselves and (b) give the user the opportunity to fix the cause of the failure somehow (in a terminal or another tab) and try again. Surfacing the ML error as you suggested above sounds particularly useful to that end. In any case, we don't want to let the user sit there without the ability to take any action.
