
[SIEM][ML] Some jobs require 2 gigs of RAM per node in cloud #45316

Closed

FrankHassanabad opened this issue Sep 10, 2019 · 4 comments
Labels
bug (Fixes for quality problems that affect the customer experience) · Team: SecuritySolution (Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.) · Team:SIEM

Comments


FrankHassanabad commented Sep 10, 2019

Kibana version:
7.4.0-BC3 (cloud)

Original install method (e.g. download page, yum, from source, etc.):
Cloud

Describe the bug:
We show an error about not being able to install jobs that require more ML node memory on every page load whenever only 1 GB of RAM is dedicated to the ML node. Some jobs within the template require more than 1 GB of RAM to run.

Steps to reproduce:
If you select 1 gig of RAM for your ML node (which is the default) like so:
[Screenshot: cloud deployment configuration with the ML node set to 1 GB of RAM]

Then, when you go to the SIEM page, you will get errors on every page load and every time you click the Anomaly button.

[Screenshot: error toasts shown on the SIEM page]

Stack traces from the error toaster:

[status_exception] model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]"
      }
    ],
    "type": "status_exception",
    "reason": "model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]"
  },
  "status": 400
}
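
You can check what limit the cluster will actually enforce before any jobs are installed. Below is a minimal sketch (not part of the SIEM code) that queries the _ml/info API; it assumes node-fetch and basic-auth credentials, and the exact fields in the response vary by Elasticsearch version.

// Sketch only: ask Elasticsearch what ML memory limits it enforces.
// node-fetch and basic-auth credentials are illustrative assumptions.
import fetch from 'node-fetch';

async function getMlLimits(esUrl: string, auth: string): Promise<void> {
  const res = await fetch(`${esUrl}/_ml/info`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  const info = await res.json();
  // On a 1 GB cloud ML node the reported max_model_memory_limit (315mb here)
  // is well under the 512mb that the failing job asks for.
  console.log(info.limits);
}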

Expected behavior:
Do not spam users constantly about not being able to install jobs that require more memory. Instead, let the user know in a more UI/UX-friendly way that they cannot run some of the jobs with only 1 GB of memory.

Workarounds:
Bump your ML node from 1 GB of RAM up to 2 GB, even temporarily, and then load the SIEM page so it can install its jobs. Then you can bump it back down to 1 GB if you want to.

[Screenshot: cloud deployment ML node memory configuration]

Another option is to manually create, in the ML page, the jobs that require more memory, giving them dummy values so that the SIEM page does not try to create them itself. A sketch of this approach follows.
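
For that second workaround, here is a rough sketch of pre-creating the memory-heavy job with a smaller model_memory_limit via the standard anomaly detectors API. The detector and field names are dummy placeholders, not the real siem_auditbeat_ecs definitions, and whether the SIEM setup call then treats the job as already installed depends on the job id it expects.

// Sketch only: pre-create the oversized job with a limit that fits under the
// node's cap so the SIEM module setup does not try to create it again.
// The detector, fields, and limit value are illustrative placeholders.
import fetch from 'node-fetch';

async function precreateJob(esUrl: string, auth: string): Promise<void> {
  const body = {
    analysis_config: {
      bucket_span: '15m',
      detectors: [{ function: 'rare', by_field_name: 'process.name' }],
    },
    analysis_limits: { model_memory_limit: '256mb' }, // stays under the 315mb cap
    data_description: { time_field: '@timestamp' },
  };
  const res = await fetch(
    `${esUrl}/_ml/anomaly_detectors/linux_anomalous_process_all_hosts_ecs`,
    {
      method: 'PUT',
      headers: { Authorization: `Basic ${auth}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    }
  );
  console.log(res.status);
}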

@elasticmachine

Pinging @elastic/siem

FrankHassanabad added the bug label on Sep 10, 2019

spong commented Sep 11, 2019

@blaklaybul -- to clarify the implementation details: this issue happens on installation of the jobs. Since we're using the /api/ml/modules/setup/${configTemplate} API to install, we can't programmatically specify which jobs we want to install. So when we detect a 'missing' job (in this instance, the one that failed to install because of memory constraints), we keep trying to install via that API call, which results in all the other jobs failing to install (because they already exist), and then we show all of those errors, when really we only want to show the one error about the memory limitation. A rough sketch of that call is below.
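
For illustration only (not the actual SIEM code), the kind of call described above might look like this; the body fields are assumptions about this internal Kibana endpoint rather than a documented contract. The point is simply that the whole config template is installed in one request, with no per-job selection.

// Sketch only: install an entire ML config template through Kibana.
// The request body fields and auth handling here are assumptions.
import fetch from 'node-fetch';

async function setupModule(kibanaUrl: string, auth: string, configTemplate: string) {
  const res = await fetch(`${kibanaUrl}/api/ml/modules/setup/${configTemplate}`, {
    method: 'POST',
    headers: {
      Authorization: `Basic ${auth}`,
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
    },
    // Every job in the module is attempted; one genuine failure (the
    // memory-limited job) comes back mixed with "already exists" errors
    // for the jobs that were installed on an earlier attempt.
    body: JSON.stringify({ indexPatternName: 'auditbeat-*', startDatafeed: false }),
  });
  console.log(res.status);
}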

@sophiec20

Are the jobs failing to be created or are they failing to start?

If they are just failing to start due to space limitations, then you can start individual jobs one at a time using force_start_datafeeds.

We are thinking through how we can better handle memory management with respect to many jobs, large jobs, and cloud free tiers. This is a work in progress.

cc @droberts195


spong commented Sep 11, 2019

They're failing to be created. In this instance it's just the linux_anomalous_process_all_hosts_ecs job within the siem_auditbeat_ecs config template that's failing, as it has a model_memory_limit of 512mb (and a default cloud deployment gives the ML node only 1GB of memory which sets xpack.ml.max_model_memory_limit to be 315mb).

Also, for reference, we do use the force_start_datafeeds API to start the jobs, and as far as I'm aware, we haven't run into any additional issues when trying to start jobs when there are memory constraints. If there isn't enough memory, the job will fail to start, and we'll present the error to the user in an error toast. This error flow is definitely something we can improve upon though.
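
As a rough sketch of that start flow (the endpoint path and body shape are assumptions about Kibana's internal ML APIs as used here, not a documented interface):

// Sketch only: force-start the datafeeds for already-created jobs.
// If a job cannot be allocated (e.g. not enough ML node memory), the
// per-datafeed failure is what ends up in the error toast.
import fetch from 'node-fetch';

async function forceStartDatafeeds(kibanaUrl: string, auth: string, datafeedIds: string[]) {
  const res = await fetch(`${kibanaUrl}/api/ml/jobs/force_start_datafeeds`, {
    method: 'POST',
    headers: {
      Authorization: `Basic ${auth}`,
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
    },
    body: JSON.stringify({ datafeedIds, start: Date.now() }),
  });
  console.log(await res.json());
}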
