
[SIEM][ML] Some jobs require 2 gigs of RAM per node in cloud #45316

Closed

FrankHassanabad opened this issue Sep 10, 2019 · 4 comments
Labels
bug (Fixes for quality problems that affect the customer experience) · Team: SecuritySolution (Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.) · Team:SIEM

Comments


FrankHassanabad commented Sep 10, 2019

Kibana version:
7.4.0-BC3 (cloud)

Original install method (e.g. download page, yum, from source, etc.):
Cloud

Describe the bug:
We show an error about not being able to install jobs that require more ML node memory on every page load whenever only 1 GB of RAM is dedicated to the ML node. Some jobs within the template require more than 1 GB of RAM to run.

Steps to reproduce:
If you select 1 gig of RAM for your ML node (which is the default) like so:
[Screenshot: cloud deployment configuration with the ML node set to 1 GB of RAM]

Then, when you go to the SIEM page, you will get errors on every page load and every time you click the Anomaly button.

[Screenshot: error toasts shown on the SIEM page]

Stack traces from the error toaster:

[status_exception] model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]"
      }
    ],
    "type": "status_exception",
    "reason": "model_memory_limit [512mb] must be less than the value of the xpack.ml.max_model_memory_limit setting [315mb]"
  },
  "status": 400
}
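
You can check what limit the cluster will actually enforce before any jobs are installed. Below is a minimal sketch (not part of the SIEM code) that queries the _ml/info API; it assumes node-fetch and basic-auth credentials, and the exact fields in the response vary by Elasticsearch version.

// Sketch only: ask Elasticsearch what ML memory limits it enforces.
// node-fetch and basic-auth credentials are illustrative assumptions.
import fetch from 'node-fetch';

async function getMlLimits(esUrl: string, auth: string): Promise<void> {
  const res = await fetch(`${esUrl}/_ml/info`, {
    headers: { Authorization: `Basic ${auth}` },
  });
  const info = await res.json();
  // On a 1 GB cloud ML node the reported max_model_memory_limit (315mb here)
  // is well under the 512mb that the failing job asks for.
  console.log(info.limits);
}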

Expected behavior:
Do not spam users constantly about not being able to install jobs that require more memory. Instead, let the user know in a more UI/UX-friendly way that they cannot run some of the jobs with only 1 GB of memory.

Workarounds:
Bump your ML node from 1 GB of RAM up to 2 GB, even temporarily, and then load the SIEM page so it can install its jobs. Then you can bump it back down to 1 GB if you want to.

[Screenshot: cloud deployment ML node memory configuration]

Another option is to manually create, in the ML page, the jobs that require more memory, giving them dummy values so that the SIEM page does not try to create them itself. A sketch of this approach follows.
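
For that second workaround, here is a rough sketch of pre-creating the memory-heavy job with a smaller model_memory_limit via the standard anomaly detectors API. The detector and field names are dummy placeholders, not the real siem_auditbeat_ecs definitions, and whether the SIEM setup call then treats the job as already installed depends on the job id it expects.

// Sketch only: pre-create the oversized job with a limit that fits under the
// node's cap so the SIEM module setup does not try to create it again.
// The detector, fields, and limit value are illustrative placeholders.
import fetch from 'node-fetch';

async function precreateJob(esUrl: string, auth: string): Promise<void> {
  const body = {
    analysis_config: {
      bucket_span: '15m',
      detectors: [{ function: 'rare', by_field_name: 'process.name' }],
    },
    analysis_limits: { model_memory_limit: '256mb' }, // stays under the 315mb cap
    data_description: { time_field: '@timestamp' },
  };
  const res = await fetch(
    `${esUrl}/_ml/anomaly_detectors/linux_anomalous_process_all_hosts_ecs`,
    {
      method: 'PUT',
      headers: { Authorization: `Basic ${auth}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    }
  );
  console.log(res.status);
}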

@elasticmachine

Pinging @elastic/siem

FrankHassanabad added the bug label on Sep 10, 2019

spong commented Sep 11, 2019

@blaklaybul -- to clarify the implementation details: this issue happens on installation of the jobs. Since we're using the /api/ml/modules/setup/${configTemplate} API to install, we can't programmatically specify which jobs we want to install. So when we detect a 'missing' job (in this instance, the one that failed to install because of memory constraints), we keep trying to install via that API call, which results in all the other jobs failing to install (because they already exist), and then we show all of those errors, when really we only want to show the one error about the memory limitation. A rough sketch of that call is below.
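
For illustration only (not the actual SIEM code), the kind of call described above might look like this; the body fields are assumptions about this internal Kibana endpoint rather than a documented contract. The point is simply that the whole config template is installed in one request, with no per-job selection.

// Sketch only: install an entire ML config template through Kibana.
// The request body fields and auth handling here are assumptions.
import fetch from 'node-fetch';

async function setupModule(kibanaUrl: string, auth: string, configTemplate: string) {
  const res = await fetch(`${kibanaUrl}/api/ml/modules/setup/${configTemplate}`, {
    method: 'POST',
    headers: {
      Authorization: `Basic ${auth}`,
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
    },
    // Every job in the module is attempted; one genuine failure (the
    // memory-limited job) comes back mixed with "already exists" errors
    // for the jobs that were installed on an earlier attempt.
    body: JSON.stringify({ indexPatternName: 'auditbeat-*', startDatafeed: false }),
  });
  console.log(res.status);
}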

@sophiec20

Are the jobs failing to be created or are they failing to start?

If they are just failing to start due to space limitations, then you can start individual jobs one at a time using force_start_datafeeds.

We are thinking through how we can better handle memory management with respect to many jobs, large jobs, and cloud free tiers. This is a work in progress.

cc @droberts195


spong commented Sep 11, 2019

They're failing to be created. In this instance it's just the linux_anomalous_process_all_hosts_ecs job within the siem_auditbeat_ecs config template that's failing, as it has a model_memory_limit of 512mb (and a default cloud deployment gives the ML node only 1GB of memory which sets xpack.ml.max_model_memory_limit to be 315mb).

Also, for reference, we do use the force_start_datafeeds API to start the jobs, and as far as I'm aware, we haven't run into any additional issues when trying to start jobs when there are memory constraints. If there isn't enough memory, the job will fail to start, and we'll present the error to the user in an error toast. This error flow is definitely something we can improve upon though.
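
As a rough sketch of that start flow (the endpoint path and body shape are assumptions about Kibana's internal ML APIs as used here, not a documented interface):

// Sketch only: force-start the datafeeds for already-created jobs.
// If a job cannot be allocated (e.g. not enough ML node memory), the
// per-datafeed failure is what ends up in the error toast.
import fetch from 'node-fetch';

async function forceStartDatafeeds(kibanaUrl: string, auth: string, datafeedIds: string[]) {
  const res = await fetch(`${kibanaUrl}/api/ml/jobs/force_start_datafeeds`, {
    method: 'POST',
    headers: {
      Authorization: `Basic ${auth}`,
      'Content-Type': 'application/json',
      'kbn-xsrf': 'true',
    },
    body: JSON.stringify({ datafeedIds, start: Date.now() }),
  });
  console.log(await res.json());
}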
