Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Task Manager fails when inline scripts are disabled #49853

Closed
legrego opened this issue Oct 31, 2019 · 14 comments · Fixed by #94870
Closed

Task Manager fails when inline scripts are disabled #49853

legrego opened this issue Oct 31, 2019 · 14 comments · Fixed by #94870
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@legrego
Copy link
Member

legrego commented Oct 31, 2019

Kibana version: Noticed on 7.5, probably earlier as well

When inline scripts are disabled in Elasticsearch (docs), Kibana's task manager repeatedly logs the following error:

server    log   [07:07:41.457] [error][task_manager] Failed to poll for work: [illegal_argument_exception] cannot execute [inline] scripts :: {"path":"/.kibana_task_manager/_update_by_query","query":{"ignore_unavailable":true,"refresh":true,"max_docs":10,"conflicts":"proceed"},"body":"{\"query\":{\"bool\":{\"must\":[{\"term\":{\"type\":\"task\"}},{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"bool\":{\"must\":[{\"term\":{\"task.status\":\"idle\"}},{\"range\":{\"task.runAt\":{\"lte\":\"now\"}}}]}},{\"bool\":{\"must\":[{\"bool\":{\"should\":[{\"term\":{\"task.status\":\"running\"}},{\"term\":{\"task.status\":\"claiming\"}}]}},{\"range\":{\"task.retryAt\":{\"lte\":\"now\"}}}]}}]}},{\"bool\":{\"should\":[{\"exists\":{\"field\":\"task.interval\"}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"vis_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}},{\"bool\":{\"must\":[{\"term\":{\"task.taskType\":\"lens_telemetry\"}},{\"range\":{\"task.attempts\":{\"lt\":3}}}]}}]}}]}}]}},\"sort\":{\"_script\":{\"type\":\"number\",\"order\":\"asc\",\"script\":{\"lang\":\"expression\",\"source\":\"doc['task.retryAt'].value || doc['task.runAt'].value\"}}},\"seq_no_primary_term\":true,\"script\":{\"source\":\"ctx._source.task.ownerId=params.ownerId; ctx._source.task.status=params.status; ctx._source.task.retryAt=params.retryAt;\",\"lang\":\"painless\",\"params\":{\"ownerId\":\"kibana:5b2de169-2785-441b-ae8c-186a1936b17d\",\"retryAt\":\"2019-10-31T11:08:11.330Z\",\"status\":\"claiming\"}}}","statusCode":400,"response":"{\"error\":{\"root_cause\":[{\"type\":\"illegal_argument_exception\",\"reason\":\"cannot execute [inline] scripts\"}],\"type\":\"search_phase_execution_exception\",\"reason\":\"all shards failed\",\"phase\":\"query\",\"grouped\":true,\"failed_shards\":[{\"shard\":0,\"index\":\".kibana_task_manager_1\",\"node\":\"3iqPAeBbRwi6VaKknwWwdw\",\"reason\":{\"type\":\"illegal_argument_exception\",\"reason\":\"cannot execute [inline] scripts\"}}],\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"cannot execute [inline] scripts\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"cannot execute [inline] scripts\"}}},\"status\":400}"}

Expected Behavior
The Task Manager should ideally be able to function without inline (or stored) scripts. If that is not possible, then the task manager should probably fail more gracefully instead of polluting the logs. If scripts are a hard requirement for Task Manager to function, then we should at least document it as such.

@legrego legrego added bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:Stack Services labels Oct 31, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@gmmorris
Copy link
Contributor

I've put together a simple fix where we identify when this happens and instead of throwing an error, it logs:

server    log   [13:36:31.774] [task_manager][warn] Task Manager cannot operate when inline scripts are disabled in Elasticsearch

This makes messaging better, but doesn't solve the issue that Task Manager will no longer work if inline scripts are disabled.

We'll have to have a discussion about whether this is a reasonable situation or not.

@legrego
Copy link
Member Author

legrego commented Oct 31, 2019

@gmmorris thanks for the fast response on this issue! I only discovered this because I'm working on an unrelated feature that needs to degrade gracefully if inline -or- stored scripts are disabled in Elasticsearch.

Since it sounds like we'd have at least 2 consumers of this information, I wonder if it'd make sense to have a common service that plugins can use to get this information up-front, instead of parsing an error message after the fact.

@/jasontedor mentioned that we can use the _nodes/settings API to figure out if the script settings have been changed from their defaults. I have a POC in my local feature branch which uses this, and appears to be working:

interface NodeSettingsResponse {
  nodes?: {
    [nodeId: string]: {
      settings: {
        script: {
          allowed_types?: string[];
          allowed_contexts?: string[];
        };
      };
    };
  };
}

const nodeScriptSettings: NodeSettingsResponse = await clusterClient.callAsInternalUser(
        'transport.request',
        {
          method: 'GET',
          path: '/_nodes/settings?filter_path=nodes.*.settings.script',
        }
      );

      const usingDefaultScriptSettings = !nodeScriptSettings.nodes;

      const canUseStoredScripts =
        usingDefaultScriptSettings ||
        Object.values(nodeScriptSettings.nodes!).some(node => {
          const allowedTypes = node.settings.script.allowed_types;
          return !allowedTypes || allowedTypes.includes('stored');
        });

      const canUseInlineScripts =
        usingDefaultScriptSettings ||
        Object.values(nodeScriptSettings.nodes!).some(node => {
          const allowedTypes = node.settings.script.allowed_types;
          return !allowedTypes || allowedTypes.includes('inline');
        });

What do you think?

@gmmorris
Copy link
Contributor

gmmorris commented Nov 1, 2019

Hmm, that's interesting.
Obviously we'd rather detect that early on and avoid causing the error in the first place (even if it means preemptively disabling TM)... for now I think the priority is handling the error gracefully.

I'll try and get that PR merged today so we at least have better messaging around it.

@gmmorris
Copy link
Contributor

gmmorris commented Nov 1, 2019

Regarding NodeSettings I think it would be best if we only had to make that call once, rather than for every plugin to make the call independently, but I'm not sure what the best place for that to live would be. 🤔

@legrego
Copy link
Member Author

legrego commented Nov 1, 2019

Regarding NodeSettings I think it would be best if we only had to make that call once, rather than for every plugin to make the call independently, but I'm not sure what the best place for that to live would be. 🤔

Yep, I agree we shouldn't ask each plugin to make this call -- In the legacy platform, the xpack_main plugin is responsible for querying the _xpack endpoint on a fixed interval, which is similarly used by multiple plugins.

I could see us introducing either a rather specific es_node_settings plugin, or a more general es_info plugin, which could potentially query other endpoints as needed (such as _cluster/settings, _node/settings, _xpack, _xpack/usage, etc) @elastic/kibana-platform / @elastic/kibana-stack-services any thoughts on this?

@gmmorris
Copy link
Contributor

gmmorris commented Nov 1, 2019

A sync discussion raised a couple of ideas to address this properly:

  1. We could "ring fence" certain platform operations so that they can perform operations that have been disabled for user land such as inline scripts. This would allow us to use these features for the stack, without forcing users to allow their own userbase to use these features.
  2. We could tell customers that Kibana needs scripts enabled on Elasticsearch to work properly. We've already found that scripts are used in a few places (Scripted fields, aggregations?, saved objects incrementer fn, 7.4 migrations and now TM)

Further discussion is needed.

@gmmorris
Copy link
Contributor

gmmorris commented Nov 1, 2019

I've merged the improved messaging, but keeping this issue open as TM will still be disabled from 7.5 onwards if the user disables inline scripts

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@ymao1
Copy link
Contributor

ymao1 commented Mar 4, 2021

This is still an open question what we should do in this case. Should we add documentation for this limitation?

@mikecote
Copy link
Contributor

mikecote commented Mar 4, 2021

From conversations with @kobelb a while ago, I think as long as Kibana doesn't crash, we're ok.

@kobelb
Copy link
Contributor

kobelb commented Mar 4, 2021

The prior thinking is that Kibana has not been explicit about needing scripting to be enabled in Elasticsearch. However, various features in Kibana have become dependent on scripting being enabled. Additionally, we haven't had any reports from users about this causing issues.

Therefore, during 7.x we should ensure that Kibana continues to generally work when scripting is disabled in Elasticsearch, but we're comfortable with some features not working.

Starting in 8.0, we should explicitly document that scripting must be enabled in Elasticsearch and we're comfortable with Kibana failing in a fiery ball of glory when scripting is disabled.

@gmmorris
Copy link
Contributor

gmmorris commented Mar 9, 2021

It's worth verifying that TM doesn't keep polling when it encounters this error.
Previously we just made sure it didn't crash, but now that we have a mechanism for reducing the polling frequency (such as when ES throws errors), it's worth ensuring we handle this state correctly and dial back TM so it doesn't keep polling and logging this error forever.

@mikecote
Copy link
Contributor

mikecote commented Mar 9, 2021

For 7.x: Docs, ensure we're not spamming the console, ensure we're not crashing Kibana.

@YulNaumenko YulNaumenko self-assigned this Mar 9, 2021
@YulNaumenko
Copy link
Contributor

YulNaumenko commented Mar 17, 2021

Summary of the things to fix:

ensure we're not spamming the console

  • We are doing endless spamming now:

Screen Shot 2021-03-17 at 1 45 13 PM

  • As you can see message is bad. The fix introduced in the PR is not working any more, since we migrated TM to the new ES client :-)

ensure we're not crashing Kibana

  • Kibana is starting normally, no need to fix.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)
Projects
None yet
8 participants