[EPM] Indexing strategy docs #55301

Merged
merged 4 commits into from
Jan 23, 2020
37 changes: 36 additions & 1 deletion docs/epm/index.asciidoc
@@ -44,4 +44,39 @@ A user-specified string that will be used to part of the index name in Elasticse

==== Package

A package contains all the assets for the Elastic Stack. A more detailed definition of a package can be found under https://github.com/elastic/package-registry .


== Indexing Strategy

Ingest Management enforces an indexing strategy so that the system can automatically detect indices and run queries on them. In short, the indexing strategy looks as follows:

```
{type}-{namespace}-{dataset}
```

The `{type}` can be `logs` or `metrics`. The `{namespace}` is the part the user can choose freely. The only two requirements are that it contains only characters allowed in an Elasticsearch index name and does NOT contain a `-`. The `{dataset}` is defined by the data that is indexed. The same requirements as for the namespace apply. It is expected that the fields for type, namespace and dataset are part of each event and are constant keywords.
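
The naming rules above can be sketched as a small validator. This is only an illustration, not part of Ingest Management: the function name `build_index_name`, the `ALLOWED_TYPES` set and the character class in `_PART` are assumptions. The character class is a deliberately conservative subset of what Elasticsearch permits in index names.

```python
import re

# Assumption: only the two types named in the text are accepted.
ALLOWED_TYPES = {"logs", "metrics"}

# Assumption: a conservative subset of valid index-name characters
# (lowercase letters, digits, '.', '_'). '-' is excluded on purpose,
# because it is the separator between the three parts.
_PART = re.compile(r"[a-z0-9][a-z0-9._]*")


def build_index_name(type_: str, namespace: str, dataset: str) -> str:
    """Return '{type}-{namespace}-{dataset}' or raise ValueError."""
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"unsupported type: {type_!r}")
    for label, value in (("namespace", namespace), ("dataset", dataset)):
        if not _PART.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"{type_}-{namespace}-{dataset}"


print(build_index_name("logs", "default", "nginx.access"))
```

Rejecting `-` inside the namespace and dataset is what keeps the three parts recoverable from the index name by splitting on `-`.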

Note: More `{type}`s might be added in the future like `apm` and `endpoint`.
Contributor

I wonder how we decide when to create a new {type} and when to reuse logs. There's a large number of events that feel like logs, but are not logs. Like alerts, network flows, process executions, packetbeat transactions, etc. ECS's event.kind currently lists 6 values. Maybe there's an opportunity to reuse some of that?

Contributor Author

Good question. So far the focus was much more on which "types" we start with, and with filebeat and metricbeat the two most obvious were logs and metrics. In general I would like to keep the types to a minimum, as Elasticsearch will have to ship with a basic template for these types to make sure everything works out of the box.

The definition of the index should rely on ECS fields. It would have been nice to use {event.type}-{event.namespace}-{event.dataset}. Unfortunately event.type already has a different meaning. It seems what event.type is should actually be event.category.type ;-)

For event.kind: I think it is close but not the same. The value that surprised me at first is pipeline_error, as having an error in a pipeline still means the event belongs to the same dataset. It seems events from 1 log file can have different event.kind values.

I think there is also an opportunity for some apps to create their own type like using signal-*-* for SIEM as an example. As far as I understand, this index would only be created when SIEM is used for the first time and could be set up by the SIEM app?

Getting back to the initial question: How do we decide on the types? Let's turn the question around: What are the types you would need?

Contributor Author

Side note: Thinking about which fields this could fit in: stream.type, event.namespace, event.dataset. The reason for the stream.* is that it is what we currently plan to use as the config on the agent side.


This indexing strategy has a few advantages:

* Each index contains only the fields which are relevant for the dataset. This leads to more dense indices and better field completion.
* ILM policies can be applied per namespace per dataset.
* Rollups can be specified per namespace per dataset.
* Having the namespace user configurable makes setting security permissions possible.
* Having global metrics and logs templates allows new indices to be created on demand which still follow the convention. This is common in the case of k8s as an example.
* Constant keywords make it possible to narrow down very efficiently the indices we need to access for querying. This is especially relevant in environments with a large number of indices or with indices on slower nodes.
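
Because neither the namespace nor the dataset may contain a `-`, an index name can be split back into its three parts, which is what makes per-namespace and per-dataset selection straightforward. The following is a minimal sketch of that idea. The names `IndexName`, `parse_index_name` and `indices_for_namespace` are hypothetical, and the sketch assumes names follow the three-part form exactly, with no additional suffix.

```python
from typing import Iterable, List, NamedTuple


class IndexName(NamedTuple):
    type: str
    namespace: str
    dataset: str


def parse_index_name(name: str) -> IndexName:
    """Split '{type}-{namespace}-{dataset}' into its parts."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"not a three-part index name: {name!r}")
    return IndexName(*parts)


def indices_for_namespace(names: Iterable[str], namespace: str) -> List[str]:
    """Keep only the indices that belong to the given namespace."""
    return [n for n in names if parse_index_name(n).namespace == namespace]


names = ["logs-prod-nginx.access", "metrics-prod-system.cpu", "logs-dev-mysql"]
print(indices_for_namespace(names, "prod"))
```

The same grouping could serve as the basis for per-namespace permissions or per-dataset policies, though how those are actually applied is outside this sketch.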


=== Templates & ILM Policies

To make the above strategy possible, alias templates are required. For each type there is a basic alias template with a default ILM policy. These default templates apply to all indices which follow the indexing strategy and do not have a more specific dataset alias template.

The `metrics` and `logs` alias templates contain all the basic fields from ECS.

Each type template contains an ILM policy. Modifying this default ILM policy will affect all data covered by the default templates.
Contributor

@tsg tsg Jan 20, 2020

I'm curious, if you have an ILM policy for logs, and then another ILM policy for logs-*-mysql, will the latter overwrite the former? Is that achieved via template inheritance and order?

Contributor Author

We are currently working on a new concept called "Alias Templates" (still under discussion). This will allow different templates to be composed and will not have the magical inheritance which caused issues in the past. If there is a template for logs-*-mysql with a different ILM policy than logs-*-*, the mysql policy will be used for all mysql indices.
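
The "more specific template wins" behaviour described here can be illustrated with a toy resolver. This is not how Elasticsearch or the proposed Alias Templates resolve templates. It is a sketch under stated assumptions: the policy names in `TEMPLATES` are hypothetical, and specificity is approximated by counting the non-wildcard parts of a pattern.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional

# Hypothetical mapping of index pattern -> ILM policy name.
TEMPLATES: Dict[str, str] = {
    "logs-*-*": "logs-default-policy",
    "metrics-*-*": "metrics-default-policy",
    "logs-*-mysql": "logs-mysql-policy",
}


def specificity(pattern: str) -> int:
    """Count the parts of the pattern that are not a bare wildcard."""
    return sum(1 for part in pattern.split("-") if part != "*")


def resolve_policy(index_name: str, templates: Dict[str, str] = TEMPLATES) -> Optional[str]:
    """Return the policy of the most specific matching pattern, if any."""
    matches = [p for p in templates if fnmatchcase(index_name, p)]
    if not matches:
        return None
    return templates[max(matches, key=specificity)]


print(resolve_policy("logs-prod-mysql"))
print(resolve_policy("logs-prod-nginx"))
```

With these assumed templates, a mysql logs index picks up the mysql policy, while any other logs index falls back to the default logs policy.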


=== Defaults

If the Elastic Agent is used to ingest data and only the type is specified, `default` for the namespace is used and `generic` for the dataset.
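
As a minimal sketch of the defaulting rule above (the function name `index_name_with_defaults` is hypothetical and not part of the Elastic Agent):

```python
from typing import Optional


def index_name_with_defaults(
    type_: str,
    namespace: Optional[str] = None,
    dataset: Optional[str] = None,
) -> str:
    """Fill in 'default' and 'generic' when namespace or dataset is missing."""
    namespace = namespace or "default"
    dataset = dataset or "generic"
    return f"{type_}-{namespace}-{dataset}"


print(index_name_with_defaults("logs"))
print(index_name_with_defaults("metrics", namespace="prod"))
```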