[EPM] Indexing strategy docs #55301
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -44,4 +44,39 @@ A user-specified string that will be used to part of the index name in Elasticse | |
|
||
==== Package | ||
|
||
A package contains all the assets for the Elastic Stack. A more detailed definition of a package can be found under https://github.com/elastic/package-registry . | ||
|
||
|
||
== Indexing Strategy | ||
|
||
Ingest Management enforces an indexing strategy that allows the system to automatically detect indices and run queries on them. In short, the indexing strategy is as follows: | ||
|
||
``` | ||
{type}-{namespace}-{dataset} | ||
``` | ||
|
||
The `{type}` can be `logs` or `metrics`. The `{namespace}` is the part the user can choose freely. The only two requirements are that it contains only characters allowed in an Elasticsearch index name and does NOT contain a `-`. The `{dataset}` is defined by the data that is indexed. The same requirements as for the namespace apply. It is expected that the fields for type, namespace and dataset are part of each event and are constant keywords. | ||
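As a rough illustration of the naming rules above, the following minimal sketch builds an index name and rejects invalid parts. The names `build_index_name`, `validate_part`, `FORBIDDEN_CHARS` and `KNOWN_TYPES` are hypothetical and not part of Ingest Management, and the character check covers only a simplified subset of the Elasticsearch index name restrictions.

```python
# Hypothetical helper names; only a simplified subset of the
# Elasticsearch index name restrictions is checked here.
FORBIDDEN_CHARS = set('\\/*?"<>| ,#:')
KNOWN_TYPES = {"logs", "metrics"}


def validate_part(label: str, value: str) -> None:
    """Raise ValueError if a namespace or dataset breaks the naming rules."""
    if not value:
        raise ValueError(f"{label} must not be empty")
    if "-" in value:
        # '-' is the separator, so it may not appear inside a part.
        raise ValueError(f"{label} must not contain '-': {value!r}")
    if value != value.lower():
        raise ValueError(f"{label} must be lowercase: {value!r}")
    bad = FORBIDDEN_CHARS.intersection(value)
    if bad:
        raise ValueError(f"{label} contains forbidden characters: {sorted(bad)}")


def build_index_name(type_: str, namespace: str, dataset: str) -> str:
    """Return the {type}-{namespace}-{dataset} name for valid inputs."""
    if type_ not in KNOWN_TYPES:
        raise ValueError(f"unknown type: {type_!r}")
    validate_part("namespace", namespace)
    validate_part("dataset", dataset)
    return f"{type_}-{namespace}-{dataset}"


print(build_index_name("logs", "prod", "nginx"))
```

Because neither the namespace nor the dataset may contain a `-`, a name built this way can always be split back into exactly three parts.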
|
||
Note: More `{type}`s might be added in the future like `apm` and `endpoint`. | ||
|
||
This indexing strategy has a few advantages: | ||
|
||
* Each index contains only the fields which are relevant for the dataset. This leads to denser indices and better field completion. | ||
* ILM policies can be applied per namespace per dataset. | ||
* Rollups can be specified per namespace per dataset. | ||
* A user-configurable namespace makes it possible to set security permissions. | ||
* Having global metrics and logs templates allows new indices that still follow the convention to be created on demand. This is common with k8s, for example. | ||
* Constant keywords make it possible to narrow down very efficiently which indices a query needs to access. This is especially relevant in environments with a large number of indices or with indices on slower nodes. | ||
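To illustrate how the convention lets indices be detected and narrowed down, here is a small sketch that parses names and selects them with wildcard patterns. The helpers `parse_index_name` and `select_indices` and the sample index names are assumptions made for this example; the matching is done locally with Python's `fnmatch` and only approximates how a wildcard pattern such as `logs-*-*` would be resolved.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple


class IndexName(NamedTuple):
    type: str
    namespace: str
    dataset: str


def parse_index_name(name: str) -> IndexName:
    """Split a name into its three parts; unambiguous because parts contain no '-'."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"not a {{type}}-{{namespace}}-{{dataset}} name: {name!r}")
    return IndexName(*parts)


def select_indices(indices: List[str], type_: str = "*",
                   namespace: str = "*", dataset: str = "*") -> List[str]:
    """Return the names matching the given parts; '*' matches any value."""
    pattern = f"{type_}-{namespace}-{dataset}"
    return [name for name in indices if fnmatchcase(name, pattern)]


# Hypothetical index names that follow the convention.
indices = ["logs-default-nginx", "metrics-prod-system", "logs-prod-generic"]
print(select_indices(indices, type_="logs"))
print(select_indices(indices, namespace="prod"))
print(parse_index_name("logs-prod-nginx"))
```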
|
||
|
||
=== Templates & ILM Policies | ||
|
||
To make the above strategy possible, alias templates are required. For each type there is a basic alias template with a default ILM policy. These default templates apply to all indices which follow the indexing strategy and do not have a more specific dataset alias template. | ||
|
||
The `metrics` and `logs` alias templates contain all the basic fields from ECS. | ||
|
||
Each type template contains an ILM policy. Modifying this default ILM policy will affect all data covered by the default templates. | ||
I'm curious, if you have an ILM policy for

We are currently working on a new concept called "Alias Templates" (still under discussion). This will allow composing different templates together and will not have the magical inheritance which caused issues in the past. If there is a template for
||
|
||
=== Defaults | ||
|
||
If the Elastic Agent is used to ingest data and only the type is specified, the namespace defaults to `default` and the dataset to `generic`. |
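The defaulting rule can be sketched as follows. The function name `resolve_index_name` and the two constants are hypothetical and only illustrate the behaviour described above; they are not the Elastic Agent's actual implementation.

```python
from typing import Optional

# Defaults taken from the text above; the names are illustrative.
DEFAULT_NAMESPACE = "default"
DEFAULT_DATASET = "generic"


def resolve_index_name(type_: str,
                       namespace: Optional[str] = None,
                       dataset: Optional[str] = None) -> str:
    """Fill in the defaults for any part that was not specified."""
    return f"{type_}-{namespace or DEFAULT_NAMESPACE}-{dataset or DEFAULT_DATASET}"


print(resolve_index_name("logs"))
print(resolve_index_name("metrics", dataset="system"))
```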
I wonder how we decide when to create a new `{type}` and when to reuse `logs`. There's a large number of events that feel like logs, but are not logs. Like alerts, network flows, process executions, packetbeat transactions, etc. ECS's `event.kind` currently lists 6 values. Maybe there's an opportunity to reuse some of that?

Good question. So far the focus was much more on which "types" we start with, and with filebeat and metricbeat the two most obvious were logs and metrics. In general I would like to keep the types to a minimum, as Elasticsearch will have to ship with a basic template for these types to make sure everything works out of the box.

The definition of the index should rely on ECS fields. It would have been nice to use `{event.type}-{event.namespace}-{event.dataset}`. Unfortunately `event.type` already has a different meaning. It seems what `event.type` is should actually be `event.category.type` ;-)

For `event.kind`: I think it is close but not the same. The value that surprised me at first is `pipeline_error`, as having an error in a pipeline still means the event belongs to the same dataset. It seems events from 1 log file can have different `event.kind` values.

I think there is also an opportunity for some apps to create their own type, like using `signal-*-*` for SIEM as an example. As far as I understand, this index would only be created when SIEM is used for the first time and could be set up by the SIEM app?

Getting back to the initial question: How do we decide on the types? Let's turn the question around: What are the types you would need?

Side note: Thinking about which fields this could fit in: `stream.type`, `event.namespace`, `event.dataset`. The reason for the `stream.*` is that it is what we currently plan to use as the config on the agent side.