[RFC] Wildcard - stage 2 proposal #970

Merged on Oct 2, 2020 (30 commits)
Changes from 27 commits
fb0c6a2
stage two updates
ebeahan Sep 17, 2020
da4aa73
adding wildcard schema files
ebeahan Sep 17, 2020
00eae5b
add link for stage 2 PR
ebeahan Sep 17, 2020
e16ed89
fix links to schema files
ebeahan Sep 17, 2020
da71db0
rephrasing
ebeahan Sep 17, 2020
e5ea69f
refactor table for better readability
ebeahan Sep 30, 2020
e004af7
Adjust index globs in query examples
Oct 1, 2020
ab338c8
Migrate same fields for client/server as for source/destination
Oct 1, 2020
fcbdb87
Don't migrate agent.name but migrate agent.build.original
Oct 1, 2020
6ac5ecd
migrate error.type
Oct 1, 2020
5e5f443
Migrate event.original
Oct 1, 2020
145721b
Boldly migrate geo.name. I'm sure some folk capture semi structured c…
Oct 1, 2020
4799175
Migrate only host.hostname in the host field set
Oct 1, 2020
f51ecf8
Don't migrate user.domain
Oct 1, 2020
48e656c
Migrate log.logger and log.file.path
Oct 1, 2020
c231e25
Migrate organization.name and its cousin that lives in as.*
Oct 1, 2020
33b4caa
Migrate the certificate issuer and subject fields
Oct 1, 2020
b984f93
Migrate registry.data.strings
Oct 1, 2020
05cad53
Migrate dns.question.name and dns.answers.data
Oct 1, 2020
721d43e
Migrate url.path
Oct 1, 2020
43f6c76
Adjust the table accordingly
Oct 1, 2020
e94e8a5
Add a section about migrating text fields to wildcard
Oct 1, 2020
df4f974
fix typo
ebeahan Oct 2, 2020
7a4a3e9
add pe.original_file_name
ebeahan Oct 2, 2020
560a080
adding request.referrer
ebeahan Oct 2, 2020
2ccc1bd
final to file
ebeahan Oct 2, 2020
34694ac
continue not indexing event.original
ebeahan Oct 2, 2020
688efd9
Update rfcs/text/0001/pe.yml
ebeahan Oct 2, 2020
497fb50
typo
ebeahan Oct 2, 2020
cae4d8b
setting advancement date
ebeahan Oct 2, 2020
212 changes: 192 additions & 20 deletions rfcs/text/0001-wildcard-data-type.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# 0001: Wildcard Field Adoption into ECS
<!--^ The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC, taking care not to conflict with other RFCs.-->

- Stage: **1 (proposal)** <!-- Update to reflect target stage -->
- Stage: **2 (draft)** <!-- Update to reflect target stage -->
- Date: **TBD** <!-- Update to reflect date of most recent stage advancement -->

Wildcard is a data type for Elasticsearch string fields being introduced in Elasticsearch 7.9. Wildcard optimizes performance for queries using wildcards (`*`) and regex, allowing users to perform `grep`-like searches without the limitations of the existing
@@ -10,28 +10,46 @@ text[0] and keyword[1] types.
## Fields

<!--
Stage: 1: Describe at a high level how this change affects fields. Which fieldsets will be impacted? How many fields overall? Are we primarily adding fields, removing fields, or changing existing fields? The goal here is to understand the fundamental technical implications and likely extent of these changes. ~2-5 sentences.
Stage 2: Include new or updated yml field definitions for all of the essential fields in this draft. While not exhaustive, the fields documented here should be comprehensive enough to deeply evaluate the technical considerations of this change. The goal here is to validate the technical details for all essential fields and to provide a basis for adding experimental field definitions to the schema. Use GitHub code blocks with yml syntax formatting.
-->

For a field to use wildcard, it will require changing the field's defined schema `type` from `keyword` to `wildcard`. The following fieldsets are expected to adopt `wildcard` in at least one of their fields:

* `agent.*`
* `destination.*`
* `error.*`
* `file.*`
* `host.*`
* `http.*`
* `os.*`
* `process.*`
* `registry.*`
* `source.*`
* `url.*`
* `user.*`
* `user_agent.*`
### Identified Wildcard Fields

For a field to use wildcard, it will require changing the field's defined schema `type` from `keyword` to `wildcard`. The following fields are candidates for `wildcard`:

| Field Set | Field(s) |
| --------- | -------- |
| [`agent`](0001/agent.yml) | `agent.build.original` |
| [`as`](0001/as.yml) | `as.organization.name` |
| [`client`](0001/client.yml) | `client.domain`<br> `client.registered_domain` |
| [`destination`](0001/destination.yml) | `destination.domain`<br> `destination.registered_domain` |
| [`dns`](0001/dns.yml) | `dns.question.name`<br> `dns.answers.data` |
| [`error`](0001/error.yml) | `error.stack_trace`<br> `error.type` |
| [`event`](0001/event.yml) | `event.original` |
| [`file`](0001/file.yml) | `file.directory`<br> `file.path`<br> `file.target_path` |
| [`geo`](0001/geo.yml) | `geo.name` |
| [`host`](0001/host.yml) | `host.hostname` |
| [`http`](0001/http.yml) | `http.request.referrer`<br> `http.request.body.content`<br> `http.response.body.content` |
| [`log`](0001/log.yml) | `log.file.path`<br> `log.logger` |
| [`organization`](0001/organization.yml) | `organization.name` |
| [`os`](0001/os.yml) | `os.name`<br> `os.full` |
| [`pe`](0001/pe.yml) | `pe.original_file_name` |
| [`process`](0001/process.yml) | `process.command_line`<br> `process.executable`<br> `process.name`<br> `process.title`<br> `process.working_directory` |
| [`registry`](0001/registry.yml) | `registry.key`<br> `registry.path`<br> `registry.data.strings` |
| [`server`](0001/server.yml) | `server.domain`<br> `server.registered_domain` |
| [`source`](0001/source.yml) | `source.domain`<br> `source.registered_domain` |
| [`tls`](0001/tls.yml) | `tls.client.issuer`<br> `tls.client.subject`<br> `tls.server.issuer`<br> `tls.server.subject` |
| [`url`](0001/url.yml) | `url.full`<br> `url.original`<br> `url.path`<br> `url.domain`<br> `url.registered_domain` |
| [`user`](0001/user.yml) | `user.name`<br> `user.full_name`<br> `user.email`<br> `user.domain` |
| [`user_agent`](0001/user_agent.yml) | `user_agent.original` |
| [`x509`](0001/x509.yml) | `x509.issuer.distinguished_name`<br> `x509.subject.distinguished_name` |

The full set of schema files which will be transitioning to `wildcard` are located in directory [rfcs/text/0001/](0001/).
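
As a rough illustration of the type change described above, the following Python sketch rewrites a field definition from `keyword` to `wildcard`. The function name `migrate_to_wildcard`, the dict layout, and the sample definition are assumptions made for this example and are not part of the ECS tooling. Dropping `ignore_above` mirrors the initial-adoption decision noted under Concerns.

```python
from copy import deepcopy


def migrate_to_wildcard(field_def):
    """Return a copy of a field definition with its type switched to wildcard.

    Only `keyword` fields are converted; anything else is returned unchanged.
    """
    migrated = deepcopy(field_def)
    if migrated.get("type") == "keyword":
        migrated["type"] = "wildcard"
        # Assumption: wildcard fields carry no ignore_above in the initial adoption.
        migrated.pop("ignore_above", None)
    return migrated


# Hypothetical keyword definition, loosely modeled on process.command_line.
command_line = {"name": "command_line", "type": "keyword", "ignore_above": 1024}
print(migrate_to_wildcard(command_line))
```

The helper copies its input, so the original definition stays available for comparison or rollback.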

### Example definition

Here's an example of applying this change to the `process.command_line` field:

**Definition as of ECS 1.5.0**
**Definition as of ECS 1.6.0**

Schema definition:

@@ -232,9 +250,11 @@ Additional cases for wildcard searching against command line executions:
## Source data

<!--
Stage 1: Provide a high-level description of example sources of data. This does not yet need to be a concrete example of a source document, but instead can simply describe a potential source (e.g. nginx access log). This will ultimately be fleshed out to include literal source examples in a future stage. The goal here is to identify practical sources for these fields in the real world. ~1-3 sentences or unordered list.
Stage 2: Include a real world example source document. Ideally this example comes from the source(s) identified in stage 1. If not, it should replace them. The goal here is to validate the utility of these field changes in the context of a real world example. Format with the source name as a ### header and the example document in a GitHub code block with json formatting.
-->

### Categories

* Windows events
* Sysmon events
* Powershell events
@@ -244,6 +264,138 @@ Stage 1: Provide a high-level description of example sources of data. This does
* Endpoint agents
* Application stack traces

### Real world examples

Each example in this section contains a partial index mapping, a partial event, and one wildcard search query. Each query example uses a leading wildcard on a field expected to have high cardinality, where `wildcard` performs far better than `keyword`.

**Windows registry event from sysmon:**

```
### Mapping (partial)
...
"registry" : {
  "properties" : {
    "key" : {
      "type" : "wildcard"
    }
  }
}
...

### Event (partial)
...
"registry": {
  "path": "HKU\\S-1-5-21-1957236100-58272097-297103362-500\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\HideFileExt",
  "hive": "HKU",
  "key": "S-1-5-21-1957236100-58272097-297103362-500\\Software\\Microsoft\\Windows\\CurrentVersion\\Explorer\\Advanced\\HideFileExt",
  "value": "HideFileExt",
  "data": {
    "strings": [
      "1"
    ],
    "type": "SZ_DWORD"
  }
}
...

### Query

GET winlogbeat-*/_search
{
  "query": {
    "wildcard": {
      "registry.key": {
        "value": "*CurrentVersion*"
      }
    }
  }
}
```

**Windows Powershell logging event:**

```
### Mapping (partial)
...
"process" : {
  "properties" : {
    "command_line" : {
      "type" : "wildcard",
      "fields" : {
        "text" : {
          "type" : "text",
          "norms" : false
        }
      }
    }
  }
}
...

### Event (partial)

"process": {
  "pid": 3540,
  ...
  "command_line": "C:\\Windows\\System32\\svchost.exe -k netsvcs -p -s NetSetupSvc"
}

### Query

GET winlogbeat-*/_search
{
  "_source": false,
  "query": {
    "wildcard": {
      "process.command_line": {
        "value": "*-k netsvcs -p*"
      }
    }
  }
}
```

**Wildcard query against original URL from a squid web proxy event:**

```
### Mapping (partial)

...
"url" : {
  "properties" : {
    "original" : {
      "type" : "wildcard",
      "fields" : {
        "text" : {
          "type" : "text",
          "norms" : false
        }
      }
    }
  }
}
...

### Event (partial)

...
"url": {
  "original": "http://example.com/cart.do?action=view&itemId=HolyGouda",
  "domain": "example.com"
}
...

### Query

GET filebeat-*/_search
{
  "_source": false,
  "query": {
    "wildcard": {
      "url.original": {
        "value": "*action=view*Gouda"
      }
    }
  }
}
```
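
To make the matching semantics of the queries above easier to reason about, here is a small Python sketch that emulates a wildcard pattern locally and builds the same request body. The helper names (`wildcard_to_regex`, `wildcard_match`, `wildcard_query`) are hypothetical. The emulation covers only `*` and `?`, is case-sensitive, ignores backslash escaping, and says nothing about how Elasticsearch evaluates the query internally. It assumes the pattern has to cover the whole field value, which is why the examples use leading and trailing `*`.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern (`*` and `?` only) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")  # any run of characters, including none
        elif ch == "?":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    # DOTALL lets `*` span newlines in multi-line values.
    return re.compile("".join(parts), re.DOTALL)


def wildcard_match(pattern, value):
    """True if the pattern covers the entire value."""
    return wildcard_to_regex(pattern).fullmatch(value) is not None


def wildcard_query(field, pattern):
    """Build a request body shaped like the query examples above."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


url = "http://example.com/cart.do?action=view&itemId=HolyGouda"
print(wildcard_match("*action=view*Gouda", url))  # True
print(wildcard_match("action=view*Gouda", url))   # False: no leading wildcard
print(wildcard_query("url.original", "*action=view*Gouda"))
```

A local check like this can help when drafting patterns, but results should still be confirmed against a real index.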

## Scope of impact

@@ -270,7 +422,7 @@ ECS is and will remain an open source licensed project. However, there will be f
## Concerns

<!--
Stage 1: Identify potential concerns, implementation challenges, or complexity. Spend some time on this. Play devil's advocate. Try to identify the sort of non-obvious challenges that tend to surface later. The goal here is to surface risks early, allow everyone the time to work through them, and ultimately document resolution for posterity's sake.
Stage 2: Document new concerns or resolutions to previously listed concerns. It's not critical that all concerns have resolutions at this point, but it would be helpful if resolutions were taking shape for the most significant concerns.
-->

### Wildcard and case-insensitivity
@@ -287,6 +439,8 @@ Performance and storage characteristics between wildcard and keyword will be dif

ECS applies the `ignore_above` setting to keyword fields to prevent strings longer than 1024 characters from being indexed or stored. While `ignore_above` can be raised, Lucene implements a term byte-length limit of 32766, which cannot be adjusted. Wildcard supports an unlimited max character size for a field value. The `wildcard` field type will still have the `ignore_above` option available, and a reasonable limit may need to be applied to mitigate unexpected side effects.

For the initial adoption into ECS, `wildcard` fields will not have an `ignore_above` option defined.
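
The two limits discussed above can be sketched in Python as follows. The constants come from the text, while the function names are hypothetical. Note one assumption: Python's `len` counts Unicode code points, which may not match exactly how Elasticsearch counts characters for `ignore_above`.

```python
LUCENE_MAX_TERM_BYTES = 32766  # Lucene term byte-length limit cited above
ECS_IGNORE_ABOVE = 1024        # ignore_above applied by ECS to keyword fields


def keyword_is_indexed(value, ignore_above=ECS_IGNORE_ABOVE):
    """True if a keyword field with this ignore_above would index the value."""
    return len(value) <= ignore_above


def exceeds_lucene_term_limit(value):
    """True if the UTF-8 encoding is longer than Lucene's term byte limit."""
    return len(value.encode("utf-8")) > LUCENE_MAX_TERM_BYTES


# Hypothetical long command line, such as an encoded PowerShell payload.
long_command = "powershell.exe -enc " + "A" * 2000
print(keyword_is_indexed(long_command))         # False: longer than 1024 characters
print(exceeds_lucene_term_limit(long_command))  # False: well under 32766 bytes
```

Values such as `long_command` are the ones a `keyword` field would silently skip, and they are the main reason long, high-cardinality fields are candidates for `wildcard`.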

### Licensing

Until now ECS has relied only on OSS licensed features, but ECS will also support Elastic licensed features. The ECS project will remain OSS licensed with the schema implementing Elastic licensed features as part of the specification. When ECS adopts a feature available only under a license, it will be noted in the documentation. ECS plans to provide tooling options which continue to support OSS consumers of ECS and the Elastic Stack.
@@ -295,6 +449,23 @@ Until now ECS has relied only on OSS licensed features, but ECS will also suppor

A data shipper which uses the `wildcard` field type may need to verify that the configured output Elasticsearch destination can support it (>= 7.9.0). For example, if a future version of Beats adopts `wildcard` in index mappings, Beats may need to gracefully handle a scenario where the targeted Elasticsearch instance doesn't support the data type.
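
A version gate of the kind described above might look like the following Python sketch. The function names are hypothetical, and falling back to `keyword` is only one possible way to handle an older cluster; it is not a description of what Beats actually does.

```python
import re

MIN_WILDCARD_VERSION = (7, 9, 0)  # first Elasticsearch release with wildcard


def parse_version(version):
    """Parse 'major.minor.patch'; suffixes such as '-SNAPSHOT' are ignored."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_wildcard(version):
    """Compare numerically, so that 7.10.0 sorts after 7.9.0."""
    return parse_version(version) >= MIN_WILDCARD_VERSION


def field_type_for(version):
    """Assumed fallback: use keyword when the cluster predates wildcard."""
    return "wildcard" if supports_wildcard(version) else "keyword"


print(field_type_for("7.8.1"))   # keyword
print(field_type_for("7.10.0"))  # wildcard
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating `7.10.0` as older than `7.9.0`.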

### Text fields migrating to wildcard

ECS currently has two `text` fields that would likely benefit from migrating to `wildcard`.
Doing so on the canonical field (as opposed to adding a multi-field) would be a breaking change.
However, adding a `.wildcard` multi-field may cause confusion, as they would be the only
places where `wildcard` appears as a multi-field.

The fields are:

- `message`
- `error.message`

Paradoxically, in some cases they also benefit from the `text` data type.
A prime example is the main message of Windows Event Logs, which is stored in the `message` field.

The situation is captured here for addressing at a later stage.
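
For illustration only, the two options weighed above could be written as the following mapping fragments, expressed here as Python dicts. The variable names are hypothetical, and the `text` settings shown for `message` are an assumption about its current definition; neither option is proposed by this RFC.

```python
import json

# Option 1 (breaking): change the canonical field's type.
message_as_wildcard = {"message": {"type": "wildcard"}}

# Option 2 (non-breaking): keep text and add a wildcard multi-field,
# which would be queried as `message.wildcard`.
message_with_multi_field = {
    "message": {
        "type": "text",
        "norms": False,
        "fields": {"wildcard": {"type": "wildcard"}},
    }
}

print(json.dumps({"properties": message_with_multi_field}, indent=2))
```

Option 2 keeps full-text search on `message` working, at the cost of introducing the only `.wildcard` multi-field in the schema.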

## People

The following are the people that consulted on the contents of this RFC.
@@ -326,3 +497,4 @@ The following are the people that consulted on the contents of this RFC.

* Stage 0: https://github.com/elastic/ecs/pull/890
* Stage 1: https://github.com/elastic/ecs/pull/904
* Stage 2: https://github.com/elastic/ecs/pull/970
5 changes: 5 additions & 0 deletions rfcs/text/0001/agent.yml
@@ -0,0 +1,5 @@
---
- name: agent
  fields:
    - name: build.original
      type: wildcard
5 changes: 5 additions & 0 deletions rfcs/text/0001/as.yml
@@ -0,0 +1,5 @@
---
- name: as
  fields:
    - name: organization.name
      type: wildcard
7 changes: 7 additions & 0 deletions rfcs/text/0001/client.yml
@@ -0,0 +1,7 @@
---
- name: client
  fields:
    - name: domain
      type: wildcard
    - name: registered_domain
      type: wildcard
7 changes: 7 additions & 0 deletions rfcs/text/0001/destination.yml
@@ -0,0 +1,7 @@
---
- name: destination
  fields:
    - name: domain
      type: wildcard
    - name: registered_domain
      type: wildcard
7 changes: 7 additions & 0 deletions rfcs/text/0001/dns.yml
@@ -0,0 +1,7 @@
---
- name: dns
  fields:
    - name: question.name
      type: wildcard
    - name: answers.data
      type: wildcard
9 changes: 9 additions & 0 deletions rfcs/text/0001/error.yml
@@ -0,0 +1,9 @@
---
- name: error
  fields:
    - name: stack_trace
      index: true
      type: wildcard

    - name: type
      type: wildcard
5 changes: 5 additions & 0 deletions rfcs/text/0001/event.yml
@@ -0,0 +1,5 @@
---
- name: event
  fields:
    - name: original
      type: wildcard
9 changes: 9 additions & 0 deletions rfcs/text/0001/file.yml
@@ -0,0 +1,9 @@
---
- name: file
  fields:
    - name: directory
      type: wildcard
    - name: path
      type: wildcard
    - name: target_path
      type: wildcard
5 changes: 5 additions & 0 deletions rfcs/text/0001/geo.yml
@@ -0,0 +1,5 @@
---
- name: geo
  fields:
    - name: name
      type: wildcard
4 changes: 4 additions & 0 deletions rfcs/text/0001/host.yml
@@ -0,0 +1,4 @@
- name: host
  fields:
    - name: hostname
      type: wildcard
9 changes: 9 additions & 0 deletions rfcs/text/0001/http.yml
@@ -0,0 +1,9 @@
---
- name: http
  fields:
    - name: request.body.content
      type: wildcard
    - name: request.referrer
      type: wildcard
    - name: response.body.content
      type: wildcard
7 changes: 7 additions & 0 deletions rfcs/text/0001/log.yml
@@ -0,0 +1,7 @@
---
- name: log
  fields:
    - name: file.path
      type: wildcard
    - name: logger
      type: wildcard
5 changes: 5 additions & 0 deletions rfcs/text/0001/organization.yml
@@ -0,0 +1,5 @@
---
- name: organization
  fields:
    - name: name
      type: wildcard
7 changes: 7 additions & 0 deletions rfcs/text/0001/os.yml
@@ -0,0 +1,7 @@
---
- name: os
  fields:
    - name: name
      type: wildcard
    - name: full
      type: wildcard
5 changes: 5 additions & 0 deletions rfcs/text/0001/pe.yml
@@ -0,0 +1,5 @@
---
- name: pe
  fields:
    - name: original_file_name
      type: wildcard
13 changes: 13 additions & 0 deletions rfcs/text/0001/process.yml
@@ -0,0 +1,13 @@
---
- name: process

Review comment (Contributor): Do these apply to process.parent as well?

Reply (Contributor): Yes, now that process.parent is managed by the field reuse mechanism, this will indeed apply to it :-)

  fields:
    - name: command_line
      type: wildcard
    - name: executable
      type: wildcard
    - name: name
      type: wildcard
    - name: title
      type: wildcard
    - name: working_directory
      type: wildcard
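
The review exchange on this file notes that `process.parent` picks up these definitions through field reuse. A much simplified Python sketch of that idea follows; the helper `expand_reuse` is hypothetical and only lists the resulting flat field names, whereas the real ECS reuse mechanism is configured in the schema files and does more than this.

```python
# Fields defined in rfcs/text/0001/process.yml.
PROCESS_FIELDS = ["command_line", "executable", "name", "title", "working_directory"]


def expand_reuse(fieldset, fields, reused_at):
    """List flat field names for a field set and each location it is reused at."""
    prefixes = [fieldset] + list(reused_at)
    return [f"{prefix}.{field}" for prefix in prefixes for field in fields]


names = expand_reuse("process", PROCESS_FIELDS, reused_at=["process.parent"])
for name in names:
    print(name, "-> wildcard")
```

Under this simplification, a type change made once in the `process` field set shows up under both `process.*` and `process.parent.*`.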