Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce the dissect library #32297

Merged
merged 12 commits into from
Aug 15, 2018
Merged

Conversation

jakelandis
Copy link
Contributor

The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.


Note to reviewer(s) -

  • The the ingest node processor that consumes this will be submitted soon.
  • I have not performed performance testing/optimization on this yet, however, I am pretty confident that as-is it will be faster then Grok, but will get some numbers to back that claim.
  • Functionality is based on the Logstash dissect plugin, however (mostly) due to tight coupling with Logstash specific code this implementation does not share any code with Logstash.
  • The Logstash unit tests have been manually ported here as 'testLogstashSpecs'. The tests most generally passed without any modification, however there are some edge case differences between the implementations that may require further discussion. The test has comments to the differences.

cc: @guyboertje @ph

The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.
@jakelandis jakelandis added >enhancement :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP v7.0.0 v6.5.0 labels Jul 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@jakelandis
Copy link
Contributor Author

Changing this PR to a WIP. Running some micro benchmarks shows some potential performance issue in the code, in particular with the match validation (intersection of two lists) and post processing (grouping and sorting).

@jakelandis jakelandis changed the title Introduce the dissect library WIP: Introduce the dissect library Jul 24, 2018
@jakelandis
Copy link
Contributor Author

I have addressed the performance bottleneck and updated this PR with the 7.0 Dissect behavior (Beats and Logstash will make corresponding changes to support the same specification for 7.0). I will need to create a small fork of this for the 6.x version to better match Beats and Logstash 6.x behavior.

I ran some micro benchmarks comparing this implementation vs. (ingest node) grok. For parsing a basic syslog and apache log line, this is ~ 3x-5x faster. The score is number of times parsed per second (higher is better). Full details here: https://gist.github.com/jakelandis/663306337e370df9c8c0cc6d5a1f9104

# Run complete. Total time: 00:22:02

Benchmark                           Mode  Cnt       Score      Error  Units
DissectBenchmark.dissectApacheLog  thrpt   10  423663.010 ± 2727.341  ops/s
DissectBenchmark.dissectSysLog     thrpt   10  862173.761 ± 8636.887  ops/s
DissectBenchmark.grokApacheLog     thrpt   10   91178.469 ± 2172.043  ops/s
DissectBenchmark.grokSysLog        thrpt   10  300408.741 ± 3586.303  ops/s

I am not sure if this a meaningful difference in the context of the ingest node processing, and will run similar macro benchmarks against ingest node pipelines (on the Dissect Processor PR).

@jakelandis jakelandis changed the title WIP: Introduce the dissect library Introduce the dissect library Jul 26, 2018
@jakelandis
Copy link
Contributor Author

Removed the WIP. This is ready for review.

@guyboertje - can you please validate the main logic (there is a test that mirrors logstash specs)
@rjernst - could you also review

@jakelandis
Copy link
Contributor Author

jakelandis commented Jul 31, 2018

I removed the 6.5.0 label from this PR, since the 6.5.0 will require a fork of this PR that implements slightly different rules. After speaking with @ph (beats) and @guyboertje (logstash) the 7.x goal is for stricter compliance to a newly formalized dissect specification. The 6.x goal is for approximate like for like behavior between the 3 products.

This means that the 7.x will be a breaking change to the 6.x, hence the breaking tag.

Specifically, the breaking change will be:

  • disambiguate the meaning of ? - 6.x it can be either a named skip key or the key of the reference (indirect) key/value pair (? paired with an &)
    • ? has been replaced by * for key/value reference pairing (formerly ? and & pairs, now * and & pairs).
    • key/value reference pairings * and & (formerly ? and &) are required to come in pairs.

EDIT: After further discussion, this will NOT be a breaking change and will be backported as-is to 6.x. IGNORE THE ABOVE COMMENT.

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks ok from the build side. One general comment is I see a lot of package private methods. If they need to be package private for tests that is fine, but please note this with a comment like // pkg private for tests or something like that. Otherwise, please use the least visibility possible.

@jakelandis
Copy link
Contributor Author

@rjernst - thanks , will add package private comments (or reduce scope if possible)

@guyboertje - could you please review ?

@jakelandis jakelandis requested a review from guyboertje August 8, 2018 13:17
Copy link

@guyboertje guyboertje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jakelandis
Copy link
Contributor Author

@guyboertje @rjernst thanks!

reduced scope where possible in 9e43ba8

will merge when CI is green.

@jakelandis
Copy link
Contributor Author

Jenkins test this please

@jakelandis jakelandis merged commit be62092 into elastic:master Aug 15, 2018
jasontedor added a commit that referenced this pull request Aug 15, 2018
* elastic/master:
  Revert "cluster formation DSL - Gradle integration -  part 2 (#32028)" (#32876)
  cluster formation DSL - Gradle integration -  part 2 (#32028)
  Introduce global checkpoint listeners (#32696)
  Move connection profile into connection manager (#32858)
  [ML] Temporarily disabling rolling-upgrade tests
  Use generic AcknowledgedResponse instead of extended classes (#32859)
  [ML] Removing old per-partition normalization code (#32816)
  Use JDK 10 for 6.4 BWC builds (#32866)
  Removed flaky test. Looks like randomisation makes these assertions unreliable.
  [test] mute IndexShardTests.testDocStats
  Introduce the dissect library (#32297)
  Security: remove password hash bootstrap check (#32440)
  Move validation to server for put user requests (#32471)
  [ML] Add high level REST client docs for ML put job endpoint (#32843)
  Test: Fix forbidden uses in test framework (#32824)
  Painless: Change fqn_only to no_import (#32817)
  [test] mute testSearchWithSignificantTermsAgg
  Watcher: Remove unused hipchat render method (#32211)
  Watcher: Remove extraneous auth classes (#32300)
  Watcher: migrate PagerDuty v1 events API to v2 API (#32285)
@jakelandis
Copy link
Contributor Author

Removing the breaking label ... after further discussion it preferred to have a minor difference between ingest/Logstash/Beats in the 6.x instead of a breaking change in ES.

I will backport this PR as-is to 6.x (ignore any comments above about breaking changes).

jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Aug 28, 2018
The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.
jakelandis added a commit to jakelandis/elasticsearch that referenced this pull request Sep 5, 2018
The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.
jakelandis added a commit that referenced this pull request Sep 5, 2018
* Introduce the dissect library (#32297)

The dissect library will be used for the ingest node as an alternative
to Grok to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the string.
Note - Regular expressions are used during construction of the
objects, but not in the hot path.

A dissect pattern takes the form of: '%{a} %{b},%{c}' which is
composed of 3 keys (a,b,c) and two delimiters (space and comma).
This dissect pattern will match a string of the form: 'foo bar,baz'
and will result a key/value pairing of 'a=foo, b=bar, and c=baz'.
See the comments in DissectParser for a full explanation.

This commit does not include the ingest node processor that will consume
it. However, the consumption should be a trivial mapping between the
key/value pairing returned by the parser and the key/value pairing
needed for the IngestDocument.

* ingest: Introduce the dissect processor (#32884)

The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.

Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.

* ingest: minor - update test to include dissect (#33211)

This change also includes placing the bytes processor in the correct
order (helps to avoid merge conflict when back patching processors)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP >enhancement v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants