
add dot_expand processor #20489

Closed
wants to merge 8 commits

Conversation

@graphaelli (Member) commented Aug 6, 2020

What does this PR do?

Adds a dot_expand processor utilizing a new Expand method on MapStr - the reverse of Flatten.

Given an input log line {"foo.bar": 1}, an event with expanded fields will be published:

{
  "@timestamp": "2020-08-06T22:10:36.197Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.0.0"
  },
  "ecs": {
    "version": "1.5.0"
  },
  "host": {
    "os": {
      "name": "Mac OS X",
      "kernel": "19.4.0",
      "build": "19E287",
      "platform": "darwin",
      "version": "10.15.4",
      "family": "darwin"
    },
    ...
  },
  "foo": {
    "bar": 1
  }
}
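For illustration, a minimal sketch of the "reverse of Flatten" idea using plain Go maps (not the PR's MapStr implementation; conflicting keys are ignored here):

package main

import (
    "fmt"
    "strings"
)

// expand splits dotted keys and nests their values, e.g.
// {"foo.bar": 1} becomes {"foo": {"bar": 1}}.
func expand(flat map[string]interface{}) map[string]interface{} {
    out := map[string]interface{}{}
    for key, value := range flat {
        parts := strings.Split(key, ".")
        node := out
        for _, part := range parts[:len(parts)-1] {
            child, ok := node[part].(map[string]interface{})
            if !ok {
                child = map[string]interface{}{}
                node[part] = child
            }
            node = child
        }
        node[parts[len(parts)-1]] = value
    }
    return out
}

func main() {
    fmt.Println(expand(map[string]interface{}{"foo.bar": 1})) // map[foo:map[bar:1]]
}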

Why is it important?

See #17021

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Performance of MapStr.Expand
  • Overlapping/duplicated key expectations
  • Additional edge cases

How to test this PR locally

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true}]' -E processors='[{dot_expand:{}}]' -d '*'

Paste some JSON in and see it expanded in the output document.

Related issues

Closes #17021

@botelastic added the needs_team label Aug 6, 2020
@graphaelli added the Team:Services (Deprecated) label Aug 6, 2020
@botelastic removed the needs_team label Aug 6, 2020
@elasticmachine (Collaborator) commented Aug 7, 2020

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #20489 updated

  • Start Time: 2020-11-26T03:00:31.105+0000

  • Duration: 71 min 26 sec

Test stats 🧪: 0 failed, 16702 passed, 1369 skipped, 18071 total

💚 Flaky test report

Tests succeeded.


Review comment:
Yay, more methods on common.MapStr. The MapStr interface is somewhat inconsistent, but since Expand looks to be consistent with 'Flatten', I think the interface is OK as is. By the way, nice trick using 'Put' :)

As you may have noticed during testing, this approach is non-deterministic. Whenever Go creates a map, it generates a random 'seed' value to be used when hashing keys, so keys are ordered randomly. This can actually lead to conflicts when expanding, e.g. if we have a.b=1 and a.b.c=2: is a.b now a primitive or an object? In order to establish some determinism you need to create a sorted array of keys first (sorting based on length would be good enough).

Not sure where, but I think somewhere we solved the conflict by renaming a.b=1 to a.b.value=1.
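For illustration, a sketch of that suggestion using plain Go maps (the shortest-first ordering and the 'value' rename follow this comment; the helper itself is hypothetical, not the PR's code):

package main

import (
    "fmt"
    "sort"
    "strings"
)

// expandDeterministic visits keys shortest-first so conflicts such as
// "a.b" vs "a.b.c" always resolve the same way, independent of Go's
// random map iteration order. When a scalar already occupies a prefix,
// it is moved under a "value" key.
func expandDeterministic(flat map[string]interface{}) map[string]interface{} {
    keys := make([]string, 0, len(flat))
    for k := range flat {
        keys = append(keys, k)
    }
    sort.Slice(keys, func(i, j int) bool {
        if len(keys[i]) != len(keys[j]) {
            return len(keys[i]) < len(keys[j])
        }
        return keys[i] < keys[j]
    })

    out := map[string]interface{}{}
    for _, key := range keys {
        parts := strings.Split(key, ".")
        node := out
        for _, part := range parts[:len(parts)-1] {
            child, ok := node[part].(map[string]interface{})
            if !ok {
                child = map[string]interface{}{}
                if old, exists := node[part]; exists {
                    child["value"] = old // scalar prefix becomes {"value": old}
                }
                node[part] = child
            }
            node = child
        }
        node[parts[len(parts)-1]] = flat[key]
    }
    return out
}

func main() {
    m := map[string]interface{}{"a.b": 1, "a.b.c": 2}
    fmt.Println(expandDeterministic(m)) // map[a:map[b:map[c:2 value:1]]]
}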

Review comment on the tests:
Expand and Flatten should satisfy the properties flatten(expand(m)) == flatten(m) and expand(flatten(m)) == expand(m). I wonder if this is always the case.
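For illustration, a property test along these lines might look like the following sketch. The Expand method and its (MapStr, error) signature are assumptions based on this PR's test code; Flatten already exists on common.MapStr.

package common_test

import (
    "testing"

    "github.com/elastic/beats/v7/libbeat/common"
    "github.com/stretchr/testify/assert"
)

// Round-trip sketch: flatten(expand(m)) should equal flatten(m) for
// conflict-free inputs. Expand is the method proposed in this PR,
// with an assumed (MapStr, error) signature.
func TestExpandFlattenRoundTrip(t *testing.T) {
    m := common.MapStr{
        "foo.bar": 1,
        "baz":     common.MapStr{"qux": 2},
    }

    expanded, err := m.Expand()
    assert.NoError(t, err)
    assert.Equal(t, m.Flatten(), expanded.Flatten())
}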

@urso left a comment

Overall the approach looks good. We're missing docs, though.

It is very important that 'Expand' be deterministic. If users report errors, we must be able to explain why things failed.

@graphaelli (Member, Author)

Thanks very much for reviewing. Now that you've confirmed the approach is reasonable, we can update Expand to fix up https://github.com/elastic/beats/pull/20489/files#diff-99175365e34c2bfb5cfc3ed3d777f16bR875-R889, resolving the consistency concerns we share - I expect that can also address performance.

@felixbarny (Member)

Looks great! What's missing before it's ready for review?

@graphaelli (Member, Author)

This needs docs, probably an experimental flag in case we want to change the default behavior or only apply this to specific fields or fail_on_error behavior, and probably at least an attempt at improving performance - I made no effort on that front.

botelastic bot commented Oct 4, 2020

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough 👍.
Thank you for your contribution!

@felixbarny (Member)

We might also consider an ES ingest node processor that can expand dots if users want to send the logs directly to Elasticsearch, or if they want to offload that potentially CPU-expensive task from production servers.

This would help in cases reported in elastic/ecs-logging-java#94

Do we have a general policy or philosophy on processor parity for Beats and ingest node?

botelastic bot commented Nov 19, 2020

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough 👍.
Thank you for your contribution!

@botelastic added the Stalled label Nov 19, 2020
@botelastic removed the Stalled label Nov 20, 2020
@axw (Member) commented Nov 22, 2020

and probably at least an attempt at improving performance - I made no effort on that front.

We could alternatively expand in place, reducing allocations. There is no need to sort keys to obtain a deterministic result; checking that the values are both maps or both non-maps means it's not sensitive to ordering.

EDIT: I realised some time later that this code isn't quite right: it doesn't properly handle the case where a value is itself a map; the m.Put call could replace intermediate maps. It needs adjusting to merge map values in.

// ExpandInPlace expands dotted keys recursively, reusing the receiver map
// where possible instead of allocating a new one.
//
// Note (see EDIT above): Put can replace intermediate maps when a value is
// itself a map, so map values still need to be merged in explicitly.
func (m MapStr) ExpandInPlace() error {
    for k, v := range m {
        // Expand nested maps first.
        if vm, ok := tryToMapStr(v); ok {
            if err := vm.ExpandInPlace(); err != nil {
                return err
            }
        }
        if dot := strings.IndexRune(k, '.'); dot >= 0 {
            // Remove the dotted key and re-insert it via Put, which
            // creates the intermediate maps.
            delete(m, k)
            old, err := m.Put(k, v)
            if err != nil {
                return err
            }
            if old != nil {
                // Reject conflicting expansions, e.g. "a.b" vs "a.b.c",
                // independent of map iteration order.
                _, oldIsMap := tryToMapStr(old)
                _, newIsMap := tryToMapStr(v)
                if oldIsMap != newIsMap {
                    return fmt.Errorf("incompatible types expanded for %s: old: %T new: %T", k, old, v)
                }
            }
        }
    }
    return nil
}
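For reference, a quick usage sketch of the method above. It assumes ExpandInPlace were added to common.MapStr; it is not an existing method there.

package main

import (
    "fmt"

    "github.com/elastic/beats/v7/libbeat/common"
)

func main() {
    m := common.MapStr{"foo.bar": 1, "foo.baz": 2}

    // ExpandInPlace is the sketch above, not an existing MapStr method.
    if err := m.ExpandInPlace(); err != nil {
        // Conflicting expansions such as "a.b" alongside "a.b.c" land here.
        panic(err)
    }
    fmt.Println(m) // {"foo":{"bar":1,"baz":2}}
}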

axw added 3 commits November 26, 2020 09:10
I have intentionally left fail_on_error undocumented, as its behaviour is not really living up to its name. Either way the original fields are left intact; the only difference is whether an error is logged.
@axw (Member) commented Nov 26, 2020

I added some docs and tests to the existing implementation. After further testing, I didn't feel like an in-place expansion method was worth pursuing right now, with the additional complexity and only marginal improvement.

@axw marked this pull request as ready for review November 26, 2020
@elasticmachine (Collaborator)
Pinging @elastic/integrations-services (Team:Services)

@axw requested a review from urso November 26, 2020
Review comment (Contributor):

When using the Expand() method to expand logging documents, returning an error from the method would lead to losing logs. Should we instead drop one of the fields causing the error (since the fields are ordered, it would be deterministic) and maybe log a warning?

Review comment (Member):

If an error in the pipeline processing happens, I would expect that Filebeat would still index the event with the raw and invalid JSON as a string in the message field and with the error.message and error.type fields set.

See also this demo:

$ ./filebeat -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.add_error_key:true}]'
foo
{"@timestamp":"2020-11-06T19:54:34.117Z","@metadata":{"beat":"filebeat","type":"_doc","version":"7.9.3"},"agent":{"ephemeral_id":"6bea7e71-64f3-4f05-bad9-561b33f93159","id":"d68943aa-0741-48d0-9c73-940d2cb14035","name":"ix.lan","type":"filebeat","version":"7.9.3","hostname":"ix.lan"},"message":"foo","log":{"offset":0,"file":{"path":""}},"error":{"message":"Error decoding JSON: invalid character 'o' in literal false (expecting 'a')","type":"json"},"input":{"type":"stdin"},"ecs":{"version":"1.5.0"},"host":{"name":"ix.lan"}}

Review comment (Member):

@simitt good point! I hadn't thought that through before: although libbeat will log a warning and continue indexing with the original document, Elasticsearch will internally expand the JSON object fields for field mapping, and this would lead to the same kind of conflict/error that we encounter in the processor, leading to the document not being indexed.

If we still drop conflicting fields in the error case, I wonder if it would be obvious enough from the document that there was an issue in processing.

@felixbarny At this point we've already decoded the JSON, and it's technically valid. I think it would be a bit surprising to then stringify the JSON and put it all back in the message field.

Unless... instead of adding a new processor, we add an option to the json input type? I.e. instead of

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true}]' -E processors='[{dot_expand:{}}]' -d '*'

we could have

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.expand_fields:true}]' -d '*'

Then an error to expand fields would be an error to decode, and the behaviour you describe would be natural.

Review comment (Member):

Having it directly in json.expand_fields would be nice. But it would then also have to be a property in the decode_json_fields processor.

@felixbarny (Member)

We could alternatively expand in place, reducing allocations.

I guess that would make it quite a bit more efficient in the case where there are no or only a few dotted fields. Is that right? That'd be nice to have, especially for loggers like ecs-dotnet as they only have log.level as a dotted field and all others are nested already.

@axw (Member) commented Dec 1, 2020

I guess that would make it quite a bit more efficient in the case where there are no or only a few dotted fields. Is that right?

Correct.

That'd be nice to have, especially for loggers like ecs-dotnet as they only have log.level as a dotted field and all others are nested already.

Ah, that's news to me. I'm going to move this back into draft as I look into the alternative of adding an option to the json input/decode_json_fields processor. I'll reconsider making it in-place at the same time.

@graphaelli (Member, Author)

Superseded by #22849.

@graphaelli closed this Dec 14, 2020
@graphaelli deleted the dot-expand branch December 14, 2020
Labels: enhancement, Team:Services (Deprecated), v7.11.0, v8.0.0

Successfully merging this pull request may close these issues: Add de-dot processor that converts dotted field names to nested objects (#17021)

6 participants