
add dot_expand processor #20489

Closed
wants to merge 8 commits

Conversation

@graphaelli (Member) commented Aug 6, 2020

What does this PR do?

Adds a dot_expand processor utilizing a new Expand method on MapStr - the reverse of Flatten.

Given an input log line {"foo.bar": 1}, an event with expanded fields will be published:

{
  "@timestamp": "2020-08-06T22:10:36.197Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.0.0"
  },
  "ecs": {
    "version": "1.5.0"
  },
  "host": {
    "os": {
      "name": "Mac OS X",
      "kernel": "19.4.0",
      "build": "19E287",
      "platform": "darwin",
      "version": "10.15.4",
      "family": "darwin"
    },
    ...
  },
  "foo": {
    "bar": 1
  }
}
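For illustration, a minimal sketch of the "reverse of Flatten" idea using plain Go maps (not the PR's MapStr implementation; conflicting keys are ignored here):

package main

import (
    "fmt"
    "strings"
)

// expand splits dotted keys and nests their values, e.g.
// {"foo.bar": 1} becomes {"foo": {"bar": 1}}.
func expand(flat map[string]interface{}) map[string]interface{} {
    out := map[string]interface{}{}
    for key, value := range flat {
        parts := strings.Split(key, ".")
        node := out
        for _, part := range parts[:len(parts)-1] {
            child, ok := node[part].(map[string]interface{})
            if !ok {
                child = map[string]interface{}{}
                node[part] = child
            }
            node = child
        }
        node[parts[len(parts)-1]] = value
    }
    return out
}

func main() {
    fmt.Println(expand(map[string]interface{}{"foo.bar": 1})) // map[foo:map[bar:1]]
}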

Why is it important?

See #17021

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Performance of MapStr.Expand
  • Overlapping/duplicated key expectations
  • Additional edge cases

How to test this PR locally

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true}]' -E processors='[{dot_expand:{}}]' -d '*'

Paste some JSON in and see it expanded in the output document.

Related issues

Closes #17021

@botelastic added the needs_team label Aug 6, 2020
@graphaelli added the Team:Services (Deprecated) label Aug 6, 2020
@botelastic removed the needs_team label Aug 6, 2020
@elasticmachine (Collaborator) commented Aug 7, 2020

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #20489 updated

  • Start Time: 2020-11-26T03:00:31.105+0000

  • Duration: 71 min 26 sec

Test stats 🧪: 0 failed, 16702 passed, 1369 skipped, 18071 total

💚 Flaky test report

Tests succeeded.


Review comment:
Yay, more methods on common.MapStr. The MapStr interface is somewhat inconsistent, but since Expand looks to be consistent with 'Flatten', I think the interface is OK as is. By the way, nice trick using 'Put' :)

As you may have noticed during testing, this approach is non-deterministic. Whenever Go creates a map, it generates a random 'seed' value to be used when hashing keys, so keys are ordered randomly. This can actually lead to conflicts when expanding, e.g. if we have a.b=1 and a.b.c=2: is a.b now a primitive or an object? In order to establish some determinism you need to create a sorted array of keys first (sorting based on length would be good enough).

Not sure where, but I think somewhere we solved the conflict by renaming a.b=1 to a.b.value=1.
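For illustration, a sketch of that suggestion using plain Go maps (the shortest-first ordering and the 'value' rename follow this comment; the helper itself is hypothetical, not the PR's code):

package main

import (
    "fmt"
    "sort"
    "strings"
)

// expandDeterministic visits keys shortest-first so conflicts such as
// "a.b" vs "a.b.c" always resolve the same way, independent of Go's
// random map iteration order. When a scalar already occupies a prefix,
// it is moved under a "value" key.
func expandDeterministic(flat map[string]interface{}) map[string]interface{} {
    keys := make([]string, 0, len(flat))
    for k := range flat {
        keys = append(keys, k)
    }
    sort.Slice(keys, func(i, j int) bool {
        if len(keys[i]) != len(keys[j]) {
            return len(keys[i]) < len(keys[j])
        }
        return keys[i] < keys[j]
    })

    out := map[string]interface{}{}
    for _, key := range keys {
        parts := strings.Split(key, ".")
        node := out
        for _, part := range parts[:len(parts)-1] {
            child, ok := node[part].(map[string]interface{})
            if !ok {
                child = map[string]interface{}{}
                if old, exists := node[part]; exists {
                    child["value"] = old // scalar prefix becomes {"value": old}
                }
                node[part] = child
            }
            node = child
        }
        node[parts[len(parts)-1]] = flat[key]
    }
    return out
}

func main() {
    m := map[string]interface{}{"a.b": 1, "a.b.c": 2}
    fmt.Println(expandDeterministic(m)) // map[a:map[b:map[c:2 value:1]]]
}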

Review comment on the tests:
Expand and Flatten should satisfy the properties flatten(expand(m)) == flatten(m) and expand(flatten(m)) == expand(m). I wonder if this is always the case.
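For illustration, a property test along these lines might look like the following sketch. The Expand method and its (MapStr, error) signature are assumptions based on this PR's test code; Flatten already exists on common.MapStr.

package common_test

import (
    "testing"

    "github.com/elastic/beats/v7/libbeat/common"
    "github.com/stretchr/testify/assert"
)

// Round-trip sketch: flatten(expand(m)) should equal flatten(m) for
// conflict-free inputs. Expand is the method proposed in this PR,
// with an assumed (MapStr, error) signature.
func TestExpandFlattenRoundTrip(t *testing.T) {
    m := common.MapStr{
        "foo.bar": 1,
        "baz":     common.MapStr{"qux": 2},
    }

    expanded, err := m.Expand()
    assert.NoError(t, err)
    assert.Equal(t, m.Flatten(), expanded.Flatten())
}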

@urso left a comment

Overall the approach looks good. We're missing docs, though.

It is very important that 'Expand' be deterministic. If users report errors, we must be able to explain why things failed.

@graphaelli (Member, Author)

Thanks very much for reviewing. Now that you've confirmed the approach is reasonable, we can update Expand to fix up https://github.com/elastic/beats/pull/20489/files#diff-99175365e34c2bfb5cfc3ed3d777f16bR875-R889, resolving the consistency concerns we share - I expect that can also address performance.

@felixbarny (Member)

Looks great! What's missing before it's ready for review?

@graphaelli (Member, Author)

This needs docs, probably an experimental flag in case we want to change the default behavior or only apply this to specific fields or fail_on_error behavior, and probably at least an attempt at improving performance - I made no effort on that front.

botelastic bot commented Oct 4, 2020

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough 👍.
Thank you for your contribution!

@felixbarny (Member)

We might also consider an ES ingest node processor that can expand dots if users want to send the logs directly to Elasticsearch, or if they want to offload that potentially CPU-expensive task from production servers.

This would help in cases reported in elastic/ecs-logging-java#94

Do we have a general policy or philosophy on processor parity for Beats and ingest node?

botelastic bot commented Nov 19, 2020

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough 👍.
Thank you for your contribution!

@botelastic added the Stalled label Nov 19, 2020
@botelastic removed the Stalled label Nov 20, 2020
@axw (Member) commented Nov 22, 2020

and probably at least an attempt at improving performance - I made no effort on that front.

We could alternatively expand in place, reducing allocations. There is no need to sort keys to obtain a deterministic result; checking that the values are both maps or both non-maps means it's not sensitive to ordering.

EDIT: I realised some time later that this code isn't quite right: it doesn't properly handle the case where a value is itself a map; the m.Put call could replace intermediate maps. It needs adjusting to merge map values in.

// ExpandInPlace expands dotted keys recursively, reusing the receiver map
// where possible instead of allocating a new one.
//
// Note (see EDIT above): Put can replace intermediate maps when a value is
// itself a map, so map values still need to be merged in explicitly.
func (m MapStr) ExpandInPlace() error {
    for k, v := range m {
        // Expand nested maps first.
        if vm, ok := tryToMapStr(v); ok {
            if err := vm.ExpandInPlace(); err != nil {
                return err
            }
        }
        if dot := strings.IndexRune(k, '.'); dot >= 0 {
            // Remove the dotted key and re-insert it via Put, which
            // creates the intermediate maps.
            delete(m, k)
            old, err := m.Put(k, v)
            if err != nil {
                return err
            }
            if old != nil {
                // Reject conflicting expansions, e.g. "a.b" vs "a.b.c",
                // independent of map iteration order.
                _, oldIsMap := tryToMapStr(old)
                _, newIsMap := tryToMapStr(v)
                if oldIsMap != newIsMap {
                    return fmt.Errorf("incompatible types expanded for %s: old: %T new: %T", k, old, v)
                }
            }
        }
    }
    return nil
}
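For reference, a quick usage sketch of the method above. It assumes ExpandInPlace were added to common.MapStr; it is not an existing method there.

package main

import (
    "fmt"

    "github.com/elastic/beats/v7/libbeat/common"
)

func main() {
    m := common.MapStr{"foo.bar": 1, "foo.baz": 2}

    // ExpandInPlace is the sketch above, not an existing MapStr method.
    if err := m.ExpandInPlace(); err != nil {
        // Conflicting expansions such as "a.b" alongside "a.b.c" land here.
        panic(err)
    }
    fmt.Println(m) // {"foo":{"bar":1,"baz":2}}
}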

axw added 3 commits November 26, 2020 09:10
I have intentionally left fail_on_error undocumented, as its behaviour is not really living up to its name. Either way the original fields are left intact; the only difference is whether an error is logged.
@axw (Member) commented Nov 26, 2020

I added some docs and tests to the existing implementation. After further testing, I didn't feel like an in-place expansion method was worth pursuing right now, with the additional complexity and only marginal improvement.

@axw marked this pull request as ready for review November 26, 2020
@elasticmachine (Collaborator)
Pinging @elastic/integrations-services (Team:Services)

@axw requested a review from urso November 26, 2020
Review comment (Contributor):

When using the Expand() method to expand logging documents, returning an error from the method would lead to losing logs. Should we instead drop one of the fields causing the error (since the fields are ordered, it would be deterministic) and maybe log a warning?

Review comment (Member):

If an error in the pipeline processing happens, I would expect that Filebeat would still index the event with the raw and invalid JSON as a string in the message field and with the error.message and error.type fields set.

See also this demo:

$ ./filebeat -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.add_error_key:true}]'
foo
{"@timestamp":"2020-11-06T19:54:34.117Z","@metadata":{"beat":"filebeat","type":"_doc","version":"7.9.3"},"agent":{"ephemeral_id":"6bea7e71-64f3-4f05-bad9-561b33f93159","id":"d68943aa-0741-48d0-9c73-940d2cb14035","name":"ix.lan","type":"filebeat","version":"7.9.3","hostname":"ix.lan"},"message":"foo","log":{"offset":0,"file":{"path":""}},"error":{"message":"Error decoding JSON: invalid character 'o' in literal false (expecting 'a')","type":"json"},"input":{"type":"stdin"},"ecs":{"version":"1.5.0"},"host":{"name":"ix.lan"}}

Review comment (Member):

@simitt good point! I hadn't thought that through before: although libbeat will log a warning and continue indexing with the original document, Elasticsearch will internally expand the JSON object fields for field mapping, and this would lead to the same kind of conflict/error that we encounter in the processor, leading to the document not being indexed.

If we still drop conflicting fields in the error case, I wonder if it would be obvious enough from the document that there was an issue in processing.

@felixbarny At this point we've already decoded the JSON, and it's technically valid. I think it would be a bit surprising to then stringify the JSON and put it all back in the message field.

Unless... instead of adding a new processor, we add an option to the json input type? I.e. instead of

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true}]' -E processors='[{dot_expand:{}}]' -d '*'

we could have

./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.expand_fields:true}]' -d '*'

Then an error to expand fields would be an error to decode, and the behaviour you describe would be natural.

Review comment (Member):

Having it directly in json.expand_fields would be nice. But it would then also have to be a property in the decode_json_fields processor.

@felixbarny (Member)

We could alternatively expand in place, reducing allocations.

I guess that would make it quite a bit more efficient in the case where there are no or only a few dotted fields. Is that right? That'd be nice to have, especially for loggers like ecs-dotnet as they only have log.level as a dotted field and all others are nested already.

@axw (Member) commented Dec 1, 2020

I guess that would make it quite a bit more efficient in the case where there are no or only a few dotted fields. Is that right?

Correct.

That'd be nice to have, especially for loggers like ecs-dotnet as they only have log.level as a dotted field and all others are nested already.

Ah, that's news to me. I'm going to move this back into draft as I look into the alternative of adding an option to the json input/decode_json_fields processor. I'll reconsider making it in-place at the same time.

@graphaelli (Member, Author)

Superseded by #22849.

@graphaelli closed this Dec 14, 2020
@graphaelli deleted the dot-expand branch December 14, 2020
Labels: enhancement, Team:Services (Deprecated), v7.11.0, v8.0.0

Successfully merging this pull request may close these issues: Add de-dot processor that converts dotted field names to nested objects (#17021)

6 participants