-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add dot_expand proceesor #20489
add dot_expand proceesor #20489
Conversation
} | ||
} | ||
return e, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, more methods on common.MapStr. MapStr interface is somewhat inconsistent, but as the Expand interface looks to be consistent with 'Flatten', I think the interface is ok as is. btw. nice trick about using 'Put' :)
As you may have noticed during testing, this approach is non-deterministic. Whenever go creates a map, it generates a random 'seed' value to be used when hashing keys. Thusly keys are ordered randomly. This can actually lead to conflicts when expanding, e.g. if we have a.b=1
and a.b.c=2
. Is a.b
now a primitive or an object. In order to establish some determinism you need to create a sorted array of keys first (sorting based on length would be good enough).
Not sure where, but I think somewhere we solved the conflict by renaming a.b=1
to a.b.value=1
} | ||
require.NoError(t, err) | ||
assert.Equal(t, test.Expected, x) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expand and flatten should fulfill the properties flatten(expand(m)) == flatten(m)
, expand(flatten(m)) == expand(m)
. I wonder if this is always the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the approach looks good. We're missing docs, though.
Very important is to have the 'Expand' be deterministic. If users report errors, we must be able to explain why things failed.
Thanks very much for reviewing. Now that you've confirmed the approach is reasonable, we can update |
and even less performant
Looks great! What's missing before it's ready for review? |
This needs docs, probably an experimental flag in case we want to change the default behavior or only apply this to specific fields or fail_on_error behavior, and probably at least an attempt at improving performance - I made no effort on that front. |
Hi! We're labeling this issue as |
We might also consider an ES ingest node processor that can expand dots if users want to send the logs directly to Elasticsearch, or if they want to offload that potentially CPU expensive task from production servers. This would help in cases reported in elastic/ecs-logging-java#94 Do we have a general policy or philosophy on processor parity for Beats and ingest node? |
Hi! We're labeling this issue as |
We could alternatively expand in place, reducing allocations. No need to sort keys to obtain a deterministic result; checking the values are both/neither maps means it's not sensitive to ordering. EDIT: I realised some time later that this code isn't quite right, it doesn't properly handle the case where you have a map value; the func (m MapStr) ExpandInPlace() error {
for k, v := range m {
if vm, ok := tryToMapStr(v); ok {
if err := vm.ExpandInPlace(); err != nil {
return err
}
}
if dot := strings.IndexRune(k, '.'); dot >= 0 {
delete(m, k)
old, err := m.Put(k, v)
if err != nil {
return err
}
if old != nil {
_, oldIsMap := tryToMapStr(old)
_, newIsMap := tryToMapStr(v)
if oldIsMap != newIsMap {
return fmt.Errorf("incompatible types expanded for %s: old: %T new: %T", k, old, v)
}
}
}
}
return nil
} |
I have intentionally left fail_on_error undocumented, as its behaviour is not really living up to its name. Either way the original fields are left intact; the only difference is whether an error is logged.
I added some docs and tests to the existing implementation. After further testing, I didn't feel like an in-place expansion method was worth pursuing right now, with the additional complexity and only marginal improvement. |
Pinging @elastic/integrations-services (Team:Services) |
} | ||
return e, nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using the Expand()
method to expand logging documents returning an error for the method would lead to loosing logs. Should we instead drop one of the fields causing the error (since the fields are ordered it would be deterministic) and maybe log a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an error in the pipeline processing happens, I would expect that Filebeat would still index the event with the raw and invalid JSON as a string in the message
field and with the error.message
and error.type
fields set.
See also this demo:
$ ./filebeat -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.add_error_key:true}]'
foo
{"@timestamp":"2020-11-06T19:54:34.117Z","@metadata":{"beat":"filebeat","type":"_doc","version":"7.9.3"},"agent":{"ephemeral_id":"6bea7e71-64f3-4f05-bad9-561b33f93159","id":"d68943aa-0741-48d0-9c73-940d2cb14035","name":"ix.lan","type":"filebeat","version":"7.9.3","hostname":"ix.lan"},"message":"foo","log":{"offset":0,"file":{"path":""}},"error":{"message":"Error decoding JSON: invalid character 'o' in literal false (expecting 'a')","type":"json"},"input":{"type":"stdin"},"ecs":{"version":"1.5.0"},"host":{"name":"ix.lan"}}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simitt good point! I hadn't thought that through before: although libbeat will log a warning and continue indexing with the original document, Elasticsearch will internally expand the JSON object fields for field mapping, and this would lead to the same kind of conflict/error that we encounter in the processor, leading to the document not being indexed.
If we still drop conflicting fields in the error case, I wonder if it would be obvious enough from the document that there was an issue in processing.
@felixbarny At this point we've already decoded the JSON, and it's technically valid. I think it would be a bit surprising to then stringify the JSON and put it all back in the message
field.
Unless... instead of adding a new processor, we add an option to the json
input type instead? i.e. instead of
./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true}]' -E processors='[{dot_expand:{}}]' -d '*'
we could have
./filebeat -e -c /dev/null --strict.perms=false -E output.console.enabled=true -E filebeat.inputs='[{type: stdin, enabled:true, json.keys_under_root:true, json.expand_fields:true}]' -d '*'
Then an error to expand fields would be be an error to decode, and the behaviour you describe would be natural.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having it directly in json.expand_fields
would be nice. But it would then also have to be a property in the decode_json_fields
processor.
I guess that would make it quite a bit more efficient in the case where there are no or only a few dotted fields. Is that right? That'd be nice to have, especially for loggers like ecs-dotnet as they only have |
Correct.
Ah, that's news to me. I'm going to move this back into draft as I look into the alternative of adding an option to the json input/decode_json_fields processor. I'll reconsider making it in-place at the same time. |
superseded by #22849 |
What does this PR do?
Adds a
dot_expand
processor utilizing a newExpand
method onMapStr
- the reverse ofFlatten
.Given an input log line event:
{"foo.bar": 1}
, an event with expanded fields will be published:Why is it important?
See #17021
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Author's Checklist
MapStr.Expand
How to test this PR locally
paste some json in and see it expanded in the output document.
Related issues
Closes #17021