[Filebeat] Add postgresql csv fileset #23334
Conversation
As a configuration option, create another pipeline inside the module to parse CSV log files. Although it's necessary to configure the database to emit such logs, there are some advantages: configured properly, some events like `statement timeout` and `lock timeout` will display the query in the same event, as opposed to multiple lines, one with the error message and another with the statement.
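For context, PostgreSQL only produces these files when the server is configured for it; a minimal sketch of the relevant `postgresql.conf` settings (illustrative values, not part of this PR):

```
logging_collector = on              # let the server write its own log files
log_destination = 'csvlog'          # emit CSV logs; can be combined, e.g. 'stderr,csvlog'
log_filename = 'postgresql-%a.log'  # for csvlog, the .log suffix is replaced with .csv
```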
💚 Build Succeeded
Pinging @elastic/integrations (Team:Integrations)
Thanks for this contribution, very interesting, and quite complete!
jenkins run the tests please
Do not generate duplicate field groups. Co-authored-by: Jaime Soriano Pastor <[email protected]>
jenkins run the tests please
By the way, there is another approach to support multiple formats. For example, the elasticsearch module contains pipelines for plain and JSON logs, and it chooses automatically which one to use. Multiple pipelines can be included like this in the fileset manifest (see `beats/filebeat/module/elasticsearch/server/manifest.yml`, lines 15 to 18 at 8e8bfc8).
The first pipeline is executed when an event is received; in this case it is the one that chooses which format-specific pipeline to invoke (see `beats/filebeat/module/elasticsearch/server/ingest/pipeline.yml`, lines 15 to 20 at 8e8bfc8).
This approach is also used in other modules like kibana, logstash, coredns, envoyproxy... you can look for other examples that use this mechanism. If we can do something similar for these postgresql logs, we could add a CSV-specific pipeline to the existing fileset. @azlev what do you think? Would it be possible to do it this way in this case?
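Roughly, the wiring looks like this (a sketch based on how the elasticsearch fileset does it; paths and pipeline names are illustrative, not taken from this PR):

```yaml
# manifest.yml — list every pipeline; the first one is the entry point:
ingest_pipeline:
  - ingest/pipeline.yml
  - ingest/pipeline-plaintext.yml
  - ingest/pipeline-json.yml

# ingest/pipeline.yml — the entry pipeline delegates to a format-specific one;
# '{< IngestPipeline "..." >}' is resolved by Filebeat when the pipelines are loaded:
processors:
  - pipeline:
      if: ctx.first_char != '{'
      name: '{< IngestPipeline "pipeline-plaintext" >}'
  - pipeline:
      if: ctx.first_char == '{'
      name: '{< IngestPipeline "pipeline-json" >}'
```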
Commit cacde06 changed field names. I committed in GitHub without paying attention to the tests. This commit fixes all tests and the pipeline.
I tried a minimal approach to avoid breaking changes, and I assumed that it would be easier to merge. So now I'll try to write more integrated code. I'll give it a try, but those changes will take a while to implement.
* Adding postgresql.log.query_name in grok. The query_step was ignored because it's a subset of command_tag.
* Separate message and query fields. Message is the main message, and query is what the message is referencing.
@jsoriano I found 2 ways to choose the ingest pipeline: checking the file name, or matching the message against a regular expression. I couldn't find any way to get the file name, and this will break the tests. The regular expression doesn't look like a good idea in a critical path like this, and looks fragile. Do you have other ideas?
@azlev I agree that the simpler we can do the selection, the better, to avoid affecting the critical path. Checking the file extension could be a good option; the file path should be available in the event. Ideally, the decision of using one or the other pipeline should be based on something available in the message itself, as you are doing with the regular expression. Another possible approach could be to add a configuration option to select the format. Yet another approach you could try is to parse with grok in the current pipeline but, if grok fails, fall back to the CSV pipeline on failure. If we are not convinced by any of these approaches I am ok with going back to the approach of using two different filesets.
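The fallback idea could look roughly like this (a hypothetical sketch, not code from this PR; `POSTGRESQL_PLAIN_PREFIX` stands in for the existing plain-text prefix pattern):

```yaml
processors:
  - grok:
      field: message
      patterns:
        - '%{POSTGRESQL_PLAIN_PREFIX}%{GREEDYDATA:postgresql.log.message}'
      # If the plain-text grok does not match, hand the event to the CSV
      # pipeline instead of failing the whole document.
      on_failure:
        - pipeline:
            name: '{< IngestPipeline "pipeline-csv" >}'
```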
There is no `core_id` inside PostgreSQL logs. Deprecating the field and moving it to `session_line_number`.
This commit opens the path to add 2 pipelines:
- The default one, using plain-text log files
- The CSV pipeline
Since both formats represent the same information, it makes sense to have only one fileset.
Creates a new grok pattern to match only the timestamp and the next char, the separator. Based on the separator, choose the right pipeline.
Both of them represent the same information, so use the same pipeline to do the job.
I've split the grok pipeline in 2 parts: the first one parses the timestamp plus 1 char, and this char is used to choose the correct pipeline. In the process, I deprecated the `core_id` field. Still, there are some fields that I need to work on; I'm currently thinking about 2 of them.
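A rough sketch of that selection logic (field names and the temporary `_temp.separator` key are illustrative; `pipeline-log` and `pipeline-csv` are the file names this PR ends up with):

```yaml
processors:
  # Parse only the timestamp and the single character that follows it.
  - grok:
      field: message
      patterns:
        - '^%{TIMESTAMP_ISO8601:postgresql.log.timestamp} %{WORD:event.timezone}%{CHAR:_temp.separator}'
      pattern_definitions:
        CHAR: '.'
  # A comma right after the timestamp means CSV; anything else is the plain-text format.
  - pipeline:
      if: ctx?._temp?.separator == ','
      name: '{< IngestPipeline "pipeline-csv" >}'
  - pipeline:
      if: ctx?._temp?.separator != ','
      name: '{< IngestPipeline "pipeline-log" >}'
  - remove:
      field: _temp
      ignore_missing: true
```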
- Remove unused `thread_id` field
- Issue a `make update`
Drop `postgresql.log.error_severity` in favor of `log.level`
Based on PostgreSQL Error Codes, everything that does not start with `00`, `01`, or `02` is an error.
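In pipeline terms, the rule can be expressed with conditions on `postgresql.log.sql_state_code` like the one visible in the diff further down (a simplified sketch; the exact processors in the PR may differ):

```yaml
  # SQLSTATE classes 00 (success), 01 (warning) and 02 (no data) are not errors.
  - set:
      field: event.type
      value: info
      if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
  - set:
      field: event.type
      value: error
      if: "ctx?.postgresql?.log?.sql_state_code != null && !(ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
```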
@jsoriano I have a question here: https://github.com/elastic/beats/blob/master/filebeat/module/postgresql/log/ingest/pipeline.yml#L40
In the current code, if there is an error, the final event will have `"event.type": ["info", "error"]`. Is that right?
Regenerate files to have a consistent field order.
jenkins run the tests please
@azlev I think this is almost ready to merge. I have added some small questions/suggestions, let me know what you think.
```yaml
GREEDYDATA: |-
  (.|
  )*
POSTGRESQL_QUERY_STEP: '(parse|bind|statement|fastpath function call|execute|execute fetch from)'
```
Query step is captured in `postgresql.log.query_step` in the plain text pipeline; should it also be captured in this pipeline?
That one I was in doubt about. This `query_step` overlaps with `command_tag`, but not 100%, so I opted to leave it empty since I was creating a new fileset at first. Now that I've merged them, I can review this. I'll populate it, at least for some backward compatibility.
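One hypothetical way to populate it in the CSV pipeline is to grok the message column with the same `POSTGRESQL_QUERY_STEP` pattern used by the plain-text pipeline (a sketch, not necessarily what the final commit does):

```yaml
  - grok:
      field: message
      patterns:
        - '^%{POSTGRESQL_QUERY_STEP:postgresql.log.query_step}%{GREEDYDATA}'
      pattern_definitions:
        POSTGRESQL_QUERY_STEP: '(parse|bind|statement|fastpath function call|execute|execute fetch from)'
      # The message does not always start with a query step, so a miss is fine.
      ignore_failure: true
```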
```
@@ -41,15 +47,22 @@ processors:
      field: event.type
      value:
        - info
      if: "ctx?.postgresql?.log?.sql_state_code == null || (ctx.postgresql.log.sql_state_code ==~ /^0[012].*/)"
  - append:
```
The `set` processor could be used to add `event.type` instead of `append`, so it is not set as an array in the document.
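The difference in the resulting document, for illustration:

```yaml
  # append accumulates, so an event that already had event.type: info ends up with an array:
  - append:
      field: event.type
      value: error        # -> "event.type": ["info", "error"]
  # set overwrites the field with a single value:
  - set:
      field: event.type
      value: error        # -> "event.type": "error"
```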
jenkins run the tests again please
Change part of the description to a comment. Co-authored-by: Jaime Soriano Pastor <[email protected]>
- Run update since the last `fields.yml` change was done only in GitHub
- Use `set` instead of `append` in the pipeline to not create an array
- Capture `postgresql.log.query_step` in the CSV pipeline
LGTM, let's wait for a green build. Thanks @azlev!
jenkins run the tests
/test
Now PostgreSQL has 3 files in the ingest directory:
- pipeline.yml
- pipeline-log.yml
- pipeline-csv.yml
Change the count to 3 to reflect that.
`postresql.log.sql_state_code` -> `postgresql.log.sql_state_code`.
Looks good. Only a couple of small details, sorry I didn't see them before.
```yaml
    description: >
      A possible solution to solve an error.
  - name: internal_query
    description: internal query that led to the error (if any).
```
Please uppercase the first letter of sentences in these descriptions.
```diff
-    description: internal query that led to the error (if any).
+    description: >
+      Internal query that led to the error (if any).
```
```diff
@@ -34,7 +34,7 @@ func TestGetPipelinePath(t *testing.T) {
 		},
 		{
 			pipelinePath: "../../module/postgresql/log/ingest",
-			count: 1,
+			count: 3,
```
🙂
Co-authored-by: Jaime Soriano Pastor <[email protected]>
👀
Great 👍
Refactor PostgreSQL module to support logs in CSV format. The fileset now has three pipelines: the first one is executed to parse the initial part of the lines, until it decides whether the logs are plain text or CSV; once decided, it invokes one of the other two pipelines, one specific to plain-text logs and the other to CSV. Several test cases and documentation are added for CSV support. Additional fields are available now, and some others have been renamed to more accurately represent the values they store. Co-authored-by: Jaime Soriano Pastor <[email protected]> (cherry picked from commit bb78931)
Refactor PostgreSQL module to support logs in CSV format. The fileset now has three pipelines: the first one is executed to parse the initial part of the lines, until it decides whether the logs are plain text or CSV; once decided, it invokes one of the other two pipelines, one specific to plain-text logs and the other to CSV. Several test cases and documentation are added for CSV support. Additional fields are available now, and some others have been renamed to more accurately represent the values they store. (cherry picked from commit bb78931) Co-authored-by: Jaime Soriano Pastor <[email protected]> Co-authored-by: José Arthur Benetasso Villanova <[email protected]>
Merged and backported to 7.x, this will hopefully be available in 7.12, thanks @azlev!
…dows-7

* upstream/master: (332 commits)
  Use ECS v1.8.0 (elastic#24086)
  Add support for postgresql csv logs (elastic#23334)
  [Heartbeat] Refactor config system (elastic#23467)
  [CI] install docker-compose with retry (elastic#24069)
  Add nodes to filebeat-kubernetes.yaml ClusterRole - fixes elastic#24051 (elastic#24052)
  updating manifest files for filebeat threatintel module (elastic#24074)
  Add Zeek Signatures (elastic#23772)
  Update Beats to ECS 1.8.0 (elastic#23465)
  Support running Docker logging plugin on ARM64 (elastic#24034)
  Fix ec2 metricset fields.yml and add integration test (elastic#23726)
  Only build targz and zip versions of Beats if PACKAGES is set in agent (elastic#24060)
  [Filebeat] Add field definitions for known Netflow/IPFIX vendor fields (elastic#23773)
  [Elastic Agent] Enroll with Fleet Server (elastic#23865)
  [Filebeat] Convert logstash logEvent.action objects to strings (elastic#23944)
  [Ingest Management] Fix reloading of log level for services (elastic#24055)
  Add Agent standalone k8s manifest (elastic#23679)
  [Metricbeat][Kubernetes] Extend state_node with more conditions (elastic#23905)
  [CI] googleStorageUploadExt step (elastic#24048)
  Check fields are documented for aws metricsets (elastic#23887)
  Update go-concert to 0.1.0 (elastic#23770)
  ...
What does this PR do?
This PR creates a new fileset/pipeline inside the PostgreSQL module. This new fileset parses and creates events based on CSV log files from PostgreSQL.
Why is it important?
PostgreSQL logs weren't designed to be parsed, so, without very complicated logic, it's not possible to encapsulate all information in just one event. One example of that is `statement timeout`: the error message and the offending statement are logged as two separate lines. Using the CSV log, those 2 lines become 1:
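The original sample output is not reproduced here; as a made-up illustration of the difference (timestamps, PID and query are invented, and several CSV columns are elided):

```
# Plain-text log: two lines for one statement timeout
2021-02-05 10:15:30.123 UTC [4242] ERROR:  canceling statement due to statement timeout
2021-02-05 10:15:30.123 UTC [4242] STATEMENT:  SELECT pg_sleep(60);

# CSV log: the same information in a single line (message and query columns together)
2021-02-05 10:15:30.123 UTC,"postgres","postgres",4242,...,ERROR,57014,"canceling statement due to statement timeout",...,"SELECT pg_sleep(60);",...
```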
Checklist
- I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.

Author's Checklist