Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metricbeat] Add SQL module (#13106) #13257

Merged
merged 1 commit into from
Dec 11, 2019
Merged

[Metricbeat] Add SQL module (#13106) #13257

merged 1 commit into from
Dec 11, 2019

Conversation

amandahla
Copy link

@amandahla amandahla commented Aug 15, 2019

This is a draft for a odbc module that you can set datasource and a query to generate metrics.

This way, you can get metrics from DB2 (#13106) and others databases.

PR is in progress and subject to changes.

How to test

  • Start a database with a driver included in Metricbeat (MySQL, PostgreSQL, Oracle).
  • Configure metricbeat with a DSN for this database, and a query, for example for mysql:
- module: sql
  metricsets: ['query']
  hosts: ['root:test@tcp(172.22.0.3:3306)/']
  sql_query: ['select table_schema, table_name, engine, table_rows from information_schema.tables where table_rows > 0']
  • Check that it can collect data.

@amandahla amandahla requested a review from a team as a code owner August 15, 2019 17:47
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

1 similar comment
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@ghost
Copy link

ghost commented Aug 15, 2019

Hi @amandahla, we have found your signature in our records, but it seems like you have signed with a different e-mail than the one used in your Git commit. Can you please add both of these e-mails into your Github profile (they can be hidden), so we can match your e-mails to your Github profile?

@jsoriano jsoriano added Team:Integrations Label for the Integrations team enhancement Metricbeat Metricbeat module in progress Pull request is currently in progress. labels Aug 15, 2019
@jsoriano
Copy link
Member

Thanks @amandahla for starting this!

We have long wanted to have something like a generic SQL module, and now with light modules it can be even more useful.

I see this is in progress, but my main concern with current approach is the use of ODBC. According to the package README it needs CGO, what can be an impediment. I guess you have chosen it because there is no go-native way to connect with a DB2 database?

Also, I think that we could make this one a more generic module if we also allow to configure the driver.

metricbeat.modules:
- module: sql
  metricsets: ["query"]
  hosts: ["localhost"]
  driver: "sqldriver"
  datasource: "sqldsn"
  query: "select metric from mymetrics"

And then we could open the connection with:

 	db, err := sqlx.Open(m.Driver, m.Datasource)

The benefit of this is that we could have an even more generic sql module that could use the drivers we already have (mysql, postgresql...), and we could also postpone the decission of including the cgo ODBC driver. I would also rename the module from odbc to sql.

We will also have to discuss how to store the collected metrics. In generic modules we are tending towards storing all the metrics under the same object, for example all Prometheus metrics are stored under prometheus.metrics.*, we could think in something like this for this SQL metrics. The main difference here is that metrics are typed, so maybe we should store them under sql.metrics.<type>.*.

One last thing, we are adding most of new modules under the x-pack directory, would you mind to move the module to x-pack/metricbeat/module?

@amandahla
Copy link
Author

@jsoriano

  1. I guess you have chosen it because there is no go-native way to connect with a DB2 database?
    Unfortunately, yes. I have used this list as reference and didn't find any other driver, for ODBC or just for DB2, without CGO. :-(

  2. Include driver in configuration and rename to sql.
    Sure, good idea!

  3. "We will also have to discuss how to store the collected metrics. In generic modules we are tending towards storing all the metrics under the same object, for example all Prometheus metrics are stored under prometheus.metrics., we could think in something like this for this SQL metrics. The main difference here is that metrics are typed, so maybe we should store them under sql.metrics..."
    I'll have to study that but agree.

  4. One last thing, we are adding most of new modules under the x-pack directory, would you mind to move the module to x-pack/metricbeat/module?
    Oops...I didn't notice that, I'll change this.

@amandahla
Copy link
Author

  1. Include driver in configuration and rename to sql
    Thinking better about it...Since no database drivers are included in the Go standard library, I guess I would have to import a lot of drivers here to allow this configuration to be anything as odbc, postgresql, mysql...

@jsoriano
Copy link
Member

jsoriano commented Aug 16, 2019

Thinking better about it...Since no database drivers are included in the Go standard library, I guess I would have to import a lot of drivers here to allow this configuration to be anything as odbc, postgresql, mysql...

We already import several drivers in Metricbeat for other modules, I think it should be possible to use them here. And later we can decide the best way of including more drivers.

@amandahla
Copy link
Author

@jsoriano should be something like this?

event: {
  "@timestamp": "2019-08-27T15:08:59.696Z",
  "@metadata": {
    "beat": "metricbeat",
    "type": "_doc",
    "version": "8.0.0"
  },
  "host": {
    "name": "metricbeat"
  },
  "agent": {
    "ephemeral_id": "31aae8f5-499f-42ef-9dd2-4373832ec81f",
    "hostname": "1565454",
    "id": "05155846-1ed3-4c71-be89-31491e81a15b",
    "version": "8.0.0",
    "name": "metricbeat",
    "type": "metricbeat"
  },
  "sql": {
    "query": {
      "metrics": [
        {
          "TBSPACE": "SYSTOOLSPACE"
        },
        {
          "PAGESIZE": "4096"
        },
        {
          "EXTENTSIZE": "4"
        }
      ]
    }
  },
  "event": {
    "duration": 1637527231,
    "dataset": "sql.query",
    "module": "sql"
  },
  "metricset": {
    "name": "query",
    "period": 10000
  },
  "service": {
    "address": "http://127.0.0.1",
    "type": "sql"
  },
  "ecs": {
    "version": "1.1.0"
  }
}

@jsoriano
Copy link
Member

@amandahla yes, something like this, some extra suggestions:

  • We should avoid using arrays of objects as it is difficult to use them well in Elasticsearch, and we don't use them in Beats. See the note about this in the Array datatype docs. In this case I think it should be easy to convert them to plain fields as they are objects with an only element.
  • Not very important, but we could lowercase the field names, maybe with an option enabled by default.
  • We could consider adding some extra fields as metadata, for example we could add the statement executed, and the driver used.
  • If we can distinguish the type of the fields in the response we could use different paths per type category, so we can provide default mappings. On an initial version we could distinguish only numeric fields that can be stored as double, and stored the fields of the rest of types as keywords.

Summarizing, something like this:

  ...
  "sql": {
    "driver": "odbc",
    "query": "select ..."
    "metrics": {
      "numeric": {
        "pagesize": 4096,
        "extentsize": 4
      },
      "string": {
        "tbspace": "SYSTOOLSPACE"
      }
    }
  },
  ...

@amandahla
Copy link
Author

Almost there :-)

"sql": {
    "query": {
      "metrics": {
        "numeric": {
          "extentsize": 4,
          "pagesize": 4096
        },
        "string": {
          "tbspace": "SYSTOOLSPACE"
        }
      },
      "driver": "odbc",
      "query": "select TBSPACE,PAGESIZE,EXTENTSIZE from syscat.tablespaces"
    }
  },

@jsoriano
Copy link
Member

Great! Could you remove the query level?

    "sql": {
      "metrics": {
        "numeric": {
          "extentsize": 4,
          "pagesize": 4096
        },
        "string": {
          "tbspace": "SYSTOOLSPACE"
        }
      },
      "driver": "odbc",
      "query": "select TBSPACE,PAGESIZE,EXTENTSIZE from syscat.tablespaces"
    },

@dedemorton
Copy link
Contributor

dedemorton commented Sep 13, 2019

Can you ping me when the documentation for this module is ready for review? Right now, the content is boiler plate text. We need meaningful descriptions. Thanks!

@amandahla
Copy link
Author

@dedemorton I'm not sure if I did it right, I just wrote the files:

x-pack/metricbeat/module/sql/_meta/docs.asciidoc
x-pack/metricbeat/module/sql/query/_meta/docs.asciidoc

Is that ok?

@amandahla
Copy link
Author

@dedemorton @jsoriano is there anything that I could improve/change here?

@jsoriano
Copy link
Member

jsoriano commented Oct 9, 2019

Hey @amandahla, sorry for not getting back to this earlier, there seems to be some errors in CI related to the new added doc files, you will need to run mage fmt update and commit the changes, you may need to run it from x-pack/metricbeat and from metricbeat directories.

@zube zube bot added the [zube]: Done label Dec 11, 2019
jsoriano pushed a commit to jsoriano/beats that referenced this pull request Dec 11, 2019
The SQL module can collect metrics and other data with
custom queries, using the database drivers included in
Metricbeat.

(cherry picked from commit a3136a4)
@jsoriano jsoriano removed the needs_backport PR is waiting to be backported to other branches. label Dec 11, 2019
jsoriano added a commit that referenced this pull request Dec 12, 2019
The SQL module can collect metrics and other data with
custom queries, using the database drivers included in
Metricbeat.

(cherry picked from commit a3136a4)
@jsoriano jsoriano added test-plan Add this PR to be manual test plan and removed in progress Pull request is currently in progress. labels Jan 14, 2020
@jsoriano jsoriano changed the title [Metricbeat] [WIP] Add SQL module (#13106) [Metricbeat] Add SQL module (#13106) Jan 16, 2020
@jsoriano jsoriano self-assigned this Jan 20, 2020
@jsoriano
Copy link
Member

While manually testing this module I have found that the behaviour of hosts and datasource options were a bit problematic, I am addressing this and adding integration tests in #15686.

@jsoriano
Copy link
Member

I have started this change too #15847. It'd be nice if it reaches 7.6.0, but not so important.

@dagwieers
Copy link

dagwieers commented Nov 27, 2020

The metricbeat sql module is a lifesaver, we are very happy with this inclusion. \o/

But we think there are a few features that could make this even better.

Support a unique identifier for _id

We like to perform queries and updates docs in Elastic. For this we would need to provide the _id value, so it would be nice if we could configure which field is to be used as the _id value. This would allow us to have a rolling window for updating documents (e.g. every hour process the aggregated data for the previous X hours) and have it update those entries in Elastic.

I may be possible to related this to the primary key of a table. But a configurable column would be more convenient IMO

Support an identifier for the @timestamp

We already use a timestamp provided in the query as the timeField in the Index pattern, and this works very well already. But I think it could be useful to have the default @timestamp selected from the query as well so this works out of the box.

For metrics at the time of the query this offers no value, but for time-based aggregations in tables, having a way to influence @timestamp without the need to have ingest pipelines would be very useful.

Storing and using the timestamp of the last successful ingest

We are processing aggregated data grouped by hour, the first time we would like to get to the historical information (all information), but subsequent queries should only cover the most recent windows. In fact, if there would be a variable that would hold the timestamp of the last queried/ingested time we could use this to calculate the timeframe we need to process. In this case if we miss to run the queries in a timely fashion (agent was down) it would pick up where it left off (just like filebeat does when processing logfiles).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Metricbeat Metricbeat module Team:Integrations Label for the Integrations team test-plan Add this PR to be manual test plan test-plan-ok This PR passed manual testing v7.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants