Built-in SQL #3682
Conversation
For metadata, another option would be to have the coordinator hold that information and the brokers cache it from there. It could have a fallback that causes it to update its local state based on segment metadata queries if it's having trouble loading from the coordinator (or through a
@@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Druid - a distributed column store.
wrong header
Ah, I just copied this from another pom. Will change.
@cheddar That means the coordinator would have to know how to issue queries to data nodes, right? I thought we wanted to avoid that.
I tried to test it, and got an error as follows.
Is there something I need to do before executing a SQL query?
Try
I moved the PostgreSQL compat stuff out of this PR, since it barely works and it was clogging up the diff.
Thanks. It works well.
Hrm, yeah, I believe I can likely be quoted as saying we don't want coordinators to query. Though, at this point in time, I could be convinced that as long as the coordinator is not actively involved in the primary query path, that would be good enough. My primary concern with the updating is that you are going to generate load on the cluster for every single broker. While it's arguable whether the load is significant, the fact that it's a linear amount of load is what bothers me, and the suggestion to consolidate on the coordinator was an attempt at making it a constant amount of load. If you can think of another way to make it a constant amount of load, I'd also be down with that.
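The linear-versus-constant distinction can be illustrated with a toy model (purely illustrative arithmetic, not Druid code; the refresh rate is a made-up parameter):

```python
def per_broker_refresh_load(num_brokers: int, refreshes_per_min: int = 1) -> int:
    """Every broker issues its own segment metadata queries:
    load on data nodes grows linearly with the broker count."""
    return num_brokers * refreshes_per_min

def coordinator_refresh_load(num_brokers: int, refreshes_per_min: int = 1) -> int:
    """Only the coordinator refreshes and brokers read its cache:
    load on data nodes is constant regardless of broker count."""
    return refreshes_per_min

print(per_broker_refresh_load(10))   # 10 refreshes/min hit the data nodes
print(coordinator_refresh_load(10))  # 1 refresh/min regardless of brokers
```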
Force-pushed from 10feedd to e49137e
Force-pushed from 51d8a3f to 419b3c0
@cheddar I can think of some ways to reduce the linearly increasing load, like caching metadata more aggressively on historicals, but can't think of a good way to get truly constant load other than moving this to the coordinator. In the meantime I can add a config "druid.sql.enable" that you can use to turn all this off if you don't want it, which will mean that people who don't need SQL won't get the additional metadata load. Would that work for you? And even though it's linear, I don't expect the load of this metadata querying to be too large, since the SegmentMetadataQuery results should mostly be cached on historicals anyway.
Removed the "WIP" tag since I believe this is ready for review. I updated the top comment with the current state of the patch, please refer to that for context. |
Force-pushed from cd0f364 to 862d382
Given that this is pretty experimental, I'm fine with this coming in. I just double-checked that it is opt-in, and it looks like it's primarily just new classes with an entry point of a Jersey resource. That got me wondering if it should be introduced as an extension module? Either way, I'm 👍 with it being included.
|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.selectPageSize`|Page size threshold for [Select queries](../querying/select-query.html). Select queries for larger result sets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true|
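For illustration, these planner settings (plus the `druid.sql.enable` toggle mentioned earlier in the thread) might be set in a broker's `runtime.properties` like this; the property names come from this patch, but the specific values below are arbitrary examples, not recommendations:

```properties
# Opt in to SQL on this broker (disabled by default in this PR).
druid.sql.enable=true

# Refresh cached segment metadata at most once per two minutes.
druid.sql.planner.metadataRefreshPeriod=PT2M

# Use exact GroupBy plans instead of approximate TopN.
druid.sql.planner.useApproximateTopN=false
```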
Can't we switch this at query time? Maybe by adding some runtime config endpoints?
@b-slim that seems useful although right now there is no mechanism for adjusting planner configs at runtime. It could probably be added in a future PR.
- [Query-time lookups](lookups.html).
- [Nested groupBy queries](groupbyquery.html#nested-groupbys).
- Extensions, including [approximate histograms](../development/extensions-core/approximate-histograms.html) and
  [DataSketches](../development/extensions-core/datasketches-aggregators.html).
If we support HLL, I'm not sure why DataSketches can't be supported?
@b-slim just because HLL is in core. I haven't added an extension point yet (was hoping to do that in a future PR) and so only features in core Druid are usable right now.
👍 Given that it has minimal impact and has been tested
@cheddar, re:
The idea behind making it a core module is that I thought that at some point we would want to add an extension point for SQL UDFs for datasketches, approximate histograms, etc. To make that work, either the UDFs would need to live in the various extensions, which would need to require druid-sql, or the UDFs would need to live in druid-sql, which would need to require the various extensions. The former made more sense to me, and given that this implies druid-sql will one day be a dependency of a variety of extensions, it seemed like it belongs in core. Otherwise we get into extensions requiring other extensions that seem only lightly related, and that would be confusing to users. If you have a better idea about how to handle extensions / UDFs with SQL, then I am all ears.
Fixed conflicts with master.
@praveev could you raise a separate issue for this please? Does adding exclusions to the pom help?
This is code corresponding to the proposal at: https://groups.google.com/d/msg/druid-development/3npt9Qxpjr0/F-t--qMNBQAJ
It includes:
There are two ways of issuing SQL queries, both on the broker:
SQL querying is disabled by default in this PR, since enabling it causes brokers to make some extra metadata queries. I expect the extra load will be small, but I want to be conservative.
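As a minimal sketch of the HTTP path (the endpoint URL and JSON payload shape here are assumptions for illustration, not confirmed by this thread), a client might wrap a SQL string for the broker like this:

```python
import json

# Hypothetical broker SQL endpoint; the actual path may differ in this PR.
BROKER_SQL_URL = "http://broker:8082/druid/v2/sql/"

def make_sql_request(sql: str) -> tuple[str, dict, bytes]:
    """Build the pieces of an HTTP POST carrying a SQL query to the broker."""
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"query": sql}).encode("utf-8")
    return BROKER_SQL_URL, headers, body

url, headers, body = make_sql_request("SELECT COUNT(*) FROM wikipedia")
print(body.decode("utf-8"))  # {"query": "SELECT COUNT(*) FROM wikipedia"}
```

The second path mentioned in the thread, JDBC via Calcite, would go through a driver rather than a hand-built HTTP request.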
This depends on the latest Calcite (1.10.0), although some features will only work once 1.11.0 is released. These are marked with `CALCITE_1_11_0` in CalciteQueryTest.
Benchmarks show minimal overhead for our benchmark queries. I expect there to be some overhead, since SQL planning takes some time, and we're also going through Calcite's JDBC adapter on the output side. That could potentially be bypassed in a future patch.
To consider for future patches:
The original PR had some questions, which are answered in the latest diff: