Built-in SQL #3682
Conversation
For metadata, another option would be to have the coordinator hold that information and the brokers cache it from there. It could have a fallback that causes it to update its local state based on segment metadata queries if it's having trouble loading from the coordinator (or through a
@@ -0,0 +1,73 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Druid - a distributed column store.
wrong header
Ah, I just copied this from another pom. Will change.
@cheddar That means the coordinator would have to know how to issue queries to data nodes, right? I thought we wanted to avoid that.
I tried to test it, and got an error as follows.
Is there something I need to do before executing a SQL query?
Try
I moved the PostgreSQL compat stuff out of this PR, since it barely works and it was clogging up the diff.
Thanks. It works well.
Hrm, yeah, I believe I can likely be quoted as saying we don't want coordinators to query. Though, at this point in time, I could be convinced that as long as the coordinator is not actively involved in the primary query path, that would be good enough. My primary concern with the updating is that you are going to generate load on the cluster for every single broker. While it's arguable whether the load is significant, the fact that it's a linear amount of load is what bothers me, and the suggestion to consolidate on the coordinator was an attempt at making it a constant amount of load. If you can think of another way to make it a constant amount of load, I'd also be down with that.
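The linear-versus-constant distinction can be illustrated with a toy model (purely illustrative arithmetic, not Druid code; the refresh rate is a made-up parameter):

```python
def per_broker_refresh_load(num_brokers: int, refreshes_per_min: int = 1) -> int:
    """Every broker issues its own segment metadata queries:
    load on data nodes grows linearly with the broker count."""
    return num_brokers * refreshes_per_min

def coordinator_refresh_load(num_brokers: int, refreshes_per_min: int = 1) -> int:
    """Only the coordinator refreshes and brokers read its cache:
    load on data nodes is constant regardless of broker count."""
    return refreshes_per_min

print(per_broker_refresh_load(10))   # 10 refreshes/min hit the data nodes
print(coordinator_refresh_load(10))  # 1 refresh/min regardless of brokers
```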
Force-pushed from 10feedd to e49137e
Force-pushed from 51d8a3f to 419b3c0
@cheddar I can think of some ways to reduce the linearly increasing load, like caching metadata more aggressively on historicals, but can't think of a good way to get truly constant load other than moving this to the coordinator. In the meantime I can add a config "druid.sql.enable" that you can use to turn all this off if you don't want it, which will mean that people who don't need SQL won't get the additional metadata load. Would that work for you? And even though it's linear, I don't expect the load of this metadata querying to be too large, since the SegmentMetadataQuery results should mostly be cached on historicals anyway.
Removed the "WIP" tag since I believe this is ready for review. I updated the top comment with the current state of the patch, please refer to that for context. |
Force-pushed from cd0f364 to 862d382
Given that this is pretty experimental, I'm fine with this coming in. I just double-checked that it is opt-in, and it looks like it's primarily just new classes with an entry point of a Jersey resource. That got me wondering if it should be introduced as an extension module? Either way, I'm 👍 with it being included.
|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.selectPageSize`|Page size threshold for [Select queries](../querying/select-query.html). Select queries for larger result sets will be issued back-to-back using pagination.|1000|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useApproximateTopN`|Whether to use approximate [TopN queries](../querying/topnquery.html) when a SQL query could be expressed as such. If false, exact [GroupBy queries](../querying/groupbyquery.html) will be used instead.|true|
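For illustration, these planner settings (plus the `druid.sql.enable` toggle mentioned earlier in the thread) might be set in a broker's `runtime.properties` like this; the property names come from this patch, but the specific values below are arbitrary examples, not recommendations:

```properties
# Opt in to SQL on this broker (disabled by default in this PR).
druid.sql.enable=true

# Refresh cached segment metadata at most once per two minutes.
druid.sql.planner.metadataRefreshPeriod=PT2M

# Use exact GroupBy plans instead of approximate TopN.
druid.sql.planner.useApproximateTopN=false
```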
Can't we switch this at query time? Maybe by adding some runtime config endpoints?
@b-slim that seems useful although right now there is no mechanism for adjusting planner configs at runtime. It could probably be added in a future PR.
- [Query-time lookups](lookups.html).
- [Nested groupBy queries](groupbyquery.html#nested-groupbys).
- Extensions, including [approximate histograms](../development/extensions-core/approximate-histograms.html) and
  [DataSketches](../development/extensions-core/datasketches-aggregators.html).
If we support HLL, I'm not sure why DataSketches can't be supported?
@b-slim just because HLL is in core. I haven't added an extension point yet (was hoping to do that in a future PR) and so only features in core Druid are usable right now.
👍 Given that it has minimal impact and has been tested
@cheddar, re:
The idea behind making it a core module is that I thought that at some point we would want to add an extension point for SQL UDFs for datasketches, approximate histograms, etc. To make that work, either the UDFs would need to live in the various extensions, which would need to require druid-sql, or the UDFs would need to live in druid-sql, which would need to require the various extensions. The former made more sense to me, and given that this implies druid-sql will one day be a dependency of a variety of extensions, it seemed like it belongs in core. Otherwise we get into extensions requiring other extensions that seem only lightly related, and that would be confusing to users. If you have a better idea about how to handle extensions / UDFs with SQL, then I am all ears.
Fixed conflicts with master.
@praveev could you raise a separate issue for this please? Does adding exclusions to the pom help?
This is code corresponding to the proposal at: https://groups.google.com/d/msg/druid-development/3npt9Qxpjr0/F-t--qMNBQAJ
It includes:
There are two ways of issuing SQL queries, both on the broker:
SQL querying is disabled by default in this PR, since enabling it causes brokers to make some extra metadata queries. I expect the extra load will be small, but I want to be conservative.
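As a minimal sketch of the HTTP path (the endpoint URL and JSON payload shape here are assumptions for illustration, not confirmed by this thread), a client might wrap a SQL string for the broker like this:

```python
import json

# Hypothetical broker SQL endpoint; the actual path may differ in this PR.
BROKER_SQL_URL = "http://broker:8082/druid/v2/sql/"

def make_sql_request(sql: str) -> tuple[str, dict, bytes]:
    """Build the pieces of an HTTP POST carrying a SQL query to the broker."""
    headers = {"Content-Type": "application/json"}
    body = json.dumps({"query": sql}).encode("utf-8")
    return BROKER_SQL_URL, headers, body

url, headers, body = make_sql_request("SELECT COUNT(*) FROM wikipedia")
print(body.decode("utf-8"))  # {"query": "SELECT COUNT(*) FROM wikipedia"}
```

The second path mentioned in the thread, JDBC via Calcite, would go through a driver rather than a hand-built HTTP request.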
This depends on the latest Calcite (1.10.0), although some features will only work once 1.11.0 is released. These are marked with `CALCITE_1_11_0` in CalciteQueryTest.
Benchmarks show minimal overhead for our benchmark queries. I expect there to be some overhead, since SQL planning takes some time, and we're also going through Calcite's JDBC adapter on the output side. That could potentially be bypassed in a future patch.
To consider for future patches:
The original PR had some questions, which are answered in the latest diff: