Druid Catalog basics #13165

Merged 23 commits into apache:master on Nov 12, 2022

Conversation

@paul-rogers (Contributor) commented Oct 1, 2022

The Druid catalog provides a collection of metadata "hints" about tables (datasources, input sources, views, etc.) within Druid. This PR provides the foundation: the DB and REST layer, but not yet the integration with the Calcite SQL layer. This is a much-refined version of the earlier catalog PR.

The DB layer extends what is done for other Druid metadata tables. The semantic ("business logic") layer provides the usual CRUD operations on tables. The entire design is pretty standard and follows Druid patterns. The key difference is the rather extreme lengths taken by the implementation to ensure each bit is easily testable without mocks. That means many interfaces which can be implemented in multiple ways.

Parts available in this PR include:

  • The metadata DB storage layer (in an extension)
  • The basic "catalog object model" that describes the properties and columns of catalog tables.
  • A basic set of tables: datasources and three kinds of external tables (inline, local, and HTTP).
  • A REST API layer to perform CRUD operations on tables.
  • Unit tests
  • An integration test of the catalog REST API.

The catalog mechanism is split into two parts.

  • The "core" part which describes catalog objects, and which is can model data from a variety of catalog systems.
  • The druid-catalog extension which stores data in the Druid metadata database.

This split exists for two reasons:

  • Many metadata systems exist: HMS, Amazon Glue, various commercial solutions, etc. We anticipate that some shops may wish to obtain metadata from these other systems, in the same way that some shops get their security information from external systems.
  • Druid's database schema evolution system is rather basic. (And, by "basic", we mean "nonexistent.") There is some chance that the remaining development will change the schema, which upgrades cannot support. Users who enable the druid-catalog extension now acknowledge that they are using it only for testing, not production, and at their own risk.

Functionality not in this PR, but which will appear in the next one, includes:

  • The synchronization mechanism between the Coordinator and Broker.
  • SQL table functions to make use of catalog entries.
  • Integration of catalog properties to simplify MSQ ingest statements (INSERT and REPLACE).
  • Integration of catalog schema information with SELECT queries.
  • The remaining set of external table types.
  • Views.

This is a great opportunity for reviewers to provide guidance on the basic catalog mechanism before we start building SQL integration on top. Please see the Druid catalog issue for additional details about the goals and design of the catalog.

Update 10-17

After much debate, we decided to drop the rollup feature from the catalog. It turns out that "rollup" is used for many purposes, and more thought is needed to work out how users might want to use it. Compared with the Catalog proposal, the initial feature won't help with MSQ ingestion of rollup tables: users specify on each ingestion how they want those segments rolled up.

A new commit removes the portions of the code that supported the proposed, but now dropped, rollup datasource type.


This PR has:

  • been self-reviewed.
  • has a design document here.
  • added documentation for new or modified features or behaviors. (Not yet: the functionality is not yet user visible.)
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

lgtm-com bot commented Oct 1, 2022

This pull request introduces 8 alerts when merging c958a98 into ebfe1c0 - view on LGTM.com

new alerts:

  • 3 for Inconsistent equals and hashCode
  • 2 for Spurious Javadoc @param tags
  • 1 for Unused format argument
  • 1 for Useless null check
  • 1 for Polynomial regular expression used on uncontrolled data

Catalog object model for tables, columns
Druid metadata DB storage (as an extension)
REST API to update the catalog (as an extension)
Integration tests
Model only: no planner integration yet
lgtm-com bot commented Oct 4, 2022

This pull request introduces 6 alerts when merging bd7c8be into 7fa53ff - view on LGTM.com

new alerts:

  • 2 for Inconsistent equals and hashCode
  • 1 for Spurious Javadoc @param tags
  • 1 for Unused format argument
  • 1 for Useless null check
  • 1 for Polynomial regular expression used on uncontrolled data

@paul-rogers (Contributor, Author) commented:

Build is clean except for one unrelated flaky failure in a Kafka IT. Will rerun later as I'm sure someone will have comments that require code changes and a new build.

@rohangarg (Member) left a comment:

Thanks a lot for the design issue and changes @paul-rogers! 👍
I've taken a first pass at the core changes so far - posting the current review as a checkpoint while I review the remaining part.
Thanks for the in-code explanations and patience as well!

Addressed direct review comments
Removed rollup support per offline discussion
@paul-rogers (Contributor, Author) left a comment:

@rohangarg, thanks for your review. Addressed your comments. Also revised the model to remove the "rollup" concept, leaving just the one "datasource" table type.

@FrankChen021 (Member) left a comment:

The change is huge; I've only taken a quick look through it and left some minor suggestions.

* If SQL based, return the SQL version of the metastore
* connector. Throws an exception if not SQL-based.
*/
SQLMetadataConnector sqlConnector();
Reviewer (Member) commented:

Regarding this interface and isSql above: I don't think we need such a specialized method at this interface level. A more appropriate way is to change it to

MetadataStorageConnector connector();

and remove the isSql method.

SQLCatalogManager can then check whether the returned connector is an instance of SQLMetadataConnector.
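
For illustration, a minimal sketch of the suggested shape; CatalogStorage and SqlCatalogManagerSketch are placeholder names rather than the PR's actual classes, assuming Druid's existing MetadataStorageConnector, SQLMetadataConnector, and ISE types:

import org.apache.druid.java.util.common.ISE;
import org.apache.druid.metadata.MetadataStorageConnector;
import org.apache.druid.metadata.SQLMetadataConnector;

interface CatalogStorage
{
  // Generic connector: no isSql() or sqlConnector() at this level.
  MetadataStorageConnector connector();
}

class SqlCatalogManagerSketch
{
  private final SQLMetadataConnector sqlConnector;

  SqlCatalogManagerSketch(CatalogStorage storage)
  {
    MetadataStorageConnector connector = storage.connector();
    // The SQL-specific manager narrows the type itself, failing fast otherwise.
    if (!(connector instanceof SQLMetadataConnector)) {
      throw new ISE("Druid catalog requires a SQL metadata store");
    }
    this.sqlConnector = (SQLMetadataConnector) connector;
  }
}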

@paul-rogers (Contributor, Author) replied:

You are probably right. This is a work in progress: I have not yet set up a way to test with one of the non-SQL storage extensions. The DB integration is rather a hack: extensions can't really depend on one another, so the extension-DB code that creates tables (etc.) can't know about the catalog (yet), and so that code lives here. Since it lives here, we have to know that we are, in fact, dealing with a SQL DB, so the code can blow up if the storage is something else (this extension doesn't know how to create the table-like thing that the target non-SQL DB needs).

Druid is in dire need of a more general DB storage API (as well as a mechanism for DB schema evolution).

Reviewer (Member) replied:

In the same vein, I think the getDBI method on the SQLMetadataConnector interface feels a bit like a leak, but given the lack of choices, I'm OK with the general current implementation.
A small styling suggestion would be to use MetadataStorage instead of Metastore to adhere to the current naming convention.

@paul-rogers (Contributor, Author) replied:

A more general issue is how we should handle metadata storage for extensions. The catalog is an extension for now. Once it is merged into core Druid, the DB access code can be distributed across the SQL extensions and other extensions.

For an extension, however, the storage has to live in the extension itself. It is hard to see how, say, the catalog extension would work with, say, a MongoDB storage extension: either that extension has to know about the catalog extension, or vice versa.

@paul-rogers (Contributor, Author) replied:

Renamed class and combined the interface and implementation since it turned out there was only one implementation.

@zachjsh self-requested a review October 24, 2022 18:58
@paul-rogers (Contributor, Author) left a comment:

Thank you @rohangarg and @FrankChen021 for your reviews. I've either updated the code or provided a description for each of our comments.


register(new SchemaDefnImpl(
TableId.LOOKUP_SCHEMA,
ResourceType.CONFIG,
null // TODO
@paul-rogers (Contributor, Author) commented:

It means that this early draft does not yet include lookup tables and so there are no valid table types for the lookup schema. Ideally, that will change as the project progresses so that lookup tables can have entries in the catalog.

register(new SchemaDefnImpl(
TableId.VIEW_SCHEMA,
ResourceType.VIEW,
null // TODO
@paul-rogers (Contributor, Author) commented:

Druid currently offers multiple view systems: one via an extension, another two available from Imply. There may be more. Eventually, it will make sense to define views in the catalog, just as is done in most DB systems. The TODO here is a reminder that that work is pending until we get the basics working.

public class ClusterKeySpec
{
private final String expr;
private final boolean desc;
@paul-rogers (Contributor, Author) commented:

My understanding is that clustering is sorting and sorting is either ascending or descending. (The Druid implementation seems to always use nulls-first sorting.) See SortColumn in MSQ. In fact, maybe this should just reuse SortColumn, now that I know that class exists...

Can you give me a hint as to what other options we might offer in the future? That way, we can figure out how to encode them.

I believe @clintropolis is working on some cool new indexing ideas. If that results in more than one index choice per column (e.g., front-encoding or not), then we can add properties to columns for per-column choices, or to the table for cross-column choices. (The reason that clustering is a table property is that it includes more than one column.)

}
TableSpec spec = tableSpec.spec();
if (CatalogUpdateNotifier.TOMBSTONE_TABLE_TYPE.equals(spec.type())) {
listener.deleted(tableSpec.id());
Reviewer (Contributor) commented:

Why not have 2 different endpoints; one for update, one for delete?

@paul-rogers (Contributor, Author) replied:

Or three: one for add, another for update, and another for delete. Then, replicate that for other catalog objects: connections, etc. Pretty soon one has a big API to maintain. Since the goal here is to simply shift delta messages, we use the "tombstone encoding" to merge the delete and the add/update cases.

The code right now is specific to tables because that's all this version of the catalog includes. Still, due to the rolling-upgrade forward/backward compatibility issue, perhaps this needs to change. Perhaps another approach is to wrap the table in an event message:

{
  "objectType": "table",
  "action": "delete",
  "table": {
    "schema": "druid",
    "name": "doomed"
  }
}

A receiver would ignore object types it doesn't understand (it means the coordinator is newer than the broker.) I'll play with that a bit to see if that will work.
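
As a rough sketch of that idea (the CatalogEventReceiver class and the listener calls in the comments are hypothetical, mirroring the JSON above rather than any actual wire format in this PR):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;

class CatalogEventReceiver
{
  private static final ObjectMapper MAPPER = new ObjectMapper();

  void onMessage(String json) throws IOException
  {
    JsonNode event = MAPPER.readTree(json);
    // An older Broker simply skips object types added by a newer Coordinator.
    if (!"table".equals(event.path("objectType").asText())) {
      return;
    }
    JsonNode table = event.path("table");
    String schema = table.path("schema").asText();
    String name = table.path("name").asText();
    if ("delete".equals(event.path("action").asText())) {
      // e.g. listener.deleted(schema, name);
    } else {
      // e.g. listener.updated(schema, name, table);
    }
  }
}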

}
}

private Response authorizeTable(TableId tableId, TableSpec spec, final HttpServletRequest req)
Reviewer (Contributor) commented:

Why not use resourceFilter?

@paul-rogers (Contributor, Author) replied:

A resource filter is handy when the resource pattern is known and multiple endpoints follow the same pattern. At present, the endpoints here follow different patterns. This code is more like the Broker (have to root around in the SQL to find the resources) than, say, tasks, where it is obvious that the task ID is to be authorized, and there are multiple task endpoints.

The filter is a crude tool. Another form of reuse is the subroutine, which is the technique used here.

That said, if the redesign of the REST API results in multiple messages, then we'll consider moving the auth code to a filter if that makes the code simpler.
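
For context, the shared-subroutine style reads roughly like the sketch below; CatalogAuthSketch, the datasource-write resource mapping, and the null-means-authorized convention are illustrative assumptions rather than the PR's actual code, and the Druid security helpers are used as I understand them:

import java.util.Collections;
import javax.servlet.http.HttpServletRequest;
import javax.ws.rs.core.Response;
import org.apache.druid.server.security.Access;
import org.apache.druid.server.security.Action;
import org.apache.druid.server.security.AuthorizationUtils;
import org.apache.druid.server.security.AuthorizerMapper;
import org.apache.druid.server.security.Resource;
import org.apache.druid.server.security.ResourceAction;
import org.apache.druid.server.security.ResourceType;

class CatalogAuthSketch
{
  private final AuthorizerMapper authorizerMapper;

  CatalogAuthSketch(AuthorizerMapper authorizerMapper)
  {
    this.authorizerMapper = authorizerMapper;
  }

  // Each endpoint resolves its own resource, then calls this shared routine;
  // a null return means "authorized", otherwise the caller returns the error response.
  Response authorizeDatasourceWrite(String datasource, HttpServletRequest req)
  {
    ResourceAction resourceAction = new ResourceAction(
        new Resource(datasource, ResourceType.DATASOURCE),
        Action.WRITE
    );
    Access access = AuthorizationUtils.authorizeAllResourceActions(
        req,
        Collections.singletonList(resourceAction),
        authorizerMapper
    );
    if (!access.isAllowed()) {
      return Response.status(Response.Status.FORBIDDEN).build();
    }
    return null;
  }
}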

lgtm-com bot commented Nov 3, 2022

This pull request introduces 1 alert when merging 5de07ce into c875f4b - view on LGTM.com

new alerts:

  • 1 for Spurious Javadoc @param tags

@zachjsh (Contributor) left a comment:

Thanks for making all the changes! Once we get the tests fixed, I think this is good to go.

@paul-rogers merged commit 81d005f into apache:master Nov 12, 2022
@paul-rogers deleted the 220930-catalog branch November 12, 2022 23:30
@paul-rogers mentioned this pull request Nov 13, 2022
@kfaraz added this to the 25.0 milestone Nov 22, 2022