Foundation for the Druid metadata catalog #12647
Conversation
This pull request introduces 1 alert when merging 44ad179 into afaea25 - view on LGTM.com.
Provides the DB and REST layer, but not yet the integration with the Calcite SQL layer.
(Force-pushed from 6374b80 to 27614b9.)
Will continue this in a private branch.
Initial review focused on CatalogResource, TableSpec, and its subclasses DatasourceSpec + InputTableSpec. Those seem like the central classes, so I wanted to focus on them.
I haven't looked at most of the classes in detail yet beyond the ones I mentioned above, but the general structure of things does look good to me.
@JsonProperty("rollupGranularity") String rollupGranularity, | ||
@JsonProperty("targetSegmentRows") int targetSegmentRows, | ||
@JsonProperty("enableAutoCompaction") boolean enableAutoCompaction, | ||
@JsonProperty("autoCompactionDelay") String autoCompactionDelay, |
Is this meant to be a replacement or alternative for (certain parts of) DataSourceCompactionConfig?
How will we reconcile what's here with any existing DataSourceCompactionConfig?
This is still preliminary: just working out the storage and REST layers at the moment. For this one, the thought is that if there is a compaction spec, it takes precedence. If the spec exists but leaves out this property, this is the value we use. If the spec doesn't exist, this info takes over (for the simple case of direct compaction). Details TBD when we get that far.
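To make that precedence concrete, here is a minimal sketch in Java; the helper and its parameters are hypothetical, not part of this PR:

// Hypothetical helper illustrating the fallback order described above:
// an explicit compaction-spec value wins, then the catalog value, then a default.
static String resolveAutoCompactionDelay(
    String compactionSpecDelay,   // from the compaction config, if any
    String catalogDelay,          // from the catalog's DatasourceSpec
    String defaultDelay)
{
  if (compactionSpecDelay != null) {
    return compactionSpecDelay;   // compaction spec exists and sets the property
  }
  if (catalogDelay != null) {
    return catalogDelay;          // spec missing or silent on this property
  }
  return defaultDelay;            // neither defined: fall back to the system default
}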
public DatasourceSpec(
    @JsonProperty("segmentGranularity") String segmentGranularity,
    @JsonProperty("rollupGranularity") String rollupGranularity,
Generally we call this queryGranularity. But! What do you think about getting rid of it? I've started to think that it makes more sense to represent this as an explicit TIME_FLOOR function call rather than a table property.
Good points. I was always confused by the "query" granularity: it has little to do with a query; it is the ingest/compaction-time granularity. Hence my attempt to sneak in a different name. In fact, I could imagine having a "true" query granularity as a separate field: say we decided to apply rollup of 1m, but used to have 1s. To get consistent results, use a query-time granularity of 1m, even for the older 1s segments. But that's just a whim, not implemented here (because it would take additional query work).
As to the usage: I'm aware of the discussion. The other half of the argument is that other dimensions might be similarly trimmed. Geo data might be rounded to a city level. Sales data rounded from (store, cashier, lane) to just (store). In this case, time is similarly rounded.
Since this is a prototype, the thought was that time is special: it is what enables other forms of compression. If the rollup granularity is missing, Druid stores data at the detail level, even if I do the other dimension rounding. Only by making the rollup grain some actual time period (even 1ms) do I get rollup. So, time is special.
The other approach, when we work out the details of dimensions and measures, is to drop this field, add a "rollupEnabled" field, and require the user to specify the grain via a TIME_FLOOR attached to the __time column, in parallel with the SUM(LONG) attached to a measure.
@JsonProperty("targetSegmentRows") int targetSegmentRows, | ||
@JsonProperty("enableAutoCompaction") boolean enableAutoCompaction, | ||
@JsonProperty("autoCompactionDelay") String autoCompactionDelay, | ||
@JsonProperty("properties") Map<String, Object> properties, |
When is something a property vs. a top-level, named thing? Do you have any properties in mind?
Properties are "extension" items provided by something other than Druid itself. For example, I might want to track whether the column contains PII. Or I might want to track the input source that defined the column. Or I might want to add info about the kind of UI widget to use to display it. Rather than creating my own parallel schema for such use cases, I just add a custom property. We might define a naming convention such as "com.foo.is-pii" or "org.whiz.lineage.input-source". Druid doesn't understand them, but the external tool (or user) does.
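For illustration only (the keys come from the examples above; the values and the ImmutableMap usage are just a sketch), such a properties map might look like:

import java.util.Map;
import com.google.common.collect.ImmutableMap;

// Free-form, user- or tool-defined properties; Druid stores them but does not interpret them.
Map<String, Object> properties = ImmutableMap.of(
    "com.foo.is-pii", true,                              // a governance tool marks the column as PII
    "org.whiz.lineage.input-source", "s3://bucket/raw"   // a lineage tool records where the data came from
);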
public MeasureSpec(
    @JsonProperty("name") String name,
    @JsonProperty("sqlType") String sqlType,
    @JsonProperty("aggregateFn") String aggregateFn
What kind of string is meant to be in here? How are we going to interpret it?
I'm especially interested in this because I've been contemplating recently what's the best way to write an INSERT or REPLACE into a rollup table.
/**
 * Helper functions for the catalog REST API actions.
 */
public class Actions |
Are you thinking this class would be useful for other server APIs in the future? It seems written in such a way that it would be.
Yes. I've already written variations of these several times here and there in Druid. Making it more general is left as a later exercise to minimize the size of this PR.
public InputTableSpec(
    @JsonProperty("inputSource") InputSource inputSource,
    @JsonProperty("format") InputFormat format,
I'd go with inputSource + inputFormat, for rhyming purposes, & because it's what indexing tasks and ExternalDataSource do.
OK. I was kind of leaning toward the Go pattern: terseness. The only "format" it could possibly be is the "input format". Same is true for "input source", but I didn't spend time to simplify it yet. Same reason the other fields are not tableColumns and userDefinedProperties. Changed it to inputFormat for now; we can clean it up in a renaming pass later.
@JsonProperty("inputSource")
public InputSource inputSource()
Will there be a way to parameterize the source somehow at runtime? I feel that a prime use case for input tables is going to be incremental ingestion, meaning the input source will change from statement to statement.
Absolutely; that is the next bit of work to be done. Having fun with the old "what SQL syntax can we use for this non-standard concept" game. Current thought is a Calcite macro, something like:
SELECT *
FROM TABLE(INPUT(myInputTable, files = "foo.csv, bar.csv"))
With the bits and pieces adjusted to fit Calcite's existing constraints. Would love to mimic Snowflake:
SELECT *
FROM myInputTable(files = "foo.csv, bar.csv")
But Calcite has already grabbed that syntax to specify columns:
FROM myInputTable(a VARCHAR, b INT, c DOUBLE)
Anyway, this is work in progress; details to be discussed separately.
    table.dbSchema(),
    table.name()));
} else {
  return Actions.okWithVersion(0);
Should this return the current version of the already-existing table?
It could, but that would take an extra DB read. This feature mimics the SQL CREATE TABLE myTempTable IF NOT EXISTS use case where all we want is to not fail if we've already done this step; we typically don't then change the definition.
If we do want to change anything, we've got to read the existing values, which would provide the version. By not providing the version here, we save a DB read internally, since the failed SQL INSERT didn't fetch it.
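A rough sketch of that flow, assuming a hypothetical DB-layer call and duplicate-key signal (only Actions.okWithVersion appears in the code quoted above):

try {
  // The INSERT succeeds only when no row exists yet for (dbSchema, name);
  // on success it yields the new row's version.
  long version = insertTableSpec(table);      // hypothetical DB-layer method
  return Actions.okWithVersion(version);
}
catch (TableAlreadyExistsException e) {       // hypothetical "duplicate key" signal
  // ifNotExists semantics: treat "already there" as success. The failed INSERT
  // never read the existing row, so there is no current version to return.
  return Actions.okWithVersion(0);
}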
@PathParam("dbSchema") String dbSchema, | ||
@PathParam("name") String name, | ||
TableSpec spec, | ||
@QueryParam("version") long version, |
The version thing is cool. I'm a fan of this sort of thing in CRUD APIs.
Killed three birds with one stone: an update time, a free way to do optimistic locking for those who are into such things, and a way to help keep the remote cache in sync. It will be foiled by those who make more than one change per ms, but I suspect that will happen rarely. If it does, the cheat is to sleep for 1 ms to bump the number. To be bullet-proof, there needs to be some prevention of moving backwards if the auto clock sync decides our system clock is moving fast and sets it back. We'll fine-tune that later, once the basics are seen to work.
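A minimal sketch of that scheme, assuming the version is the update's millisecond timestamp and the write is guarded by a compare-and-swap style UPDATE (table and column names are illustrative):

// Next version: the current clock, but never less than oldVersion + 1. This is one
// possible guard against two edits in the same millisecond or a clock stepped
// backwards; the final approach is deferred in the PR.
long newVersion = Math.max(System.currentTimeMillis(), oldVersion + 1);

// Optimistic lock, expressed as the SQL the DB layer might run:
//   UPDATE tableDefs
//      SET payload = :payload, version = :newVersion
//    WHERE dbSchema = :dbSchema AND name = :name AND version = :oldVersion
// Zero rows updated means someone else changed the entry first; the caller
// re-reads and retries, or reports a conflict.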
@JsonProperty("enableAutoCompaction") boolean enableAutoCompaction, | ||
@JsonProperty("autoCompactionDelay") String autoCompactionDelay, | ||
@JsonProperty("properties") Map<String, Object> properties, | ||
@JsonProperty("columns") List<DatasourceColumnSpec> columns |
Is this column list meant to be authoritative, or partial?
Partial: only those for which the user wants to provide info beyond what Druid already knows. Examples:
- Add a new column, not yet in any datasource, to use in ingestion.
- Column is ingested with multiple types, pick one as the preferred type.
- Column exists, but no longer needed. Mark it as hidden.
- Add a comment to explain the column.
@gianm, thanks for the review. Your many questions illustrate why I pulled this PR back. On the one hand, I want early feedback before I build on this foundation (so thanks for the comments!), but on the other hand, until later work is done, objects and properties are preliminary placeholders and subject to change. So it might be a bit early to promote the work to prime time.
The Druid catalog provides a collection of metadata "hints" about tables (datasources, input sources, views, etc.) within Druid. This PR provides the foundation: the DB and REST layer, but not yet the integration with the Calcite SQL layer.
The DB layer extends what is done for other Druid metadata tables. The semantic ("business logic") layer provides the usual CRUD operations on tables, as well as operations to sync metadata between the Coordinator and Broker. A synchronization layer handles the Coordinator/Broker sync: the Broker polls for the information it does not yet have, and the Coordinator pushes updates to known Brokers.
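As a conceptual sketch only (these interface and type names are illustrative, not the PR's actual API), the two sync directions might look like:

import java.util.List;

// TableMetadata stands in for whatever record type the catalog ends up exposing.
// Broker side: poll the Coordinator for entries newer than what is cached locally.
interface CatalogPullClient
{
  List<TableMetadata> pullSince(long lastKnownVersion);
}

// Coordinator side: push deltas to each registered Broker as catalog entries change.
interface CatalogPushListener
{
  void onUpdate(TableMetadata updatedEntry);
  void onDelete(String dbSchema, String name);
}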
The entire design is pretty standard and follows Druid patterns. The key difference is the rather extreme lengths taken by the implementation to ensure each bit is easily testable without mocks. That means many interfaces which can be implemented in multiple ways.
While the entire catalog mechanism is present in this PR, the Guice configuration is not yet enabled, so the catalog itself is not yet active. This project has created, or depends on, multiple in-flight PRs, and it is becoming a bit complex to combine them all in a private branch. This is one of several PRs that provide slices of the catalog work.
We'll want to create integration tests when we enable the feature, and that work is waiting for the new IT PR to be merged.
The next step in the catalog work is to integrate the catalog with the Druid planner. For that, we'll need the planner test framework to be merged.
This code will likely evolve as we work on the SQL layer. Some of that work has already been done in a private branch and suggests that the present code is pretty much on the right track: we'll just expand the table and column definitions as needed.
This is a great opportunity for reviewers to provide guidance on the basic catalog mechanism before we start building SQL integration on top.
This PR has: