coord: major cleanup #3982

benesch · 2020-08-19T07:27:44Z

This series of refactoring commits tries to get a handle on the growing complexity in coord, in order to facilitate pg_catalog work (#913, MaterializeInc/database-issues#748). I've done my best to split things into small, meaningful commits with good commit messages, but let me know if anything is unclear.

This change is

We were duplicating the dependency information in both the coordinator metadata and the catalog. This added complexity and the possibility of the metadata getting out of sync. Rework so that the coordinator just uses the view dependency information from the catalog, rather than duplicating it. Note that the comment that said "Only views, not sources, on which the views depends" appears to have been entirely wrong.

Whether a view is "queryable" or not derives from whether it is possible to find indexes for all of the view's inputs. We were previously denormalizing this information in the coordinator and keeping it up to date by manually propagating querability when new indexes were created. There were two big downsides to this approach: * The logic was a bit brittle and error prone. Modifications to the coordinator could easily result in missetting the queryability state, which caused hard to track down bugs. * The metadata only existed in the coordinator, and so `SHOW VIEWS` and `SHOW SOURCES` had to be special-cased in order to be computed in the coordinator. Besides being a bit annoying, this was about to impede #913 (mirroring `SHOW` data as views), and prevented using `SHOW {SOURCES,VIEWS} ... WHERE`. The new approach simply computes the queryability property on demand using information that is available in the catalog, resolving both of the above downsides. One potential downside of the new approach is that `SHOW VIEWS` has to do more work to compute queryability, but the performance of `SHOW VIEWS` is not particularly important. The hot path (`SELECT`) is probably unaffected by this patch, since computing queryability is done concurrently with finding the nearest indexes for the view.

The information contained within ViewState is easily derived from the catalog. Removing it avoids a lot of superfluous bookkeeping.

Computing whether a view depends on a table is easy enough to do on the fly, and is also more correct, because it properly captures transitive dependencies.

Doesn't seem like there is any good reason to have these separate.

This cuts down on the size of coord.rs, which is still too large.

This relieves some more pressure on coord.rs.

There are no behavioral changes in this commit. The changes include: * Using the "handle_XXX" terminology consistently to refer to the top-level handler function for a coordinator command. * Colocating all the hander functions near the top of the coordinator impl block. * Removing unnecessary `pub` designations from a number of methods/functions, and removing some resulting dead code.

Reconfiguring the APIs for building and shipping DataflowDescs results in much simpler call sites, and hopefully easier-to-understand mutation patterns. In the new API: * `build...dataflow` methods do not mutate any coordinator state, and return a DataflowDesc. * `import...into_dataflow` methods do not mutate any coordinator state, but modify the provided `DataflowDesc`. * neither the build nor import APIs take `catalog::{Item,View}`, as these catalog types are unnecessarily hard to construct for transient dataflows. * the `ship_dataflow` method is the only method that mutates coordinator state based on the provided `DataflowDesc`. It also broadcasts that `DataflowDesc` to the dataflow workers. This isn't perfect, but it should make things much easier to reason about, because the construction of the dataflow description is better separated from updating coordinator state.

JLDLaughlin · 2020-08-19T14:33:31Z

src/coord/src/catalog.rs

+                    *complete = false;
+                }
+                CatalogItem::Table(_) => {
+                    unreachable!("tables always have at least one index");


Yep, totally! (See #3905.) I think we just need to disable dropping the default index on tables, since it's really unusual that you could drop an index on a table and accidentally delete all the data in it.

That makes sense! In the meantime, this isn't actually unreachable!() though. Should we return an error instead and put a todo? Or are we okay with this panic?

I'm ok with it for just a moment since a) tables are experimental, and b) I'll have a PR out by end of day! Thanks for raising, though! I won't forget!

JLDLaughlin · 2020-08-19T14:38:54Z

src/coord/src/coord.rs

@@ -116,8 +116,6 @@ where
    optimizer: Optimizer,
    catalog: Catalog,
    symbiosis: Option<symbiosis::Postgres>,
-    /// Maps (global Id of view) -> (existing indexes)
-    views: HashMap<GlobalId, ViewState>,


JLDLaughlin · 2020-08-19T14:41:26Z

src/coord/src/coord.rs

@@ -1356,7 +1344,7 @@ where
                typ: _,
            } = source.as_ref()
            {
-                if let Some(Some((index_id, _))) = self.views.get(&id).map(|v| &v.default_idx) {
+                if let Some((index_id, _)) = self.catalog.indexes()[&id].first() {


Why first() here?

That's just the convention for the "default" index. But you're right that that's very unclear. I added a Catalog::default_index_for method that makes things read more nicely.

ruchirK · 2020-08-19T16:24:19Z

I skimmed through these changes but everything looks great to me!

ruchirK · 2020-08-19T16:25:14Z

I definitely appreciated the thoughtful commits and commit messages btw :D

wangandi

👍

Arrangement state going into its own file and queryability being in the catalog has been long overdue.

wangandi · 2020-08-19T17:08:32Z

src/coord/src/coord.rs

        if dataflow.objects_to_build.iter().any(|bd| &bd.id == id)
            || dataflow.source_imports.iter().any(|(i, _)| i == id)
        {
            return;
        }
-        if let Some((index_id, keys)) = &self.views[id].default_idx {
+        // A valid index is any index on `id` that is known to the dataflow
+        // layer, as indicated by its presence in `self.indexes`.s


Nit: extra s at end of sentence

Thanks, will fix in a follow up to avoid going through CI again.

Follow up from MaterializeInc#3982.

benesch added 6 commits August 18, 2020 15:41

coord: remove ViewState

f4d240f

The information contained within ViewState is easily derived from the catalog. Removing it avoids a lot of superfluous bookkeeping.

coord: don't track table indexes

1ef9e8b

Computing whether a view depends on a table is easy enough to do on the fly, and is also more correct, because it properly captures transitive dependencies.

coord: combine dataflow validation and broadcasting

623e063

Doesn't seem like there is any good reason to have these separate.

coord: split arrangement_state mod into dedicated file

697f970

This cuts down on the size of coord.rs, which is still too large.

benesch requested review from ruchirK and wangandi August 19, 2020 07:27

benesch mentioned this pull request Aug 19, 2020

coord: don't track view uses separately from catalog #3979

Closed

benesch added 3 commits August 19, 2020 03:31

coord: move dump_catalog function to catalog mod

5588d70

This relieves some more pressure on coord.rs.

benesch force-pushed the coord-refactor branch from 2d7ea9b to 8dd1ca8 Compare August 19, 2020 07:32

JLDLaughlin reviewed Aug 19, 2020

View reviewed changes

ruchirK approved these changes Aug 19, 2020

View reviewed changes

wangandi approved these changes Aug 19, 2020

View reviewed changes

benesch merged commit 3623229 into MaterializeInc:main Aug 19, 2020

benesch deleted the coord-refactor branch August 19, 2020 19:57

benesch added a commit to benesch/materialize that referenced this pull request Aug 19, 2020

coord: fix typo in comment

082d818

Follow up from MaterializeInc#3982.

benesch mentioned this pull request Aug 19, 2020

coord: fix typo in comment #3990

Merged

benesch added a commit to benesch/materialize that referenced this pull request Aug 19, 2020

coord: make default index assumptions more obvious

f4e5577

Follow up from MaterializeInc#3982.

benesch mentioned this pull request Aug 19, 2020

coord,sql: fix some table buglets #3992

Merged

benesch added a commit to benesch/materialize that referenced this pull request Aug 20, 2020

coord: make default index assumptions more obvious

e414314

Follow up from MaterializeInc#3982.

benesch added a commit to benesch/materialize that referenced this pull request Aug 20, 2020

coord: make default index assumptions more obvious

76ea970

Follow up from MaterializeInc#3982.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

coord: major cleanup #3982

coord: major cleanup #3982

benesch commented Aug 19, 2020 •

edited

Loading

JLDLaughlin Aug 19, 2020

benesch Aug 19, 2020

JLDLaughlin Aug 19, 2020

benesch Aug 19, 2020

JLDLaughlin Aug 19, 2020

JLDLaughlin Aug 19, 2020

benesch Aug 19, 2020

ruchirK commented Aug 19, 2020

ruchirK commented Aug 19, 2020

wangandi left a comment

wangandi Aug 19, 2020

benesch Aug 19, 2020

coord: major cleanup #3982

coord: major cleanup #3982

Conversation

benesch commented Aug 19, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ruchirK commented Aug 19, 2020

ruchirK commented Aug 19, 2020

wangandi left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benesch commented Aug 19, 2020 •

edited

Loading