Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coord: major cleanup #3982

Merged
merged 9 commits into from
Aug 19, 2020
Merged

coord: major cleanup #3982

merged 9 commits into from
Aug 19, 2020

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Aug 19, 2020

This series of refactoring commits tries to get a handle on the growing complexity in coord, in order to facilitate pg_catalog work (#913, MaterializeInc/database-issues#748). I've done my best to split things into small, meaningful commits with good commit messages, but let me know if anything is unclear.


This change is Reviewable

We were duplicating the dependency information in both the coordinator
metadata and the catalog. This added complexity and the possibility of
the metadata getting out of sync. Rework so that the coordinator just
uses the view dependency information from the catalog, rather than
duplicating it.

Note that the comment that said "Only views, not sources, on which the
views depends" appears to have been entirely wrong.
Whether a view is "queryable" or not derives from whether it is possible
to find indexes for all of the view's inputs. We were previously
denormalizing this information in the coordinator and keeping it up to
date by manually propagating querability when new indexes were created.

There were two big downsides to this approach:

  * The logic was a bit brittle and error prone. Modifications to the
    coordinator could easily result in missetting the queryability state,
    which caused hard to track down bugs.

  * The metadata only existed in the coordinator, and so `SHOW VIEWS`
    and `SHOW SOURCES` had to be special-cased in order to be computed
    in the coordinator. Besides being a bit annoying, this was about to
    impede #913 (mirroring `SHOW` data as views), and prevented using
    `SHOW {SOURCES,VIEWS} ... WHERE`.

The new approach simply computes the queryability property on demand
using information that is available in the catalog, resolving both of
the above downsides.

One potential downside of the new approach is that `SHOW VIEWS` has to
do more work to compute queryability, but the performance of `SHOW
VIEWS` is not particularly important. The hot path (`SELECT`) is
probably unaffected by this patch, since computing queryability is done
concurrently with finding the nearest indexes for the view.
The information contained within ViewState is easily derived from the
catalog. Removing it avoids a lot of superfluous bookkeeping.
Computing whether a view depends on a table is easy enough to do on the
fly, and is also more correct, because it properly captures transitive
dependencies.
Doesn't seem like there is any good reason to have these separate.
This cuts down on the size of coord.rs, which is still too large.
This relieves some more pressure on coord.rs.
There are no behavioral changes in this commit. The changes include:

  * Using the "handle_XXX" terminology consistently to refer to the
    top-level handler function for a coordinator command.

  * Colocating all the hander functions near the top of the coordinator
    impl block.

  * Removing unnecessary `pub` designations from a number of
    methods/functions, and removing some resulting dead code.
Reconfiguring the APIs for building and shipping DataflowDescs results
in much simpler call sites, and hopefully easier-to-understand mutation
patterns. In the new API:

  * `build...dataflow` methods do not mutate any coordinator state,
    and return a DataflowDesc.
  * `import...into_dataflow` methods do not mutate any coordinator
    state, but modify the provided `DataflowDesc`.
  * neither the build nor import APIs take `catalog::{Item,View}`, as
    these catalog types are unnecessarily hard to construct for
    transient dataflows.
  * the `ship_dataflow` method is the only method that mutates
    coordinator state based on the provided `DataflowDesc`. It also
    broadcasts that `DataflowDesc` to the dataflow workers.

This isn't perfect, but it should make things much easier to reason
about, because the construction of the dataflow description is better
separated from updating coordinator state.
*complete = false;
}
CatalogItem::Table(_) => {
unreachable!("tables always have at least one index");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone manually drops the index?

materialize=> CREATE TABLE t (a int, b text NOT NULL);
CREATE TABLE
materialize=> SHOW INDEXES FROM t;
    Source_or_view    |             Key_name             | Column_name | Expression | Null | Seq_in_index
----------------------+----------------------------------+-------------+------------+------+--------------
 materialize.public.t | materialize.public.t_primary_idx | a           |            | t    |            1
 materialize.public.t | materialize.public.t_primary_idx | b           |            | f    |            2
(2 rows)

materialize=> drop index materialize.public.t_primary_idx;
DROP INDEX
materialize=> SHOW INDEXES FROM t;
 Source_or_view | Key_name | Column_name | Expression | Null | Seq_in_index
----------------+----------+-------------+------------+------+--------------
(0 rows)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, totally! (See #3905.) I think we just need to disable dropping the default index on tables, since it's really unusual that you could drop an index on a table and accidentally delete all the data in it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense! In the meantime, this isn't actually unreachable!() though. Should we return an error instead and put a todo? Or are we okay with this panic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with it for just a moment since a) tables are experimental, and b) I'll have a PR out by end of day! Thanks for raising, though! I won't forget!

@@ -116,8 +116,6 @@ where
optimizer: Optimizer,
catalog: Catalog,
symbiosis: Option<symbiosis::Postgres>,
/// Maps (global Id of view) -> (existing indexes)
views: HashMap<GlobalId, ViewState>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@@ -1356,7 +1344,7 @@ where
typ: _,
} = source.as_ref()
{
if let Some(Some((index_id, _))) = self.views.get(&id).map(|v| &v.default_idx) {
if let Some((index_id, _)) = self.catalog.indexes()[&id].first() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why first() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just the convention for the "default" index. But you're right that that's very unclear. I added a Catalog::default_index_for method that makes things read more nicely.

@ruchirK
Copy link
Contributor

ruchirK commented Aug 19, 2020

I skimmed through these changes but everything looks great to me!

@ruchirK
Copy link
Contributor

ruchirK commented Aug 19, 2020

I definitely appreciated the thoughtful commits and commit messages btw :D

Copy link
Contributor

@wangandi wangandi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Arrangement state going into its own file and queryability being in the catalog has been long overdue.

if dataflow.objects_to_build.iter().any(|bd| &bd.id == id)
|| dataflow.source_imports.iter().any(|(i, _)| i == id)
{
return;
}
if let Some((index_id, keys)) = &self.views[id].default_idx {
// A valid index is any index on `id` that is known to the dataflow
// layer, as indicated by its presence in `self.indexes`.s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra s at end of sentence

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will fix in a follow up to avoid going through CI again.

@benesch benesch merged commit 3623229 into MaterializeInc:main Aug 19, 2020
@benesch benesch deleted the coord-refactor branch August 19, 2020 19:57
benesch added a commit to benesch/materialize that referenced this pull request Aug 19, 2020
benesch added a commit to benesch/materialize that referenced this pull request Aug 19, 2020
benesch added a commit to benesch/materialize that referenced this pull request Aug 20, 2020
benesch added a commit to benesch/materialize that referenced this pull request Aug 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants