Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

coord: don't track view uses separately from catalog #3979

Closed
wants to merge 9 commits into from

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Aug 18, 2020

We were duplicating the dependency information in both the coordinator
metadata and the catalog. This added complexity and the possibility of
the metadata getting out of sync. Rework so that the coordinator just
uses the view dependency information from the catalog, rather than
duplicating it.

Note that the comment that said "Only views, not sources, on which the
views depends" appears to have been entirely wrong.


This change is Reviewable

We were duplicating the dependency information in both the coordinator
metadata and the catalog. This added complexity and the possibility of
the metadata getting out of sync. Rework so that the coordinator just
uses the view dependency information from the catalog, rather than
duplicating it.

Note that the comment that said "Only views, not sources, on which the
views depends" appears to have been entirely wrong.
@benesch benesch force-pushed the view-uses branch 2 times, most recently from 415699e to 90dbd46 Compare August 19, 2020 04:38
Whether a view is "queryable" or not derives from whether it is possible
to find indexes for all of the view's inputs. We were previously
denormalizing this information in the coordinator and keeping it up to
date by manually propagating querability when new indexes were created.

There were two big downsides to this approach:

  * The logic was a bit brittle and error prone. Modifications to the
    coordinator could easily result in missetting the queryability state,
    which caused hard to track down bugs.

  * The metadata only existed in the coordinator, and so `SHOW VIEWS`
    and `SHOW SOURCES` had to be special-cased in order to be computed
    in the coordinator. Besides being a bit annoying, this was about to
    impede #913 (mirroring `SHOW` data as views), and prevented using
    `SHOW {SOURCES,VIEWS} ... WHERE`.

The new approach simply computes the queryability property on demand
using information that is available in the catalog, resolving both of
the above downsides.

One potential downside of the new approach is that `SHOW VIEWS` has to
do more work to compute queryability, but the performance of `SHOW
VIEWS` is not particularly important. The hot path (`SELECT`) is
probably unaffected by this patch, since computing queryability is done
concurrently with finding the nearest indexes for the view.
The information contained within ViewState is easily derived from the
catalog. Removing it avoids a lot of superfluous bookkeeping.
Computing whether a view depends on a table is easy enough to do on the
fly, and is also more correct, because it properly captures transitive
dependencies.
Doesn't seem like there is any good reason to have these separate.
This cuts down on the size of coord.rs, which is still too large.
This relieves some more pressure on coord.rs.
There are no behavioral changes in this commit. The changes include:

  * Using the "handle_XXX" terminology consistently to refer to the
    top-level handler function for a coordinator command.

  * Colocating all the hander functions near the top of the coordinator
    impl block.

  * Removing unnecessary `pub` designations from a number of
    methods/functions, and removing some resulting dead code.
Reconfiguring the APIs for building and shipping DataflowDescs results
in much simpler call sites, and hopefully easier-to-understand mutation
patterns. In the new API:

  * `build...dataflow` methods do not mutate any coordinator state,
    and return a DataflowDesc.
  * `import...into_dataflow` methods do not mutate any coordinator
    state, but modify the provided `DataflowDesc`.
  * neither the build nor import APIs take `catalog::{Item,View}`, as
    these catalog types are unnecessarily hard to construct for
    transient dataflows.
  * the `ship_dataflow` method is the only method that mutates
    coordinator state based on the provided `DataflowDesc`. It also
    broadcasts that `DataflowDesc` to the dataflow workers.

This isn't perfect, but it should make things much easier to reason
about, because the construction of the dataflow description is better
separated from updating coordinator state.
@benesch
Copy link
Contributor Author

benesch commented Aug 19, 2020

Superseded by #3982.

@benesch benesch closed this Aug 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant