Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for row filtering #21913

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

JaneHang9305
Copy link

@JaneHang9305 JaneHang9305 commented Feb 13, 2024

Cherry-pick of trinodb/trino@7efb49c
Co-authored-by: Martin Traverso

Description

SystemAccessControl/AccessControl to provide filters/projections on the base table given an authorization identifier (user/role) that the query planner injects into the plan.

Motivation and Context

Presto should support row-level filtering and data masking for more granular access controls to data.
#20572
trinodb/trino#1480

Impact

There is no performance impact on users unless onboarding the rules.

Test Plan

Test on local.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* ...
* ...

Hive Changes
* ...
* ...

If release note is NOT required, use:

== NO RELEASE NOTE ==

@JaneHang9305 JaneHang9305 marked this pull request as ready for review February 13, 2024 03:23
@JaneHang9305 JaneHang9305 requested a review from a team as a code owner February 13, 2024 03:24
@jaystarshot
Copy link
Member

@JaneHang9305 You need to add a coauthor in the commit ref- https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#attribution.
@tdcmeehan @imjalpreet @agrawalreetika This has been a stale issue since a while now. Should we just discuss this trino approach in a doc/wg meeting and proceed with merging/path forward?

@tdcmeehan tdcmeehan self-assigned this Apr 9, 2024
@JaneHang9305 JaneHang9305 marked this pull request as draft April 10, 2024 20:37
@tdcmeehan
Copy link
Contributor

It seems like this filter is specified on a per-connector level. If that's the case, why not just use a connector rewrite to provide the filter?

@JaneHang9305
Copy link
Author

It seems like this filter is specified on a per-connector level. If that's the case, why not just use a connector rewrite to provide the filter?

This is a unified solution for all connectors, and each connector can have its own rules for filtering.

@tdcmeehan
Copy link
Contributor

Could we achieve a unified solution through a library, which provides a connector rewrite rule that the connector can add in to add support for row level filtering?

Originally I thought an approach like this might be a good idea, however I think while there are costs to the connector rewrite framework, it also has enormous benefits in the flexibility it allows and the ability to make changes that are just cumbersome to do purely in the engine. So I'm wondering, if the primary motivation is to use one unified approach, can this be accomplished through composition--connectors compose a library which adds a rewrite rule which provides a rule level filter? That way they can be shared among Hive, Hudi, Iceberg and other connectors easily.

@tdcmeehan
Copy link
Contributor

tdcmeehan commented Apr 11, 2024

The same argument from a different angle: in what way is our existing SPI footprint insufficient to allow for connector-specified row level filtering? Row level filtering is listed as one of the motivations for connector rewrites--in what way does this improve on that? If it's a narrower SPI footprint, can this be achieved through composition as mentioned above? And does it really make sense to add additional SPI hooks if we already have this capability?

@jaystarshot
Copy link
Member

jaystarshot commented Apr 12, 2024

I do agree that it would be better if we have a smaller SPI footprint. However, I don't think the process of adding the filter
itself is connector specific (we could also have added a common plan rule directly in the planning layers) but rather the rule level filters that connectors need to provide which are connector specific. It's like ConnectorIdentity which is connector specific.
Furthermore, we already have a package for security com.facebook.presto.spi.security and it might be better to centralize all security-connector updates from one location.

@tdcmeehan
Copy link
Contributor

Let me know if I'm understanding this correctly.

For a table scan, during analysis, we we check the table we want to filter, and call an SPI hook on the Connector to ask for its row filters. Then we'll process and validate these filters, and put them back into the analysis. Then, during planning, we'll check if any row level filters were specified, and add them to as filters to the table scan node. Filtering accomplished.

It seems like one argument is this is the place where it ought to go, because it's security related, and security related things ought to be centralized in one place. But I would argue that connector plan rewrites are a conscious choice about splitting where things ought to go: certain things are global, and ought to be centralized in the engine, and certain things can only be inferred by the context of the connector, and can be pushed down entirely to the connector. An example is advanced filter pushdown: while this is common in Hive-like systems, other systems treat this all very differently, and there's not a whole lot of benefit for centralizing this in Presto. Likewise, because the rules which specify what rows to filter out are connector-specific, I just don't see the benefit to centralizing this and adding a larger SPI footprint when we already have the ability to do this entirely in the connector.

@jaystarshot
Copy link
Member

Yes that's correct, however this is not an advanced filter pushdown where we need any connector-specific info regarding the tableScans. I think this pushdown could be handled in presto-main itself. Replicating the pushdown across all connectors might simply lead to code duplication. Additionally, it might make it challenging for administrators to implement central access policies across all connectors/catalogs

@tdcmeehan
Copy link
Contributor

I don't understand how this PR solves the problem of making central access policies less challenging. The policies must come from the connector, correct?

@jaystarshot
Copy link
Member

jaystarshot commented Apr 12, 2024

Yes but they follow the same interface I guess? Maybe it doesn't make central access policies that easier, not sure. But since the filter logic is in presto-main now (lets say to avoid duplication) we would need something in the spi for the connector to communicate with the engine?

@jaystarshot
Copy link
Member

jaystarshot commented Apr 12, 2024

So I guess the question is whether this filter pushdown deserves its own connector specific optimizers or is this approach perfectly fine to avoid duplication

@tdcmeehan
Copy link
Contributor

tdcmeehan commented Apr 15, 2024

Let's consider the approaches by analogy. By analogy, the approach in this PR is like object-oriented inheritance. There is base functionality, which is applying the filters, but there are abstract methods, which are the SPI hooks. Similar to object oriented inheritance, you must know how the parent class uses the abstract methods in order to implement them effectively. It seems similar by analogy to a parent class with base functionality, and abstract methods that child classes may implement to use this functionality. There is indeed code reuse here.

Let us consider the connector rewrite rule approach to be more like composition. In this approach, you need not know the parent interface rules. Instead, you compose a new behavior, which defines privacy filters based on arbitrary criteria, and pushes them to table scan nodes. This seems like composition--a set or list of objects that provide behaviors that can be mixed-in to the class. Code reuse can be achieved by making it so that behaviors that are mixed-in can themselves be subclassed or provide composition of behaviors of their own.

The outcome seems the same between both. Because we've embarked on the composition path, I am wary of doing this through the inheritance path, since we already have the ability to do this with composition. The code reuse argument seems to be able to be solved by just making the rewrite a library that connectors mix-in/compose to existing rewrites (which seems no less onerous than connectors needing to self-specify the filters that the engine should add to provide the row-level filtering).

@jaystarshot
Copy link
Member

Sounds good, yeah connector optimizers will be the better approach for presto. If they need to be "improved" like for eg: connector optimizers for relational plan, that functionality can be added i guess

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants