Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add range filtering support for iceberg ingestion #15782

Merged
merged 4 commits into from
Feb 2, 2024

Conversation

a2l007
Copy link
Contributor

@a2l007 a2l007 commented Jan 29, 2024

Description

Adds a new iceberg ingestion filter of type range which can be used to filter on ranges of column values. It supports lower and/or upper bounds with < , <= , > , and >=

Example spec:

               "icebergFilter": {
                  "type" : "range",
                  "filterColumn" : "price",
                  "upper": 50,
                  "lower": 45,
                  "lowerOpen": true
                  "upperOpen": false
                },

which is equivalent to 45 < price <= 50

This PR also fixes a dependency related issue which may be encountered while talking to Kerberos secured HDFS warehouses.

Release note

Adds a new iceberg ingestion filter of type range which can be used to filter on ranges of column values.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! I've left a few minor comments. Also, the docs static CI checks are failing.


|Property|Description|Required|
|--------|-----------|---------|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|--------|-----------|---------|
|--------|-----------|---------|

@JsonProperty("upperOpen") @Nullable Boolean upperOpen
)
{
Preconditions.checkNotNull(filterColumn, "You must specify a filter column on the range filter");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using DruidException for these validation checks:

Suggested change
Preconditions.checkNotNull(filterColumn, "You must specify a filter column on the range filter");
if (filterColumn == null) {
throw InvalidInput.exception("filterColumn cannot be null.");
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can refactor this in a separate PR since the other filters also need to start using DruidException

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me, thanks!

)
{
Preconditions.checkNotNull(filterColumn, "You must specify a filter column on the range filter");
Preconditions.checkArgument(lower != null || upper != null, "Both lower and upper bounds cannot be empty");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Preconditions.checkArgument(lower != null || upper != null, "Both lower and upper bounds cannot be empty");
if (lower == null && upper == null) {
throw InvalidInput.exception("Both lower and upper bounds cannot be null.");
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@Override
public Expression getFilterExpression()
{
List<Expression> expressions = new ArrayList<>(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: since one of lower and upper can be null

Suggested change
List<Expression> expressions = new ArrayList<>(2);
List<Expression> expressions = new ArrayList<>();

@JsonSubTypes.Type(name = "or", value = IcebergOrFilter.class)
@JsonSubTypes.Type(name = "or", value = IcebergOrFilter.class),
@JsonSubTypes.Type(name = "range", value = IcebergRangeFilter.class)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Preconditions.checkNotNull(filterColumn, "You must specify a filter column on the range filter");
Preconditions.checkArgument(lower != null || upper != null, "Both lower and upper bounds cannot be empty");
this.filterColumn = filterColumn;
this.lowerOpen = lowerOpen != null ? lowerOpen : false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we also add checks to validate if lowerOpen is specified if lower is null as it doesn't make sense? Ditto for upperOpen

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be checked, but are you suggesting that the ingest should fail in that scenario?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think so. I will leave it up to you though if the stricter validation is required

|filterColumn|The column name from the iceberg table schema based on which range filtering needs to happen.|yes|
|lower|Lower bound value to match.|no. At least one of `lower` or `upper` must not be null.|
|upper|Upper bound value to match. |no. At least one of `lower` or `upper` must not be null.|
|lowerOpen|Boolean indicating if lower bound is open in the interval of values defined by the range (">" instead of ">="). |no|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would also be good to specify the defaults explicitly via a separate column or next to the "no" in the required column

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 after CI. Thanks, @a2l007!

@abhishekrb19 abhishekrb19 merged commit 2e46a98 into apache:master Feb 2, 2024
83 checks passed
@abhishekrb19 abhishekrb19 added this to the 29.0.0 milestone Feb 2, 2024
abhishekrb19 pushed a commit to abhishekrb19/incubator-druid that referenced this pull request Feb 2, 2024
* Add range filtering support for iceberg ingestion

* Docs formatting

* Spelling
abhishekrb19 pushed a commit that referenced this pull request Feb 2, 2024
* Add range filtering support for iceberg ingestion

* Docs formatting

* Spelling
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants