Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Iceberg] Write NDVs to Iceberg table format #20993

Merged
merged 1 commit into from
Nov 15, 2023

Conversation

ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Sep 28, 2023

Description

This PR introduces the changes required to read+write the distinct value count statistics as described by Iceberg's Puffin file specification 1.

The change can be broken down into three main parts.

  1. Updates to the SPI to allow connectors to define the function used to calculate a specific statistic.
  2. The addition of 3 new functions: sketch_theta, sketch_theta_estimate, and sketch_theta_summary.
  3. Plumbing and implementation in the Iceberg connector to support reading+writing of the NDVs

SPI Changes

The SPI changes are minimal, but it involves adding a new field to the ColumnStatisticMetadata class in order to
allow connectors to override the functions used during ANALYZE. Previously, the functions for each ColumnStatisticType were defined statically in a switch block inside of presto-main. The logic in the switch block
checked the statistic type and then mapped that to a function which executed.

Instead, this PR introduces a change where each statistic has a default function of which the name is associated
in the ColumnStatisticType enum. However, the change here adds a field in the ColumnStatisticMetadata for a
function name. For most connectors that implement analyze, they can easily generate a ColumnStatisticMetadata
instance using the default function, however, we will now also allow a connector to create a ColumnStatisticMetadata
instance which has a custom function name that can be defined by the connector. That function is then eventually
resolved by the coordinator and executed during the ANALYZE, and the result of the function is returned to the
connector in the finishStatisticsCollection function as a Block type which can be read by connector-defined
logic. This design keeps all of the logic for defining and parsing the result up to the connector, so no additional changes
need to be made to the ANALYZE design other than a minor refactor.

New Sketch Functions

Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary
blob representing an Apache Datasketches Theta Sketch 2. I've added Apache Datasketches as a presto dependency
and implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing
statistics.

A brief note on naming: I presume in the future we will need support for KLL sketches from the same library, as the latest Hive 4.0 beta release defines its histograms format as a KLL sketch, so to make it easier to find the "sketch" families of function when sorted alphabetically, I've decided to use the naming convention sketch_*

  1. sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary
    format to gain access to a CompactSketch instance and associated methods.
  2. sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output
  3. sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): This is another scalar function, but returns a row type containing
    more human-readable information about the sketch such as the theta parameter as well as upper and lower bounds
    for 1 standard deviation from the estimate

Iceberg Read/Write Support

Lastly, this change includes the additional plumbing required to read and write NDVs within the Iceberg connector's
TableStatisticsMaker class.

On the statistics write path inside of the IcebergAbstractMetadata::finishStatisticsCollection
We add a call to writeTableStatistics. This will take all of the collected statistics data and write out the theta
sketches collected into blobs within a puffin file, then the table metadata is updated with a pointer to the puffin file.

On the read path, we now check for the most up-to-date statistics file, pick that file, and read all of the blobs' metadata.
For each blob we check the "ndv" property (see the puffin spec1) and associated fieldId, and add it to a map of field -> ColumnStatistics.
The code is structured in a way that should make it easy to read/look for more statistics in the file as they are added
to the Puffin spec. After reading the statistic file data, the resulting map of column statistics is used as a "base" which
the additional existing statistics can then be added to, such as the min, max, etc.

Motivation and Context

In order for Presto's optimizer to do its job more effectively, statistics are useful to determine the resulting size of joins or predicate filters. By having them available, the optimizer can make more informed decisions.

Impact

  • 3 new functions available in the Presto default function namespace
  • ANALYZE available to the Iceberg regardless of metastore

Test Plan

  • existing tests should cover SPI changes.
  • Tests added for new sketch functions
  • Tests added for ANALYZE and getTableStatistics

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* The Iceberg connector will now read+write distinct value counts as defined by the spec.
* Functions to compute Theta sketches from the Apache DataSketches library are now available

Footnotes

  1. https://iceberg.apache.org/puffin-spec/ 2

  2. https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch 2 times, most recently from aa1d1e6 to 3187bc4 Compare September 29, 2023 05:40
@ZacBlanco ZacBlanco marked this pull request as ready for review September 29, 2023 05:40
@ZacBlanco ZacBlanco requested a review from a team as a code owner September 29, 2023 05:40
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 3187bc4 to 099d0f4 Compare September 29, 2023 05:57
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 099d0f4 to aa91d9d Compare September 29, 2023 23:13
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch 3 times, most recently from 94866b8 to 3f480a3 Compare October 9, 2023 17:12
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 3f480a3 to edad688 Compare October 11, 2023 21:31
@ZacBlanco ZacBlanco added the iceberg Apache Iceberg related label Oct 13, 2023
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from edad688 to 96b1890 Compare October 13, 2023 19:45
Copy link
Contributor

@ClarenceThreepwood ClarenceThreepwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Mostly nits + more tests would be good


public static void input(Type type, SketchAggregationState<Union> state, Block block, int position)
{
if (block.isNull(position)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may need to override isCalledOnNullInput

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default of isCalledOnNullInput is false, but I put this here as a defensive measure. I wasn't aware of any contract in the calling convention. I guess if isCalledOnNullInput takes care of it then I can probably remove it.

Copy link
Contributor

@ClarenceThreepwood ClarenceThreepwood Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contract for isCalledOnNullInput is described here. It doesn't change the behavior of the function, but is instead a directive that is used by the optimizer for certain rewrites

presto> select sketch_theta(cast(null as varbinary));
_col0


01 03 03 00 00 1e 00 00
(1 row)

presto> select sketch_theta(null);
_col0


01 03 03 00 00 1e 00 00
(1 row)

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 96b1890 to 29757cf Compare October 16, 2023 18:14
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Just some small nits.

Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need to review the tests; have taken a look on the wireup E2E and the theta sketch functions

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch 6 times, most recently from cc1ed2d to b98c58d Compare October 30, 2023 20:46
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Only one small problem in test case.

@tdcmeehan tdcmeehan self-assigned this Oct 31, 2023
@ZacBlanco
Copy link
Contributor Author

@steveburnett can you take a look at the additional docs I added in f83256c54dd811827fb30226b23f95eb585a705c? Thanks!

Copy link

github-actions bot commented Oct 31, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 541d7ec...3f5ee7f.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/functions.rst
presto-docs/src/main/sphinx/functions/sketch.rst

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 4345aa7 to 4b94fb3 Compare October 31, 2023 17:03
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the job, it's a very useful feature.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 4b94fb3 to 37a8e5c Compare October 31, 2023 18:06
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job on the docs! Only one small nit.

presto-docs/src/main/sphinx/functions/sketch.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

New local build shows the updated doc. Thanks!

This PR introduces the changes required to read+write the
distinct value count statistics as described by Iceberg's Puffin
file specification[[1]].

The change can be broken down into three main parts.

    - Updates to the SPI to allow connectors to define the
    function used to calculate a specific statistic.
    - The addition of 3 new functions: sketch_theta,
    sketch_theta_estimate, and sketch_theta_summary.
    - Plumbing and implementation in the Iceberg connector to
    support reading and writing of the NDVs

[1]: https://iceberg.apache.org/puffin-spec/
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-ndv-write branch from 99f529a to 3f5ee7f Compare November 3, 2023 22:20
@liushengxuan
Copy link

@beinan Thanks for inviting!!

LGTM. Thanks for the contribution!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
iceberg Apache Iceberg related
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants