-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TPC-DS SF-1K Q72 extremely slow in Iceberg #22148
Comments
In my experiments, q72's running time would be greatly shorten after executing |
The Iceberg plan is very different than the Hive one. The Hive one was a bushy tree, while the Iceberg one is a left deep tree. This may result in the way the stats is used is different? @yzhang1991 Are you sure the Iceberg tables stats were collected? |
@yingsu00 Yes, but I found the stats we got from analyzing the hive tables and the iceberg tables are different. See the two screenshots below showing the stats for the same table in the same scale factor but one in the hive catalog one in the iceberg catalog: |
I think the issue here is with the setup - the iceberg connector employs a For this, the relevant code path for statistics load, merges the stats from HMS and Iceberg using the By changing this property to
We will need to do this (set flag, re-analyze) for all tables in the schema |
Re: support for NDVs in Iceberg+Hive. By default the When reading the statistics from an Iceberg table that uses Hive, statistics can come from the table itself or from the Hive metastore. By default, we only use statistics from the Iceberg table when reading because values like min/max and null counts are supposed to be more accurate as the manifests should always accurately reflect the values. However, NDVs and null counts are not always necessarily the as accurate from the iceberg table, so we provide the option to override the stats coming from the iceberg table with stats from the HMS. To override the behavior when retrieving stats you can set the property connector session property You can set it to the following values:
Since the HMS should already have the stats from the analyze before, you shouldn't need to re-analyze the tables. |
Adding on to this investigation. I confirmed that after analyzing on the Iceberg table and running It seems when this issue was filed the statistics for the iceberg versions of the tables were also not yet in the HMS. So |
@ZacBlanco quick unrelated question: when would null fractions ever be inaccurate when they come from Iceberg? From my understanding of the spec, manifest files store the absolute count of nulls, so shouldn't the fraction be mergeable and hence completely accurate solely from the manifest file information? |
Confirmed that this issue goes away after applying the session property in run |
@ZacBlanco Can we test this hypothesis? I recall I analyzed a table, set the right feature flag, but could not get the NDVs to show up without doing a re-analyze. If we can add this as a integration test, that would be ideal |
On Java clusters
Java Run Comparision Dashboard, Iceberg vs. Hive
It took 1.04 hours to finish on the Iceberg schema and only 2 mins on the Hive schema.
Query Comparison Dashboard, Iceberg vs. Hive
On Prestissimo clusters
Prestissimo Run Comparision Dashboard, Iceberg vs. Hive
We killed the query on Prestissimo Iceberg schema after 6 hours. On Prestissimo Hive it only took 1.31 mins.
Query Comparison Dashboard, Iceberg vs. Hive
The text was updated successfully, but these errors were encountered: