
[Iceberg] Add manifest file caching for HMS-based deployments #24481

Open · wants to merge 1 commit into master from upstream-iceberg-manifest-caching

Conversation

@ZacBlanco (Contributor) commented Feb 3, 2025

Description

Adds manifest file caching to the Iceberg connector for HMS-based deployments.

Motivation and Context

To optimize and plan Iceberg queries, Presto calls the planFiles() API multiple times throughout the query optimization lifecycle. Each call requires reading and parsing manifest and other metadata files, which usually live on an external filesystem such as S3. A large table can have hundreds of these files, each typically ranging from a few kilobytes to a few megabytes in size. When they are not cached in memory within Presto, the repeated remote reads can significantly degrade end-to-end query latency.
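
To make the technique concrete, here is a minimal, hedged sketch (not this PR's implementation) of memoizing manifest reads in a weight-bounded Guava cache keyed by file path. The ManifestByteCache class, the readManifestBytes helper, and the 128 MB budget are illustrative assumptions; the PR's actual types appear in the diff excerpts below.

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.Weigher;

import java.util.concurrent.ExecutionException;

// Illustrative sketch only; names and the size budget are assumptions, not the PR's code.
public class ManifestByteCache
{
    // Weight-bounded cache: the cap applies to total cached bytes, not the entry count.
    private final Cache<String, byte[]> cache = CacheBuilder.newBuilder()
            .maximumWeight(128L * 1024 * 1024) // assumed 128 MB budget
            .weigher((Weigher<String, byte[]>) (path, bytes) -> bytes.length)
            .recordStats()
            .build();

    public byte[] getManifest(String path)
            throws ExecutionException
    {
        // Repeated planFiles() calls hit memory; the filesystem is read only on a miss.
        return cache.get(path, () -> readManifestBytes(path));
    }

    private byte[] readManifestBytes(String path)
    {
        // Placeholder for the actual remote read (e.g. from S3).
        throw new UnsupportedOperationException("illustrative stub");
    }
}

With something like this in place, only the first read of a given manifest pays the remote I/O cost; later lookups within the cache budget are served from memory.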

Impact

TBD

Test Plan

TBD

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Added manifest file caching for deployments that use the Hive Metastore.

@prestodb-ci added the from:IBM (PR from IBM) label on Feb 3, 2025
@ZacBlanco force-pushed the upstream-iceberg-manifest-caching branch from 7db7896 to 666d248 on February 5, 2025 at 16:55
@ZacBlanco ZacBlanco marked this pull request as ready for review February 5, 2025 18:09
@ZacBlanco ZacBlanco requested review from hantangwangd and a team as code owners February 5, 2025 18:09
@ZacBlanco ZacBlanco requested a review from jaystarshot February 5, 2025 18:09
@jaystarshot (Member) left a comment


Sorry, I may not have the correct context on this, but is it possible to add some tests too?

@@ -67,7 +67,7 @@ public class IcebergConfig

     private EnumSet<ColumnStatisticType> hiveStatisticsMergeFlags = EnumSet.noneOf(ColumnStatisticType.class);
     private String fileIOImpl = HadoopFileIO.class.getName();
-    private boolean manifestCachingEnabled;
+    private boolean manifestCachingEnabled = true;
Member

Is this intended?

@ZacBlanco (Contributor, Author) commented Feb 6, 2025

Yes, this is intentional. Performance is significantly worse with it disabled, and I don't think there are any known downsides to enabling it by default other than an increased memory footprint.
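
For context, a hedged sketch of how an enabled-by-default toggle like this is typically declared in a Presto config class; the property name, description, and IcebergConfigSketch class below are illustrative assumptions, not necessarily what this PR uses.

import com.facebook.airlift.configuration.Config;
import com.facebook.airlift.configuration.ConfigDescription;

// Illustrative sketch of the config pattern; the property name is an assumption.
public class IcebergConfigSketch
{
    private boolean manifestCachingEnabled = true; // enabled by default, as in the diff above

    public boolean getManifestCachingEnabled()
    {
        return manifestCachingEnabled;
    }

    @Config("iceberg.manifest-caching-enabled") // hypothetical property name
    @ConfigDescription("Cache manifest file contents in memory to avoid repeated remote reads")
    public IcebergConfigSketch setManifestCachingEnabled(boolean manifestCachingEnabled)
    {
        this.manifestCachingEnabled = manifestCachingEnabled;
        return this;
    }
}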

public ManifestFileCache createManifestFileCache(IcebergConfig config, MBeanExporter exporter)
{
Cache<ManifestFileCacheKey, ManifestFileCachedContent> delegate = CacheBuilder.newBuilder()
.maximumWeight(config.getManifestCachingEnabled() ? config.getMaxManifestCacheSize() : 0)
Member

If caching is disabled, I think we should not create a cache at all, rather than disabling it by setting a maximum weight of 0.
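
One possible shape of that suggestion, sketched under stated assumptions: ManifestFileCacheKey and ManifestFileCachedContent are the types from the diff above, while the ManifestFileCache.noop() factory, the ManifestFileCache(Cache) constructor, and the getLength() accessor are hypothetical, not code from this PR. The sketch assumes com.google.common.cache.Cache, CacheBuilder, and Weigher are imported.

// Hedged sketch of the reviewer's suggestion, not the PR's actual change.
public ManifestFileCache createManifestFileCache(IcebergConfig config, MBeanExporter exporter)
{
    if (!config.getManifestCachingEnabled()) {
        // Hypothetical no-op factory: lookups always miss and inserts are discarded,
        // so no Guava cache (and no eviction bookkeeping) exists when caching is off.
        return ManifestFileCache.noop();
    }
    Cache<ManifestFileCacheKey, ManifestFileCachedContent> delegate = CacheBuilder.newBuilder()
            .maximumWeight(config.getMaxManifestCacheSize())
            .weigher((Weigher<ManifestFileCacheKey, ManifestFileCachedContent>)
                    (key, content) -> (int) content.getLength()) // getLength() is assumed
            .recordStats()
            .build();
    // The exporter would presumably publish cache stats via JMX in the real code; omitted here.
    return new ManifestFileCache(delegate);
}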
