
Make max slice size in ORC slice reader configurable #24202

Merged: 1 commit merged into prestodb:master on Dec 6, 2024

Conversation

@sdruzkin (Collaborator) commented Dec 5, 2024

Description

The ORC slice reader has a hardcoded max slice size of 1GB and throws when one attempts to read a slice larger than that. Make the limit configurable so the threshold can be increased in Spark for some failing jobs.

The new value is plumbed through OrcReaderOptions -> OrcRecordReaderOptions -> OrcReader -> SelectiveReaderContext -> SliceDirectSelectiveStreamReader.
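
As a rough illustration only (the class and method names below are simplified placeholders, not the actual Presto classes), the plumbing amounts to carrying one extra value through the options and context objects and checking it in the stream reader:

    // Minimal sketch; all names are hypothetical stand-ins for the real
    // OrcReaderOptions / SelectiveReaderContext / SliceDirectSelectiveStreamReader.
    class OrcReaderOptionsSketch
    {
        private final long maxSliceSizeBytes;

        OrcReaderOptionsSketch(long maxSliceSizeBytes)
        {
            this.maxSliceSizeBytes = maxSliceSizeBytes;
        }

        long getMaxSliceSizeBytes()
        {
            return maxSliceSizeBytes;
        }
    }

    // Context handed to the selective stream readers; carries the threshold down.
    class SelectiveReaderContextSketch
    {
        private final long maxSliceSizeBytes;

        SelectiveReaderContextSketch(OrcReaderOptionsSketch options)
        {
            this.maxSliceSizeBytes = options.getMaxSliceSizeBytes();
        }

        long getMaxSliceSizeBytes()
        {
            return maxSliceSizeBytes;
        }
    }

    // The stream reader compares the requested slice size against the configured
    // limit instead of a hardcoded 1GB constant, and throws when it is exceeded.
    class SliceSizeGuardSketch
    {
        static void checkSliceSize(long requestedBytes, SelectiveReaderContextSketch context)
        {
            if (requestedBytes > context.getMaxSliceSizeBytes()) {
                throw new IllegalStateException(
                        "Slice of " + requestedBytes + " bytes exceeds max of "
                                + context.getMaxSliceSizeBytes() + " bytes");
            }
        }
    }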

Motivation and Context

The hardcoded value of 1GB is too low for some files and needs to be increased to accommodate them.

Impact

No impact.

Test Plan

Existing and new unit tests.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== NO RELEASE NOTE ==

@sdruzkin sdruzkin requested a review from a team as a code owner December 5, 2024 04:23
@sdruzkin sdruzkin requested a review from presto-oss December 5, 2024 04:23
@facebook-github-bot (Collaborator)

This pull request was exported from Phabricator. Differential Revision: D66800897

sdruzkin added a commit to sdruzkin/presto that referenced this pull request Dec 5, 2024
Summary:

Make max slice size in ORC slice reader configurable to be able to increase the threshold in Spark for failing Data Mine jobs.

Differential Revision: D66800897

@rschlussel (Contributor)

why do we have this limit in the first place? Is the problem that we aren't reserving the memory for the slice before we read?

@steveburnett (Contributor)

If this is a new configuration (or session) property, please add documentation in the appropriate pages of the Presto doc:

Presto Session Properties
Presto Configuration Properties
Presto C++ Session Properties
Presto C++ Configuration Properties

@sdruzkin (Collaborator, Author) commented Dec 6, 2024

why do we have this limit in the first place? Is the problem that we aren't reserving the memory for the slice before we read?

I guess so; the ORC memory context does not do any good with memory reservation. The limit was added sometime before 2021-2022, when a typical cluster had under 30GB of total memory and very little headroom.

If this is a new configuration (or session) property, please add documentation in the appropriate pages of the Presto doc:

This setting cannot be configured through the session or cluster properties.
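
Since the threshold is not exposed as a session or configuration property, it would have to be raised in code where the reader options are built. Continuing the hypothetical sketch from the description above (again, placeholder names, not the actual Presto API):

    // Hypothetical wiring: raise the limit to 2GB when constructing the reader
    // options, so a slice above the old 1GB default no longer makes the guard throw.
    class ConfigureMaxSliceSizeSketch
    {
        public static void main(String[] args)
        {
            OrcReaderOptionsSketch options = new OrcReaderOptionsSketch(2L * 1024 * 1024 * 1024);
            SelectiveReaderContextSketch context = new SelectiveReaderContextSketch(options);

            // A 1.5GB slice would have thrown with the hardcoded 1GB limit; it passes now.
            SliceSizeGuardSketch.checkSliceSize(1_500_000_000L, context);
        }
    }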

@sdruzkin sdruzkin merged commit 7f1bae2 into prestodb:master Dec 6, 2024
58 checks passed