ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns #971
Conversation
In the old code, when dictionary-string-encoded columns are read with vectorized reading, two copies of the current stripe's dictionary data and one copy of the next stripe's dictionary data are held in memory when reading across stripes. That can make vectorized reading use more memory than row-by-row reading. This patch fixes the issue so that only one copy of the current stripe's dictionary data is held.
The patch has three parts:
1) Read the data directly into a primitive byte array rather than using DynamicByteArray as an intermediate buffer. The intermediate DynamicByteArray causes two copies of the current stripe's dictionary data to be held in memory (a sketch of this follows below).
2) Read the dictionary data lazily, only when the current batch is read. In the previous code, RecordReaderImpl's nextBatch method read the next stripe's dictionary data through advanceToNextRow, so two stripes' dictionary data were held in memory at once; with lazy reading, only one stripe's dictionary data is held when reading across stripes.
3) Before lazily reading the current stripe's dictionary data, remove the batch data's reference to the previous stripe's dictionary data, so GC can reclaim the previous stripe's dictionary memory.
See background info here: https://issues.apache.org/jira/browse/ORC-1060
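For part 1, here is a minimal sketch (not the actual ORC reader code; the class and method names are made up, and ByteArrayOutputStream merely stands in for DynamicByteArray) contrasting the intermediate-buffer shape with reading straight into a pre-sized primitive array:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class DictionaryReadSketch {

  // Old shape: accumulate into a growable intermediate buffer, then copy it
  // out. Near the end, both the buffer and the final array are alive, so peak
  // memory is roughly twice the dictionary size.
  static byte[] viaIntermediateBuffer(InputStream in) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    byte[] chunk = new byte[8 * 1024];
    int n;
    while ((n = in.read(chunk)) > 0) {
      buffer.write(chunk, 0, n);
    }
    return buffer.toByteArray(); // second copy of the dictionary bytes
  }

  // New shape: assuming the dictionary's total length is known up front, the
  // bytes can be read straight into one exactly-sized primitive array.
  static byte[] direct(InputStream in, int totalLength) throws IOException {
    byte[] data = new byte[totalLength];
    int offset = 0;
    while (offset < totalLength) {
      int read = in.read(data, offset, totalLength - offset);
      if (read < 0) {
        throw new IOException("dictionary stream ended early");
      }
      offset += read;
    }
    return data; // the only copy of the dictionary bytes
  }
}
```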
Thank you for making a PR, @expxiaoli.
cc @pgaref
Is there any potential regression in terms of speed?
@dongjoon-hyun In my perf test there was no speed regression with this patch. Here are the scan-time results for running the Spark query "insert overwrite table res_table select mapping['tag_a'] from src_table"; "scan time" is the Spark metric in FileSourceScanExec's doExecuteColumnar method, which measures only the scan operator.

[Results table omitted: executor-memory | new ORC with this patch | old ORC]

Besides, this patch removes the memory allocation for DynamicByteArray as well as the memory copy from DynamicByteArray to the primitive byte array, and it does not add any other time-consuming logic, so I think there is no potential regression in speed.
+1, LGTM. Merged to master. Thank you, @expxiaoli.
Merry Christmas!
@expxiaoli, I added you to the Apache ORC contributor group and assigned ORC-1060 to you.
…ng encoding columns (#971) (cherry picked from commit 3a2cb60) Signed-off-by: Dongjoon Hyun <[email protected]>
I cherry-picked this to branch-1.7 for the next Apache ORC 1.7.3 release.
Thanks @dongjoon-hyun
What changes were proposed in this pull request?
In the old code, when dictionary-string-encoded columns are read with vectorized reading, two copies of the current stripe's dictionary data and one copy of the next stripe's dictionary data are held in memory when reading across stripes. That can make vectorized reading use more memory than row-by-row reading. This patch fixes the issue so that only one copy of the current stripe's dictionary data is held.
The patch has three parts:
Read the data directly into a primitive byte array rather than using DynamicByteArray as an intermediate buffer. The intermediate DynamicByteArray causes two copies of the current stripe's dictionary data to be held in memory.
Read the dictionary data lazily, only when the current batch is read. In the previous code, RecordReaderImpl's nextBatch method read the next stripe's dictionary data through advanceToNextRow, so two stripes' dictionary data were held in memory at once; with lazy reading, only one stripe's dictionary data is held when reading across stripes.
Before lazily reading the current stripe's dictionary data, remove the batch data's reference to the previous stripe's dictionary data so GC can reclaim the previous stripe's dictionary memory (see the sketch after this list).
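As a rough sketch of parts 2 and 3 (the class and method names here are hypothetical, not the actual ORC tree-reader API), the essential ordering is: drop the reference to the old dictionary when a new stripe starts, and materialize the new dictionary only on the first batch read of that stripe:

```java
import java.io.IOException;

class LazyDictionaryHolder {

  /** Hypothetical stand-in for the stripe's dictionary stream. */
  interface DictionarySource {
    byte[] readAllBytes() throws IOException;
  }

  private byte[] dictionaryBytes;          // current stripe's dictionary, or null
  private DictionarySource pendingSource;  // set when a new stripe starts

  /** Called when the reader advances to a new stripe. */
  void startStripe(DictionarySource source) {
    dictionaryBytes = null;   // part 3: the previous dictionary becomes GC-able
    pendingSource = source;   // part 2: remember where to read from, but do not read yet
  }

  /** Called from the batch-reading path; loads at most once per stripe. */
  byte[] dictionary() throws IOException {
    if (dictionaryBytes == null) {
      dictionaryBytes = pendingSource.readAllBytes();
    }
    return dictionaryBytes;
  }
}
```

In the actual patch, the reference that matters is the one the output batch keeps to the previous stripe's dictionary bytes; the sketch only models the ordering, so that at most one stripe's dictionary stays reachable at a time.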
Why are the changes needed?
Reduce memory usage.
How was this patch tested?
Pass the existing CIs.