
ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns #971

Merged
merged 2 commits into apache:main on Dec 25, 2021

Conversation

expxiaoli
Contributor

@expxiaoli expxiaoli commented Dec 14, 2021

What changes were proposed in this pull request?

In the old code, when dictionary-encoded string columns are read with the vectorized reader, two copies of the current stripe's dictionary data and one copy of the next stripe's dictionary data are held in memory when reading across stripe boundaries. That can make the vectorized reader use more memory than the row reader. This patch fixes the issue so that only one copy of the current stripe's dictionary data is held.

The patch has three parts:

  1. Read data directly into a primitive byte array instead of using DynamicByteArray as an intermediate buffer. Using DynamicByteArray as an intermediate causes two copies of the current stripe's dictionary data to be held in memory.

  2. Read dictionary data lazily, only when the current batch's data is actually read. Previously, the RecordReaderImpl class's nextBatch method read the next stripe's dictionary data through the advanceToNextRow method, so memory held two stripes' dictionary data at once. With lazy reading, only one stripe's dictionary data is held in memory when reading across stripe boundaries.

  3. Before lazily reading the current stripe's dictionary data, clear the batch data's reference to the previous stripe's dictionary data, allowing the GC to reclaim that memory.
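The three parts above can be sketched in simplified form. This is a minimal illustration of the pattern only; the class and method names (LazyDictionary, startStripe, getDictionary, readStream) are hypothetical and do not match the actual ORC internals:

```java
import java.util.Arrays;

// Hypothetical sketch: one lazily-loaded dictionary copy per stripe.
public class LazyDictionary {
    private byte[] dictBytes;   // single copy of the current stripe's dictionary
    private boolean loaded = false;

    // Part 3: drop the reference to the previous stripe's dictionary
    // before the next one is (lazily) loaded, so GC can reclaim it.
    public void startStripe() {
        dictBytes = null;
        loaded = false;
    }

    // Part 2: the dictionary is only materialized when a batch needs it.
    public byte[] getDictionary(int totalLength) {
        if (!loaded) {
            // Part 1: fill one exact-size primitive byte[] directly,
            // with no DynamicByteArray intermediate copy.
            dictBytes = readStream(totalLength);
            loaded = true;
        }
        return dictBytes;
    }

    // Stand-in for draining the dictionary stream.
    private byte[] readStream(int totalLength) {
        byte[] buf = new byte[totalLength];
        Arrays.fill(buf, (byte) 'x'); // placeholder data
        return buf;
    }

    public static void main(String[] args) {
        LazyDictionary d = new LazyDictionary();
        d.startStripe();
        byte[] first = d.getDictionary(8);
        // repeated reads within the same stripe reuse the same array
        System.out.println(first == d.getDictionary(8)); // true
        d.startStripe(); // next stripe: the old array becomes collectable
        System.out.println(first == d.getDictionary(8)); // false: freshly loaded
    }
}
```

The key property is that at any moment at most one stripe's dictionary bytes are reachable, which is what keeps vectorized reading close to row reading in memory usage.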

Why are the changes needed?

Reduce memory usage.

How was this patch tested?

Pass the existing CIs.

@github-actions github-actions bot added the JAVA label Dec 14, 2021
@expxiaoli expxiaoli changed the title ORC-1060: reduce memory usage when vectorized reading dictionary stri… ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns Dec 14, 2021
@expxiaoli
Contributor Author

see background info here: https://issues.apache.org/jira/browse/ORC-1060

@dongjoon-hyun
Member

Thank you for making a PR, @expxiaoli .

@dongjoon-hyun
Member

cc @pgaref

@expxiaoli
Contributor Author

Here is a perf test.
I created an ORC table named src_table with a map<string, string> column named mapping, whose per-stripe string dictionary can occupy 466 MB of memory.
Then I ran a Spark query to read this column:
insert overwrite table res_table select mapping['tag_a'] from src_table;

With the old ORC lib, the query only succeeds when executor-memory is set to 2500M or more; otherwise it fails with an OOM exception. Here is the result from the MAT tool when running the query with 2500M executor-memory:
[Screenshot: memory usage (DynamicByteArray)]

With the ORC lib including this patch, the query succeeds with executor-memory reduced to 1200M.

@expxiaoli
Contributor Author

@pgaref @wgtmac Could you review and verify it?
For reading string-dictionary-encoded columns, this PR reduces batch reading's memory usage to nearly that of row reading, which solves the OOM issue when migrating from row reading to batch reading.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Is there any potential regression in terms of speed?

@expxiaoli
Contributor Author

expxiaoli commented Dec 23, 2021

@dongjoon-hyun In my perf test there is no speed regression with this patch. Here are the scan-time results for the Spark query "insert overwrite table res_table select mapping['tag_a'] from src_table".

"scan time" is the Spark metric in the FileSourceScanExec class's doExecuteColumnar method, which measures only the scan operator.

| executor-memory | new ORC with this patch | old ORC |
| --------------- | ----------------------- | ------- |
| 2500M           | 37.2s                   | 34.6s   |
| 2250M           | 36.5s                   | OOM     |
| 1100M           | 30.2s                   | OOM     |
| 1000M           | OOM                     | OOM     |

Besides, this patch removes the memory allocation for DynamicByteArray as well as the memory copy from DynamicByteArray into the primitive byte array, and it adds no other time-consuming logic, so I see no potential speed regression.

@guiyanakuang
Member

Before:
InStream ----> DynamicByteArray(data[][]) ----> byte[]

After:
InStream ----> byte[]

DynamicByteArray plays no other role in this context and seems completely redundant. I approve of this PR.
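The "after" path above can be sketched as a single exact-size read, assuming the dictionary's total byte length is known up front from the stripe metadata. This is a minimal illustration, not the actual ORC code; a plain InputStream stands in for ORC's InStream, and the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the direct read path: InStream ----> byte[].
public class DirectRead {
    // The total length is known, so one exact-size byte[] is filled in
    // place -- a single copy, with no growable intermediate buffer.
    static byte[] readDirect(InputStream in, int totalLength) throws IOException {
        byte[] buf = new byte[totalLength];
        int off = 0;
        while (off < totalLength) {
            int n = in.read(buf, off, totalLength - off);
            if (n < 0) {
                throw new IOException("unexpected end of stream");
            }
            off += n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello-dictionary".getBytes("UTF-8");
        byte[] out = readDirect(new ByteArrayInputStream(data), data.length);
        System.out.println(new String(out, "UTF-8")); // hello-dictionary
    }
}
```

With a growable intermediate (the "before" path), the chunked buffer and the final byte[] coexist during the copy, which is where the second in-memory copy of the dictionary came from.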

@dongjoon-hyun dongjoon-hyun added this to the 1.8.0 milestone Dec 25, 2021
@dongjoon-hyun dongjoon-hyun changed the title ORC-1060: reduce memory usage when vectorized reading dictionary string encoding columns ORC-1060: Reduce memory usage when vectorized reading dictionary string encoding columns Dec 25, 2021
Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Merged to master. Thank you, @expxiaoli.
Merry Christmas!

@dongjoon-hyun dongjoon-hyun merged commit 3a2cb60 into apache:main Dec 25, 2021
@dongjoon-hyun
Member

@expxiaoli . I added you to the Apache ORC contributor group and assigned ORC-1060 to you.
Welcome to the Apache ORC community.

dongjoon-hyun pushed a commit that referenced this pull request Dec 29, 2021
(cherry picked from commit 3a2cb60)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Member

dongjoon-hyun commented Dec 29, 2021

I cherry-picked this to branch-1.7 for the next Apache ORC 1.7.3 release.
We will test the performance impact with the downstream projects before releasing.

@dongjoon-hyun dongjoon-hyun modified the milestones: 1.8.0, 1.7.3 Dec 29, 2021
@expxiaoli
Contributor Author

Thanks @dongjoon-hyun
