-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-12879][SQL] improve the unsafe row writing framework #10809
Conversation
// need to clear it out every time. | ||
"" | ||
} else { | ||
s"$rowWriter.zeroOutNullBites();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I made a different decision compare to the unsafe parquet reader. We can clear out the null bits at beginning, and call UnsafeRowWriter.write
instead of UnsafeRow.setXXX
, which saves one null bits updating. If null values are rare, this one should be faster. I'll benchmark it later.
cc @nongli
Test build #49602 has finished for PR 10809 at commit
|
3978711
to
9a63852
Compare
zeroOutNullBites(); | ||
} | ||
|
||
public void zeroOutNullBites() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NullBytes
Test build #49606 has finished for PR 10809 at commit
|
// need to clear it out every time. | ||
"" | ||
} else { | ||
s"$rowWriter.zeroOutNullBytes();" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I made a different decision compare to the unsafe parquet reader. We can clear out the null bits at beginning, and call UnsafeRowWriter.write
instead of UnsafeRow.setXXX
, which saves one null bits updating. If null values are rare, this one should be faster. I'll benchmark it later.
cc @nongli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense for me.
Test build #49647 has finished for PR 10809 at commit
|
LGTM |
* A helper class to write data into global row buffer using `UnsafeRow` format. | ||
* | ||
* It will remember the offset of row buffer which it starts to write, and move the cursor of row | ||
* buffer while writing. If a new record comes, the cursor of row buffer will be reset, so we need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new record
mean nested struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the record that this writer is responsible to write, it can be the whole row record, or a nested struct, or even a struct type element in array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean the new record
is not clear to me, it should be nested struct.
Test build #50026 has finished for PR 10809 at commit
|
LGTM, merging this into master, thanks! |
As we begin to use unsafe row writing framework(
BufferHolder
andUnsafeRowWriter
) in more and more places(UnsafeProjection
,UnsafeRowParquetRecordReader
,GenerateColumnAccessor
, etc.), we should add more doc to it and make it easier to use.This PR abstract the technique used in
UnsafeRowParquetRecordReader
: avoid unnecessary operatition as more as possible. For example, do not always point the row to the buffer at the end, we only need to update the size of row. If all fields are of primitive type, we can even save the row size updating. Then we can apply this technique to more places easily.a local benchmark shows
UnsafeProjection
is up to 1.7x faster after this PR:old version
new version
For single non-nullable long(the best case), we can have about 1.7x speed up. Even it's nullable, we can still have 1.3x speed up. For other cases, it's not such a boost as the saved operations only take a little proportion of the whole process. The benchmark code is included in this PR.