-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP][SPARK-29221][SQL] LocalTableScanExec: handle the case where executors are accessing "null" rows #25913
Conversation
I'll remove [WIP] once the 5 sequential tests will not fail on |
retest this, please |
3 similar comments
retest this, please |
retest this, please |
retest this, please |
Test build #111273 has finished for PR 25913 at commit
|
NPE got resolved in LocalTableScanExec, but further step triggers another exception:
and actual stack trace is
|
I guess the approach should be changed, as there seems to be plenty of pieces in physical plans which are not safe to execute in executors. Something might seem to be changed recently. spark/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala Lines 93 to 102 in 7c02c14
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala Lines 63 to 72 in c61270f
spark/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala Lines 976 to 983 in fff2e84
When values/methods in SparkPlan are referenced in executor side, |
Test build #111274 has finished for PR 25913 at commit
|
Test build #111275 has finished for PR 25913 at commit
|
Test build #111276 has finished for PR 25913 at commit
|
@@ -37,7 +37,7 @@ case class LocalTableScanExec( | |||
"numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows")) | |||
|
|||
@transient private lazy val unsafeRows: Array[InternalRow] = { | |||
if (rows.isEmpty) { | |||
if (rows == null || rows.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot remove @transient
instead of this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding transient is to avoid serializing and copying whole rows, so I guess it's preferred one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. I'm a bit worried that this value could be different between a driver and executors. Which code touches this variable in executors? What's the stacktrace for unsafeRows
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, that's actually a kind of defensive programming, as it would bring NPE if there's no check for null. I haven't seen any actual stack trace to refer to unsafeRows
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think unsafeRows
is not expected to be called in executor sides, so we need to check assert(rows != null)
instead of the current approach. If it fails then, we need to fix some code not to call this variable in executors.
Test build #111277 has finished for PR 25913 at commit
|
It would be nice if someone in Spark SQL expert in depth could spend some time to investigate origin issue. I could apply sequential of band-aids until the further error message is gone, but I feel there might be another root issue on this. |
Just updated the observation of second kind of error pattern. |
Triggering new 3 builds to see the chance of encountering new occurrence. |
retest this, please |
Test build #111289 has started for PR 25913 at commit |
retest this, please |
Test build #111291 has started for PR 25913 at commit |
retest this, please |
Test build #111292 has started for PR 25913 at commit |
OK all three builds were cancelled. I'll just retrigger these builds. |
retest this, please |
Just reported the build failure to dev. mailing list as it seems to be env. issue. Not all PR builds have been failed, so worth retriggering. |
retest this, please |
cc. @gatorsmile @cloud-fan @viirya to kindly ask for help to investigate actual cause. |
if (rows.isEmpty) { | ||
if (rows == null) { | ||
Iterator("<unknown>", output) | ||
} else if (rows.isEmpty) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to return different values in a driver and executors? How about computing this value in a driver then passing into executors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value except output
should be meaningless, otherwise assumption in related comments goes wrong and we should serialize the rows. I guess that's only for visual.
7079481
to
c3b59c6
Compare
I'm wondering why the test fails intermittently, as the stack trace should always bring NPE. My bet is that it goes to the conditional case and trigger this.
EDIT: |
Test build #111309 has finished for PR 25913 at commit
|
Test build #111312 has finished for PR 25913 at commit
|
Test build #111311 has finished for PR 25913 at commit
|
Test build #111315 has finished for PR 25913 at commit
|
Submitted #25925 to fix it. |
Anyway, nice catch, @HeartSaVioR ! |
Happy to help! Thanks again @viirya to deal with root cause. :) |
Thanks @HeartSaVioR for investigating this issue! |
What changes were proposed in this pull request?
This patch proposes to fix the flaky test in
SQLQueryTestSuite.sql (subquery/scalar-subquery/scalar-subquery-select.sql)
, which seems to throw NPE in executor side.Both
rows
andunsafeRows
in LocalTableScanExec are defined as@transient
assuming that they're not needed in executors, but according to the error message and stack trace, at leastrows
has a path to be accessed from executors.unsafeRows
will also throw NPE when executors access the value, as it accesses rows which would be null - this patch fixes it altogether.Why are the changes needed?
This patch tries to fix a flaky test.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Will trigger CI build multiple times as
build/sbt "~sql/test-only *SQLQueryTestSuite -- -z subquery/scalar-subquery/scalar-subquery-select.sql"
doesn't fail in my local dev.