[SPARK-23179][SQL] Support option to throw exception if overflow occurs during Decimal arithmetic #20350
Conversation
} else {
  val message = s"$toDebugString cannot be represented as Decimal($precision, $scale)."
  if (nullOnOverflow) {
    logWarning(s"$message NULL is returned.")
I am not sure if we should log this message. If we hit this often we'll end up with huge logs.
If we hit it often, the result we get is quite useless. I added it only to notify the user of something which is an unexpected/undesired situation and currently happens silently. I think it is bad that the user cannot know whether a NULL is the result of an operation involving NULLs or the result of an overflow.
I agree that a result becomes less useful if we return nulls often. My problem is more that if we process a million non-convertible decimals we log the same message a million times, which is going to cause a significant regression. Moreover, this is logged on the executor, and an end-user typically does not look at those logs (there is also no reason to do so since the job does not throw an error).
My suggestion would be to not log at all, or to log just once. I prefer not to log at all.
I see your point, and I agree with you. But I wanted to leave some trace of what was happening. What about using DEBUG as the log level? In that case, most of the time we are not logging anything, but if we want to check whether an overflow is happening, we can. What do you think?
I am ok with using debug/trace level logging. Can you make sure we do not construct the message unless we are logging or throwing the exception (changing the `val` into a `def` should be enough)?
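A minimal sketch of what is being asked for, assuming the names from the diff above (`toDebugString`, `precision`, `scale`, `nullOnOverflow`); the point is that a `def` builds the string only when a branch actually needs it:

```scala
// Sketch only: with `def`, the message is constructed lazily, on demand.
def message: String =
  s"$toDebugString cannot be represented as Decimal($precision, $scale)."

if (nullOnOverflow) {
  // The string is built only if debug logging is actually enabled.
  logDebug(s"$message NULL is returned.")
  null
} else {
  throw new ArithmeticException(message)
}
```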
Test build #86484 has finished for PR 20350 at commit
Test build #86488 has finished for PR 20350 at commit
Jenkins, retest this please
Test build #86495 has finished for PR 20350 at commit
@@ -49,7 +49,6 @@ select 1e35 / 0.1;

-- arithmetic operations causing a precision loss are truncated
select 123456789123456789.1234567890 * 1.123456789123456789;
select 0.001 / 9876543210987654321098765432109876543.2
I think it is missing a `;` before...
yes, unfortunately I missed it somehow previously...
Test build #86524 has finished for PR 20350 at commit
.doc("When true (default), if an overflow on a decimal occurs, then NULL is returned. " + | ||
"Spark's older versions and Hive behave in this way. If turned to false, SQL ANSI 2011 " + | ||
"specification, will be followed instead: an arithmetic exception is thrown. This is " + | ||
"what most of the SQL databases do.") |
Tiny nit:
If turned to false, SQL ANSI 2011 specification, will be followed instead
This should be
If turned to false, SQL ANSI 2011 specification will be followed instead
Test build #86528 has finished for PR 20350 at commit
Test build #86533 has finished for PR 20350 at commit
kindly ping @gatorsmile @cloud-fan
Thanks for your contributions! Could you ping us again after the 2.3 release?
sure, thanks @gatorsmile
Test build #87154 has finished for PR 20350 at commit
retest this please
Test build #87794 has finished for PR 20350 at commit
the error is unrelated, and I am seeing it frequently across builds. It seems something caused the flakiness of this test to increase. There is already a ticket for it: SPARK-23369, but it is becoming more and more important to fix it. It would also be great to check what increased the flakiness...
retest this please
sorry @gatorsmile, now that the RC for 2.3 has passed the vote, do you happen to have time to look at this? Thanks.
Test build #87841 has finished for PR 20350 at commit
Sure, will do the review in the next few days.
Test build #92087 has finished for PR 20350 at commit
retest this please
Test build #93072 has finished for PR 20350 at commit
My understanding from #21499 (comment) is that the plan you have in mind is to have this in 3.0 and not in 2.4 @gatorsmile, am I right? If this is the case, shall I close this now and reopen once 2.4 is out? Thanks.
Test build #93401 has finished for PR 20350 at commit
hi @mgaido91! I'm hoping to have this feature in 3.0 too. Thank you for the work here :) For now I'm trying to cherry-pick this to our local fork. I'd love your input here please: how do you think we would handle the overflow problem when our dataset doesn't involve any arithmetic operation on a SQL type? When it is just round-tripping from a JVM BigDecimal to a SQL Decimal, we can still get back null. Concretely, this PR nicely fixes arithmetic to throw an exception instead of returning null when we have the option enabled.
However, dataset round-tripping here will still return null.
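A hedged sketch of the round-tripping being described; the `Amount` case class and the value are illustrative, not taken from the PR, and the result shown is the behavior this comment reports:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a value with more digits than the default Decimal(38, 18) can hold.
case class Amount(value: BigDecimal)

val spark = SparkSession.builder().master("local[*]").appName("overflow-sketch").getOrCreate()
import spark.implicits._

val tooBig = BigDecimal("9" * 30 + ".123")   // ~30 integer digits plus fraction: more than 38 digits total
val ds = Seq(Amount(tooBig)).toDS()

// No arithmetic involved: just encoding and collecting the value back.
// As reported here, the overflowing field silently comes back as null.
ds.collect().foreach(println)                // Amount(null)
```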
@mickjermsurawong-stripe thanks for your comment. I am updating this PR to resolve the conflicts, and I hope that your feedback will help this PR move forward. As far as your question is concerned, you may consider adding an
Test build #106780 has finished for PR 20350 at commit
Test build #106782 has finished for PR 20350 at commit
@@ -1441,6 +1441,16 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

  val DECIMAL_OPERATIONS_NULL_ON_OVERFLOW =
    buildConf("spark.sql.decimalOperations.nullOnOverflow")
overflow can happen with non-decimal operations too; do we need a new config?
cc @JoshRosen
Thanks for taking a look at this @cloud-fan!
Yes, that case (non-decimal) is handled in #21599. I'd say that, in the non-decimal case, the situation is pretty different. Indeed, overflow in decimal operations is now handled by Spark, by converting overflowing results to null, while overflow in non-decimal operations isn't handled at all currently.
In non-decimal operations we instead return a wrong value (the Java way). So IMHO the current behavior in the non-decimal case doesn't make any sense (considering this is SQL and not a low-level language like Java/Scala) and is not worth keeping (we already discussed this in that PR actually).
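For illustration, the "Java way" referred to above, in plain Scala (plain JVM semantics, not Spark code):

```scala
// JVM integer arithmetic wraps around silently on overflow: no error, just a wrong value.
val x: Int = Int.MaxValue      //  2147483647
val y: Int = x + 1             // -2147483648

// Python, for example, widens integers instead of wrapping,
// which is the contrast being drawn in this comment.
```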
A DB does not have to follow the SQL standard completely in every corner. The current behavior in Spark is by design and I don't think that's nonsense.
I do agree that it's a valid requirement that some users want overflow to fail, but it should be protected by a config.
My question is whether we need one config for overflow, or two configs, for decimal and non-decimal.
A DB does not have to follow the SQL standard completely in every corner. The current behavior in Spark is by design and I don't think that's nonsense.
I am sorry, but I don't really agree with you on this. I see the discussion is a bit OT, but I'd just like to explain the reasons for my opinion. SQL is a declarative language, and here we are coupling the result/behavior to the specific execution language we are using. Spark is cross-language, but for arithmetic operations overflow works in the very peculiar way of the JVM language we use, which is:
- against the SQL standard, and no other DB deviates from the SQL standard on this, so it is (at least) very surprising for SQL users;
- different from what happens in Python and in R when you overflow in those languages (an Int becomes a long and so on there).
So there is no Spark user, other than Scala/Java ones, who might understand the behavior Spark has in those cases. Sorry for being a bit OT, anyway.
My question is whether we need one config for overflow, or two configs, for decimal and non-decimal.
Yes, this is the main point here. IMHO I'd prefer two configs, because when the config is turned off the behavior is completely different: in one case it returns null, in the other we return wrong results. But I also see the value in reducing the number of configs as much as possible, which is already pretty big. So I'd prefer two configs, but if you and the community think one is better, I can update the PR to make this config more generic.
Thanks for your feedback and the discussion!
For now, I think separate flags are okay. Here's why:
- While eventually we probably want to add flaggable non-Decimal overflow detection (see [SPARK-26218][SQL] Overflow on arithmetic operations returns incorrect result #21599 (comment)), these PRs should land separately (to limit the scope of changes / code review). If we give this PR's flag a generic name, merge this PR, and then somehow fail to merge the integer-overflow PR in time for 3.0, then we'd be facing a situation where we'd need to change the behavior of a released flag if we later merged the non-Decimal overflow PR.
- If we implement separate flags for each type of overflow, that doesn't preclude us from later introducing a single flag which is used as the default value for the per-type flags (a sketch of that idea follows below).
I'm interested in whichever option allows us to make incremental progress by getting this merged (even if flagged off by default), so that we can rely on this functionality being available in 3.x instead of having to maintain it indefinitely in our own fork (with all of the associated long-term maintenance and testing burdens).
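A hedged sketch of the second bullet's idea, in the config-builder style visible in the diff above; the names are illustrative only and are not configs proposed in this PR:

```scala
// Illustrative names only: per-type flags defaulting to one umbrella flag.
val FAIL_ON_OVERFLOW =
  buildConf("spark.sql.arithmeticOperations.failOnOverflow")
    .doc("Umbrella flag: when true, arithmetic overflow throws instead of returning a result.")
    .booleanConf
    .createWithDefault(false)

val DECIMAL_FAIL_ON_OVERFLOW =
  buildConf("spark.sql.decimalOperations.failOnOverflow")
    .doc("Per-type flag for decimals; falls back to the umbrella flag when unset.")
    .fallbackConf(FAIL_ON_OVERFLOW)
```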
One followup question regarding flag naming: is "overflow" the most precise term for the change made here? Or does this flag also change behavior in precision-loss scenarios? Maybe I'm getting tripped up on terminology here, since insufficient precision to represent small fractional quantities is essentially an "overflow" of the digit space reserved to represent the fractional part.
Thanks for your comments @JoshRosen.
Yes, this deals with the overflow case. Underflow (or precision loss) is handled in a different way and its behavior depends on another config (see SPARK-22036): it either avoids precision loss, eventually causing overflow (the old behavior), or truncates (as defined by the SQL standard and closely following the SQL Server behavior from which we derived our decimal operations implementation). So this flag is related only to the overflow case.
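To make the distinction concrete, a hedged sketch (assuming an active `spark` session; the first query is taken from the diff earlier in this PR, and the precision-loss config name comes from SPARK-22036):

```scala
// Precision loss: the result still fits DECIMAL(38, _) but some fractional digits are dropped.
// Governed by spark.sql.decimalOperations.allowPrecisionLoss (SPARK-22036):
// truncate when true, or keep full precision when false (which may then overflow).
spark.sql("SELECT 123456789123456789.1234567890 * 1.123456789123456789").show()

// Overflow: the result does not fit in 38 digits at all, regardless of scale.
// That case is governed by the flag in this PR: NULL by default, or an ArithmeticException.
```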
hi @mgaido91 I found two other places where we might want to add this check for consistent behavior
@mickjermsurawong-stripe let me answer your two comments separately:
Not really; when we serialize a SQL decimal to a Java/Scala BigDecimal, we cannot have overflow, so this patch/case doesn't apply (as you can see, there is not even a
Thanks for pointing this out! Actually this is a weird operator, because it does throw an exception on overflow in interpreted mode, while in codegen it doesn't. Moreover, it throws a
SGTM. I'm merging this PR, thanks!
thanks @cloud-fan and thank you all for the reviews!
I think we should catch overflow when encoding a Java/Scala BigDecimal to a SQL decimal (not the other way round), and that happens in the SerializerBuildHelper. To the current test structure in
To pinpoint the encoder part, the new test here shows a null row for the decimal type.
I can make a separate PR on this if this sounds good to you.
+1; it sounds like the pre-existing difference between the codegen and interpreted paths is a separate, pre-existing bug. It's also especially hard to reason about because (AFAIK) the paths in
For consistency, I think we should:
Let's make a followup JIRA for this change and a separate JIRA for the encoder changes @mickjermsurawong-stripe discussed in his comment (I can loop back later this morning or afternoon to help file these). Edit: in @mickjermsurawong-stripe's PR we can improve test coverage for both sets of encoders (
I see now, sorry for the misunderstanding @mickjermsurawong-stripe. I think it is fine to go ahead with your PR. I created https://issues.apache.org/jira/browse/SPARK-28200 for it, so please go ahead and submit your PR for that JIRA. I also created https://issues.apache.org/jira/browse/SPARK-28201 for the
Thanks!
## What changes were proposed in this pull request?
In SPARK-23179, a flag was introduced to control the behavior in case of overflow on decimals. The behavior is: return `null` when `spark.sql.decimalOperations.nullOnOverflow` is true (the default and traditional Spark behavior); throw an `ArithmeticException` if that conf is false (per the SQL standard and other DBs' behavior).
`MakeDecimal` so far had an ambiguous behavior: in codegen mode it returned `null` like the other operators, but in interpreted mode it threw an `IllegalArgumentException`. The PR aligns `MakeDecimal`'s behavior with that of the other operators as defined in SPARK-23179, so now both modes return `null` or throw an `ArithmeticException` according to `spark.sql.decimalOperations.nullOnOverflow`'s value.
Credits for this PR to mickjermsurawong-stripe who pointed out the wrong behavior in apache#20350.
## How was this patch tested?
improved UTs
Closes apache#25010 from mgaido91/SPARK-28201.
Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?
- Currently, `ExpressionEncoder` does not handle BigDecimal overflow: round-tripping an overflowing java/scala BigDecimal/BigInteger returns null.
- The serializer encodes a java/scala BigDecimal to a SQL Decimal, which still holds the underlying data of the former.
- When writing out to UnsafeRow, `changePrecision` will be false and the row has a null value. https://github.com/apache/spark/blob/24e1e41648de58d3437e008b187b84828830e238/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L202-L206
- In [SPARK-23179](apache#20350), an option to throw an exception on decimal overflow was introduced.
- This PR adds the option in `ExpressionEncoder` to throw when an overflowing BigDecimal/BigInteger is detected, before its corresponding Decimal gets written to the Row. This gives consistent behavior between decimal arithmetic on SQL expressions (DecimalPrecision) and getting decimals from a dataframe (RowEncoder).
Thanks to mgaido91 for the very first PR `SPARK-23179` and the follow-up discussion on this change. Thanks to JoshRosen for working with me on this.
## How was this patch tested?
added unit tests
Closes apache#25016 from mickjermsurawong-stripe/SPARK-28200.
Authored-by: Mick Jermsurawong <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
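A hedged sketch of the end-to-end behavior after this follow-up lands, under the assumption that the encoder honors the same flag; the `Amount` class is the same illustrative one used earlier in this conversation:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative class, as in the earlier sketch.
case class Amount(value: BigDecimal)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "false")

// With SPARK-28200, encoding a BigDecimal that cannot fit Decimal(38, 18)
// should raise an ArithmeticException instead of silently producing a null field.
Seq(Amount(BigDecimal("9" * 30 + ".1"))).toDS().collect()

// With the flag left at its default (true), the legacy null result is kept.
```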
What changes were proposed in this pull request?
SQL ANSI 2011 states that in case of overflow during arithmetic operations, an exception should be thrown. This is what most SQL DBs do (e.g. SQL Server, DB2). Hive currently returns NULL (as Spark does), but HIVE-18291 is open to make it SQL compliant.
The PR introduces an option to decide which behavior Spark should follow, i.e. returning NULL on overflow or throwing an exception.
How was this patch tested?
added UTs
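A hedged usage sketch of the option this PR proposes, assuming an active `spark` session and using the config name shown in the diff above; the defaults and the exception behavior are as described in this conversation:

```scala
// Default / legacy behavior: overflow on decimal arithmetic yields NULL.
spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "true")
spark.sql(
  """SELECT CAST('98765432109876543210' AS DECIMAL(20, 0)) *
    |       CAST('98765432109876543210' AS DECIMAL(20, 0)) AS product""".stripMargin
).show()
// +-------+
// |product|
// +-------+
// |   null|      <- the ~40-digit product does not fit DECIMAL(38, 0)
// +-------+

// ANSI-style behavior: re-running the same query now throws an ArithmeticException.
spark.conf.set("spark.sql.decimalOperations.nullOnOverflow", "false")
```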