fix: standardize KSQL up-casting #3516
Conversation
Logic for controlling which types can be up-cast to another, e.g. up-casting an `INT` to a `BIGINT`, was spread across different parts of the code base, had no tests to ensure all parts were consistent, and hence was inconsistent.

Up-casting is primarily used during arithmetic binary operations, e.g. adding two numbers together, and when trying to coerce user-supplied values in `INSERT VALUE` statements to the required SQL types.

Numeric up-casting rules for KSQL are:

- `INT` can be up-cast to `BIGINT`
- `BIGINT` can be up-cast to `DECIMAL`
- `DECIMAL` can be up-cast to `DOUBLE`

In the existing code:

- `SqlBaseType` has a `canUpCast` method, but it doesn't take `DECIMAL`s into account.
- `SqlBaseType` has an `isNumber` method, but it doesn't treat `DECIMAL` as a numeric type.
- `SchemaUtil` has the logic on how to resolve the resulting (Connect) schema given an `Operation` and two input types.
- `DefaultSqlValueCoercer` allowed coercion of `DOUBLE` -> `DECIMAL`, which is against our up-casting rules.

This PR looks to make the code a bit more object oriented and hopefully better structured. With this change:

- `SqlBaseType`'s `canUpCast` and `isNumber` methods correctly handle `DECIMAL`.
- Any type can be up-cast to itself. Only numeric types can be up-cast to other types, and the rules are encapsulated in `SqlBaseType.canUpCast`.
- The logic on how to resolve the resulting SQL type given an `Operation` and two input types now lives in `Operation`, making use of the decimal-specific handling which now lives in `SqlDecimal`.
- The `SqlDecimal` type an `INT` or `BIGINT` is up-cast to is now stored in `SqlTypes`.

However, the main benefit of this commit is the addition of tests in `DefaultSqlValueCoercer` and `OperatorTest` to ensure that these two classes follow the up-casting rules. In the future we can look to move more behaviour into our type system.
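As a rough illustration of the rule chain described above (a sketch only, not the PR's actual implementation; the real `SqlBaseType` has more variants and its method bodies may differ), an enum whose ordinal order encodes the widening chain could express `canUpCast` like this:

```java
// Illustrative sketch of the up-casting rules: each numeric type can be
// up-cast to itself or to any numeric type later in the chain
// INT -> BIGINT -> DECIMAL -> DOUBLE.
enum SqlBaseType {
  BOOLEAN, INTEGER, BIGINT, DECIMAL, DOUBLE, STRING;

  boolean isNumber() {
    return this == INTEGER || this == BIGINT || this == DECIMAL || this == DOUBLE;
  }

  boolean canUpCast(final SqlBaseType to) {
    // Any type can be up-cast to itself; only numeric types can be
    // up-cast to other (wider) numeric types.
    return this == to
        || (isNumber() && to.isNumber() && this.ordinal() <= to.ordinal());
  }
}
```

Under this sketch, `INTEGER.canUpCast(BIGINT)` and `DECIMAL.canUpCast(DOUBLE)` are true, while `DOUBLE.canUpCast(DECIMAL)` is false, matching the rules and tests described in this PR.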
```diff
@@ -47,7 +47,7 @@
     @JsonProperty("window") final WindowData window
 ) {
   this.topicName = topicName == null ? "" : topicName;
-  this.key = key == null ? "" : key;
+  this.key = key;
```
In case you're wondering... a null `key` is valid, and so we shouldn't be converting it to an empty string. Without this change the new tests in `insert-values.json` would fail.
A general question: when up-casting from DECIMAL to DOUBLE, how do we deal with cases where there will be a loss of precision? (I.e. some values that can be represented exactly as decimals can't be represented exactly as doubles.) Do we just silently do the conversion, or do we warn? Perhaps we could log a warning when the query is parsed: "Possible loss of precision in conversion from DECIMAL->DOUBLE".
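A standalone demonstration of the concern (not from the PR): `0.1` is exact as a decimal but has no exact binary representation as a `double`.

```java
import java.math.BigDecimal;

public class DecimalToDoubleLoss {
  public static void main(String[] args) {
    // 0.1 is representable exactly as a DECIMAL...
    final BigDecimal exact = new BigDecimal("0.1");

    // ...but the nearest double is slightly larger. Converting the double
    // back to a BigDecimal exposes the full binary approximation:
    final double approx = exact.doubleValue();
    System.out.println(new BigDecimal(approx));
    // 0.1000000000000000055511151231257827021181583404541015625
  }
}
```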
Perhaps I am misreading the intent of the code, but regarding the approach taken here: there seems to be an assumption that it's possible to calculate the required scale and precision for a decimal to hold the result of an operation based just on the scale and precision of the operands.
I don't think that holds in general (division is an example); the actual values need to be known.
I'm thinking that for an expression involving a decimal we should always assume scale and precision are the maximum possible throughout the calculation, and then only round right at the end, if necessary, if the user has specified a specific decimal scale and precision for a column type.
```java
    + Math.max(left.precision - left.scale, right.precision - right.scale)
    + 1;

final int scale = Math.max(left.scale, right.scale);
```
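A worked example (not from the PR, and assuming the elided first term of the precision expression is `Math.max(left.scale, right.scale)`, the standard SQL-style addition/subtraction rule): adding a `DECIMAL(4, 2)` to a `DECIMAL(3, 1)` gives scale `max(2, 1) = 2` and precision `2 + max(4 - 2, 3 - 1) + 1 = 5`, i.e. `DECIMAL(5, 2)`, which is just wide enough to hold `99.99 + 99.9 = 199.89`.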
I wonder why we don't leave it up to `BigDecimal` to figure out the scale and precision?
See my comment below your review.
```java
final int precision = left.precision - left.scale + right.scale
    + Math.max(6, left.scale + right.precision + 1);

final int scale = Math.max(6, left.scale + right.precision + 1);
```
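A worked example (not from the PR): dividing a `DECIMAL(4, 2)` by a `DECIMAL(3, 1)` gives scale `max(6, 2 + 3 + 1) = 6` and precision `(4 - 2 + 1) + 6 = 9`, i.e. a result type of `DECIMAL(9, 6)`.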
Why do we limit to 6 decimal places?
If we were dividing 1 by 3 and the result was a recurring fraction, wouldn't it be better to have as many decimal places as possible to give the best overall precision? E.g. 0.33333333333333333333333
My gut feeling would be to leave all the scale and precision calculations up to `BigDecimal` and only readjust the scale/precision at the end of the calculation if too many trailing zeros aren't desired.
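A minimal sketch of what that could look like (standalone illustration; the target scale and class name are hypothetical, not from the PR): let `BigDecimal` carry high precision through the calculation and round only when fitting the result to the column's declared type.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class DeferredRounding {
  public static void main(String[] args) {
    // Carry out the division at high precision (34 significant digits)...
    final BigDecimal raw = new BigDecimal("1")
        .divide(new BigDecimal("3"), MathContext.DECIMAL128);
    System.out.println(raw); // 0.3333333333333333333333333333333333

    // ...and only round at the very end, e.g. to fit a hypothetical
    // DECIMAL(4, 2) result column.
    final BigDecimal fitted = raw.setScale(2, RoundingMode.HALF_UP);
    System.out.println(fitted); // 0.33
  }
}
```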
As I understand it, this code is trying to calculate the required scale and precision of a decimal for a particular operation, given the scale and precision of the operands.
While that is possible for some operations (add, subtract and multiply), I don't think it's possible in general for all operations, e.g. for divide, as the required scale and precision for the most exact result depend on the actual numbers and not just the types of the numbers.
See my comment below your review.
```java
@Test
public void shouldNotDownCastDouble() {
  assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.INTEGER), is(false));
  assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.BIGINT), is(false));
  assertThat(SqlBaseType.DOUBLE.canUpCast(SqlBaseType.DECIMAL), is(false));
}
```
We should perhaps allow casting of double to decimal in the case where the double is smaller than a certain value, as it can then be exactly represented as a decimal.
BTW, imho I don't like the term "upcasting"; why not just "casting"?
> We should perhaps allow casting of double to decimal in the case where the double is smaller than a certain value, as it can then be exactly represented as a decimal.

Not sure I follow: casting of double to decimal is supported. Do you mean implicit down-casting? I don't think we can support that, as the casting happens when calculating the output type of a mathematical binary operation on two types, not values. So we can't take the value into account and must always have a deterministic result type based purely on the input types.

Again, see my comment below your review.

> BTW, imho I don't like the term "upcasting"; why not just "casting"?

Not a term I introduced; this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another, e.g. `INT` to `BIGINT`. Might be better if the method was called `canImplicitlyUpCast`?
> In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema long before processing any data.

I would argue that we shouldn't attempt to calculate a fixed scale and precision for the expression, and instead just use the maximum possible precision if there is any arithmetic where we cannot calculate the precision properly in advance. Perhaps I am misunderstanding what is going on here, but rounding division to 6 dps seems just wrong to me.

> Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D

Sure, but I think this should be captured in another issue :)
> > BTW, imho I don't like the term "upcasting"; why not just "casting"?
>
> Not a term I introduced; this was from @agavra. Though I'm assuming it's to differentiate from down-casting, which is different. In this context, the term up-casting is used to mean the ability for KSQL to implicitly convert one number type to another, e.g. `INT` to `BIGINT`. Might be better if the method was called `canImplicitlyUpCast`?

Ok, so I think "upcasting" just means "implicitly cast". (I still don't like the term 'upcasting'.)

And we should only do that where we can guarantee it's safe to do so. However, implicitly casting from DECIMAL to DOUBLE is not always safe. I think we currently assume it's safe because a double can hold bigger values than a decimal, but for some conversions a loss of precision might occur. Hence my previous comment about logging a warning?
Another way of dealing with this would be to not allow DECIMAL->DOUBLE implicit conversions and require the user to do an explicit cast. That's another way of telling the user: "hey, this isn't safe, but if you really want to do it, go ahead and cast".
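In query terms (hypothetical column and source names), that would mean the user writes something like `SELECT CAST(dec_col AS DOUBLE) / 3 FROM t;` rather than relying on silent coercion, so the potential precision loss is visible in the query text itself.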
I think you're right about not using upcast/downcast - this was probably a terminology error on my end. Standard terminology should be `widening` and `narrowing`: https://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.2
I think calling it `canImplicitlyCastTo` would be better, as we're talking about implicit casting, not casting in general.
I think `canWiden` is probably the technically correct term.
Hey @purplefox, thanks for the review! However... all of your comments are on existing code that's just been moved in this PR. Don't get me wrong, you may have valid points and concerns, and we should not ignore them. But we should look to discuss and address them outside the scope of this PR, which is just refactoring the code and, for the most part, maintaining the existing functionality.

In general, the decimal code must calculate a fixed scale and precision based purely on the scale and precision of the input types, as the decimal values are not known when building the schema of the column into which the result of the computation will be stored, i.e. we build the schema of a table long before processing any data. Of course, what scale and precision KSQL determines for any specific input can be up for debate. @agavra was the original developer of the decimal code and may be able to comment more. I believe his implementation is heavily influenced by the approach taken by other DBs, e.g. Postgres.

Again, this is not something I'm proposing in this PR, only enforcing and standardizing on. This rule preexisted this PR. As an aside, I had a long discussion with @agavra on this very subject, as I'm not convinced by this rule either. But that's a discussion outside the scope of this PR. Would you mind reviewing purely on the code changes, rather than the functionality that is being moved? :D
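To make the "types, not values" constraint concrete, here is a minimal sketch (class and method names are illustrative, not the PR's actual API) of resolving a decimal result type from the operand types alone, before any rows exist:

```java
// Hypothetical sketch: the result type of adding two decimals must be
// derivable from the operand types alone, since the output schema is
// built long before any data is processed.
final class DecimalTypeRules {

  record SqlDecimal(int precision, int scale) { }

  static SqlDecimal add(final SqlDecimal left, final SqlDecimal right) {
    // Result scale is the larger of the two operand scales; precision
    // grows by one digit to absorb a possible carry.
    final int scale = Math.max(left.scale(), right.scale());
    final int precision = scale
        + Math.max(left.precision() - left.scale(),
                   right.precision() - right.scale())
        + 1;
    return new SqlDecimal(precision, scale);
  }
}
```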
Approving on the basis that this is just a code refactor, but I think there should be a discussion on the points raised here regarding:
- Scale and precision of operations like division
- Implicit conversion of DECIMAL to DOUBLE
Hey @agavra - do you have some insight on the comments raised in this PR?
Another point for discussion is that the `INSERT VALUE` logic does implicit conversion from `STRING` to `DECIMAL`, but not from `STRING` to any other numeric type.
Responding to the two big points:

With regards to the scale/precision of mathematical operations on decimals, I took this directly from SqlServer semantics.

With regards to considering a DECIMAL -> DOUBLE conversion widening and not narrowing, there were a few things that convinced me it should be that way:

I've given it a lot of thought, and I think what really draws me to my conclusion is that I think it's "safer" to output a double to indicate it is imprecise than to do some magic and call it exact.
Test fixed by switching `RecordNode` to a custom parser that differentiates between a missing node and a node explicitly set to null.
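The PR's actual parser isn't reproduced here, but the distinction it relies on can be sketched with Jackson (assuming that's the JSON library in play), where an absent field and an explicit null are distinguishable:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class MissingVsNull {
  public static void main(String[] args) throws Exception {
    final ObjectMapper mapper = new ObjectMapper();

    // Field present but explicitly set to null:
    final JsonNode explicit = mapper.readTree("{\"key\": null}");
    System.out.println(explicit.has("key"));          // true
    System.out.println(explicit.get("key").isNull()); // true

    // Field missing entirely:
    final JsonNode missing = mapper.readTree("{}");
    System.out.println(missing.has("key"));           // false
  }
}
```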
I wonder what the rationale for SQLServer rounding to 6 dps is? It seems odd to me. If there are multiple operations in a single statement, the errors are going to multiply fast. I still believe that it would be better to do any calculation at maximum precision and only round at the end of the calculation, if necessary to make the result conform to the column schema.
Perhaps we are talking at cross purposes? ;) The point I raised wasn't about converting from doubles to decimals. It was about automatically converting from decimals to doubles and losing precision. I agree the calculation should be performed using doubles, but I think either a) the CAST from decimal to double should be explicit, or b) a warning should be logged ("possible loss of precision in implicit cast from decimal to double").
I got up too early and was a bit bored, so I dug into why SQLServer does this rounding. It seems that SQLServer does it to avoid exceeding the max precision of its decimal type, which is 38: https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=92608 But in our case, we have `BigDecimal` for calculations, which has no such limitation on precision. So I don't think it makes a lot of sense to adopt the SQL Server way of doing things (unless there's a requirement for our behaviour to be exactly like SQL Server!). (Of course we might need to round down at the end of the calculation to conform to the schema of the result column, but I'm referring to during the calculation here.)
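A quick standalone demonstration of that point (not from the PR): `BigDecimal` will happily carry a precision far beyond SQL Server's 38-digit ceiling.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class BeyondThirtyEight {
  public static void main(String[] args) {
    // Compute 1/3 to 100 decimal places; BigDecimal's precision is
    // bounded only by memory, not by a fixed maximum like DECIMAL(38).
    final BigDecimal result = BigDecimal.ONE
        .divide(new BigDecimal("3"), 100, RoundingMode.HALF_UP);
    System.out.println(result.precision()); // 100
  }
}
```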
@purplefox I think this makes sense - my calculus, in order to avoid decision paralysis and move quickly, is usually to just do what other established systems do, because they set a decent precedent 95% of the time. If there are examples of that precedent being less than ideal (as you pointed out in that thread), I'm more than happy having a discussion on what the "correct" behavior should be.
I think logging in our case doesn't really work well because it would happen per-event (maybe we could log once when materializing the query; we should have the schemas at that point 🤔), so the only other option is to fail hard. Failing hard is pretty rare in SQL systems, which seem to try hard to make something work (albeit magically) rather than failing. As an infrastructure engineer that goes against all of my intuitions, but I think if we're developing a SQL product we should develop something that "feels" familiar to that audience.
Testing done
Suitable tests added, including awesome metatests to ensure different parts of the system are consistent.
Reviewer checklist
Outstanding discussion points on decimals (non-blocking):