-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CAST(varchar as decimal) #5307
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Could you help take a look? @Yuhta |
bf1821f
to
8587a2b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak Deepak, would you help review this PR?
Can we add Spark's implementation to the PR description, and ensure all cases are considered in unit test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh Is this emulating the Java implementation? I see a lot of string manipulations.
Can we avoid them and work with offsets for the dot and exponent on the original string?
No, java uses JDK API to convert to decimal.
Where does the convert number with exponent come from? @rui-mo |
@jinchengchenghh They are from Spark unit tests. |
Convert number with exponent comes from @rui-mo , I will try to optimize it. I found postgresql has implementation to convert exponent varchar to numeric, https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L4515, I'm not sure if we can write as this. |
@jinchengchenghh This seems to be the main implementation https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L6942 |
Update to Arrow implementation, can you help to review again? Thanks! @majetideepak |
Do you have further comments? @majetideepak |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @jinchengchenghh
Could you help merge it? Thanks! @mbasmanova |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majetideepak @jinchengchenghh Have we verified that this behavior matches Presto?
I have verified it in Presto. @mbasmanova |
Presto throws exception when input is illegal while Spark return NULL, this behavior also matches Presto. |
Thanks. Would you update PR description to clarify?
Sounds like there are differences in behavior for Presto and Spark. Is this so? Should there be some fork in the code then? |
velox/expression/CastExpr-inl.h
Outdated
auto sourceVector = input.as<SimpleVector<StringView>>(); | ||
auto castResultRawBuffer = | ||
castResult->asUnchecked<FlatVector<TOutput>>()->mutableRawValues(); | ||
const auto& toPrecisionScale = getDecimalPrecisionScale(*toType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getDecimalPrecisionScale returns a temp value, hence, it should be captured by value, not by reference.
01a454c
to
b2abad7
Compare
Hi @mbasmanova @majetideepak This PR has been updated. Could you help review again? Changes including 1) using Status instead of throwing exception 2) using cast hook to support different behaviors on white-spaces handling between Presto and Spark. |
@@ -236,7 +236,7 @@ class CastBaseTest : public FunctionBaseTest { | |||
testCast(fromType, toType, inputVector, expectedVector); | |||
} | |||
|
|||
template <typename TFrom, typename TTo> | |||
template <typename TFrom> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove typename TTo
as it is not in use. Do I need to do this refactor in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice change. It would be great to extract it into a separate PR to make reviewing easier and faster. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract this change to #8464. Thanks.
velox/type/DecimalUtil.h
Outdated
@@ -146,40 +147,40 @@ class DecimalUtil { | |||
} | |||
|
|||
template <typename TInput, typename TOutput> | |||
inline static std::optional<TOutput> rescaleWithRoundUp( | |||
inline static Status rescaleWithRoundUp( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is PR is a bit large and complicated. Would it be possible to extract this particular change into a separate PR to make review easier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract this change to #8465.
testThrow<StringView>( | ||
VARCHAR(), | ||
DECIMAL(38, 0), | ||
toStringViews({std::string(280, '9')}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using toStringViews here, maybe add a testThrow<std::string>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notice that StringView doesn't own the string, hence, it is invalid to create StringView from a temporary std::string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. Changed to use testThrow<std::string>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Rui, thank you for working on this change. I'll try to review it early next week. Please, ping me if you don't hear from me before Tue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rui-mo Looks good to me % small comments.
^^^^^^^^^^^^ | ||
|
||
Casting varchar to a decimal of given precision and scale is allowed. | ||
Leading and trailing white-spaces in input varchars are allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description is much shorter than the one in presto directory. Is this because the behavior is just like in Presto with a single exception of allowing leading and trailing spaces? It would be nice to clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the case, so I assume we don't need add duplicate description. Clarify in the description.
velox/expression/CastExpr-inl.h
Outdated
/// | 1.5 | 1 | 5 | nullopt | 1 | | ||
/// | -1.5 | 1 | 5 | nullopt | -1 | | ||
/// | 31.523e-2 | 31 | 523 | -2 | 1 | | ||
struct DecimalComponents { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, put this into a 'detail' namespace
same for other implementation-only code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved them to detail
namespace. Thanks.
velox/expression/CastExpr-inl.h
Outdated
break; | ||
} | ||
} | ||
out = std::string_view(s + start, pos - start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be cleaner to return std::string_view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks.
velox/expression/CastExpr-inl.h
Outdated
// The exponent part only contains sign. | ||
return std::nullopt; | ||
} | ||
// Make sure all chars after sign are digits, as folly does not prevent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as folly does not prevent leading and trailing whitespaces.
maybe, as folly::tryTo allows leading and trailing whitespaces.
velox/expression/CastExpr-inl.h
Outdated
} | ||
} | ||
const auto tryExp = | ||
folly::tryTo<int32_t>(folly::StringPiece(s + pos, size - pos)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we know there cannot be any errors, perhaps, use folly::try.
BTW, what happens if the number is too large (too many digits)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to use folly::to here.
what happens if the number is too large (too many digits)?
If it is too large to fit in a int128_t value, overflow occurs in parseHugeInt
method. If it can be converted to a int128_t value, but cannot be reprensented with given precision and scale, overflow occurs in rescaleWithRoundUp
method. Tests are at https://github.com/facebookincubator/velox/pull/5307/files#diff-f41a5cc1aa75bd9ae99c7ed93b7a6fe859f3796183fa45969032d5836b9f19e8R1830-R1888.
velox/expression/CastExpr-inl.h
Outdated
T& decimalValue) { | ||
const auto decimalComponentsOpt = parseDecimalComponents(s.data(), s.size()); | ||
if (!decimalComponentsOpt.has_value()) { | ||
return Status::UserError("Value is not a number."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps, include the value in the error message to help troubleshooting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The full error message is like below, which contains the value. Do we need to add value here?
Cannot cast VARCHAR '1.23 ' to DECIMAL(38, 0). Value is not a number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thank you for clarifying. Looks good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, how would a user understand why 1.23 is not a number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, it is because of the trailing white-space. Do we need to add the detailed reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to make error message as clear as possible to avoid unnecessary support calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Updated. The error message now contains status message from parseDecimalComponent.
velox/expression/CastExpr-inl.h
Outdated
|
||
/// Multiples the output by required power of 10, and adds the parsed value of | ||
/// input with the rescaled ouput. | ||
bool shiftAndAdd(std::string_view input, int128_t& out) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called only twice. First time out is zero. When out is zero there is no need to multiple by 10 ^ len(input).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. I removed this function, and combined the two calls as parseHugeInt
. The new function creates a huge int from decimal components.
e3d5d81
to
8b5bcd6
Compare
3ac4a7c
to
8acee32
Compare
@mbasmanova Above comments were fixed. Could you review again? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@rui-mo Would you rebase? |
@mbasmanova merged this pull request in 077fd73. |
Conbench analyzed the 1 benchmark run on commit There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details. |
Thank you all for your great help! |
@rui-mo Thank you for the contribution. It was quite a lot of work. Much appreciated. |
Many thanks for your kind and warm help!@rui-mo |
Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L822
Derives from Arrow function
DecimalFromString
.Arrow implementation: https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/decimal.cc#L637