Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CAST(varchar as decimal) #5307

Closed
wants to merge 2 commits into from

Conversation

jinchengchenghh
Copy link
Contributor

@jinchengchenghh jinchengchenghh commented Jun 19, 2023

@netlify
Copy link

netlify bot commented Jun 19, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8acee32
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/65b074afe07ef60008b36070

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 19, 2023
@jinchengchenghh
Copy link
Contributor Author

Could you help take a look? @Yuhta

@Yuhta Yuhta self-requested a review June 20, 2023 14:12
@jinchengchenghh jinchengchenghh force-pushed the cast branch 2 times, most recently from bf1821f to 8587a2b Compare June 30, 2023 02:53
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak Deepak, would you help review this PR?

@rui-mo
Copy link
Collaborator

rui-mo commented Jul 25, 2023

Can we add Spark's implementation to the PR description, and ensure all cases are considered in unit test?

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh Is this emulating the Java implementation? I see a lot of string manipulations.
Can we avoid them and work with offsets for the dot and exponent on the original string?

@jinchengchenghh
Copy link
Contributor Author

No, java uses JDK API to convert to decimal.

  private def stringToJavaBigDecimal(str: UTF8String): JavaBigDecimal = {
    // According the benchmark test,  `s.toString.trim` is much faster than `s.trim.toString`.
    // Please refer to https://github.com/apache/spark/pull/26640
    new JavaBigDecimal(str.toString.trim)
  }

Where does the convert number with exponent come from? @rui-mo

@rui-mo
Copy link
Collaborator

rui-mo commented Jul 27, 2023

Where does the convert number with exponent come from? @rui-mo

@jinchengchenghh They are from Spark unit tests.

@jinchengchenghh
Copy link
Contributor Author

jinchengchenghh commented Jul 27, 2023

@jinchengchenghh Is this emulating the Java implementation? I see a lot of string manipulations. Can we avoid them and work with offsets for the dot and exponent on the original string?

Convert number with exponent comes from @rui-mo , I will try to optimize it.

I found postgresql has implementation to convert exponent varchar to numeric, https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L4515, I'm not sure if we can write as this.
Or we should implement by ourselves? @majetideepak

@majetideepak
Copy link
Collaborator

majetideepak commented Jul 27, 2023

@jinchengchenghh This seems to be the main implementation https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L6942
This is very C-style ish. I don't think we need to be this minimal but is a good start. I don't have an answer here.
Maybe you look at other C++ open-source implementations like Impala or DuckDB as well.

@jinchengchenghh jinchengchenghh changed the title Add support to cast varchar to decimal Add CAST(varchar as decimal) Aug 11, 2023
@jinchengchenghh
Copy link
Contributor Author

Update to Arrow implementation, can you help to review again? Thanks! @majetideepak

@jinchengchenghh
Copy link
Contributor Author

Do you have further comments? @majetideepak

Copy link
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jinchengchenghh
Copy link
Contributor Author

Could you help merge it? Thanks! @mbasmanova

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@majetideepak @jinchengchenghh Have we verified that this behavior matches Presto?

@jinchengchenghh
Copy link
Contributor Author

I have verified it in Presto. @mbasmanova

@jinchengchenghh
Copy link
Contributor Author

Presto throws exception when input is illegal while Spark return NULL, this behavior also matches Presto.

@mbasmanova
Copy link
Contributor

I have verified it in Presto.

Thanks. Would you update PR description to clarify?

Presto throws exception when input is illegal while Spark return NULL, this behavior also matches Presto.

Sounds like there are differences in behavior for Presto and Spark. Is this so? Should there be some fork in the code then?

auto sourceVector = input.as<SimpleVector<StringView>>();
auto castResultRawBuffer =
castResult->asUnchecked<FlatVector<TOutput>>()->mutableRawValues();
const auto& toPrecisionScale = getDecimalPrecisionScale(*toType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getDecimalPrecisionScale returns a temp value, hence, it should be captured by value, not by reference.

@rui-mo rui-mo force-pushed the cast branch 3 times, most recently from 01a454c to b2abad7 Compare January 19, 2024 13:11
@rui-mo
Copy link
Collaborator

rui-mo commented Jan 19, 2024

Hi @mbasmanova @majetideepak This PR has been updated. Could you help review again? Changes including 1) using Status instead of throwing exception 2) using cast hook to support different behaviors on white-spaces handling between Presto and Spark.

@@ -236,7 +236,7 @@ class CastBaseTest : public FunctionBaseTest {
testCast(fromType, toType, inputVector, expectedVector);
}

template <typename TFrom, typename TTo>
template <typename TFrom>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove typename TTo as it is not in use. Do I need to do this refactor in a separate PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice change. It would be great to extract it into a separate PR to make reviewing easier and faster. Thanks.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract this change to #8464. Thanks.

@@ -146,40 +147,40 @@ class DecimalUtil {
}

template <typename TInput, typename TOutput>
inline static std::optional<TOutput> rescaleWithRoundUp(
inline static Status rescaleWithRoundUp(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is PR is a bit large and complicated. Would it be possible to extract this particular change into a separate PR to make review easier?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract this change to #8465.

testThrow<StringView>(
VARCHAR(),
DECIMAL(38, 0),
toStringViews({std::string(280, '9')}),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using toStringViews here, maybe add a testThrow<std::string>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that StringView doesn't own the string, hence, it is invalid to create StringView from a temporary std::string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. Changed to use testThrow<std::string>.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Rui, thank you for working on this change. I'll try to review it early next week. Please, ping me if you don't hear from me before Tue.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Looks good to me % small comments.

^^^^^^^^^^^^

Casting varchar to a decimal of given precision and scale is allowed.
Leading and trailing white-spaces in input varchars are allowed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description is much shorter than the one in presto directory. Is this because the behavior is just like in Presto with a single exception of allowing leading and trailing spaces? It would be nice to clarify.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the case, so I assume we don't need add duplicate description. Clarify in the description.

/// | 1.5 | 1 | 5 | nullopt | 1 |
/// | -1.5 | 1 | 5 | nullopt | -1 |
/// | 31.523e-2 | 31 | 523 | -2 | 1 |
struct DecimalComponents {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, put this into a 'detail' namespace

same for other implementation-only code

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved them to detail namespace. Thanks.

break;
}
}
out = std::string_view(s + start, pos - start);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner to return std::string_view?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks.

// The exponent part only contains sign.
return std::nullopt;
}
// Make sure all chars after sign are digits, as folly does not prevent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as folly does not prevent leading and trailing whitespaces.

maybe, as folly::tryTo allows leading and trailing whitespaces.

}
}
const auto tryExp =
folly::tryTo<int32_t>(folly::StringPiece(s + pos, size - pos));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we know there cannot be any errors, perhaps, use folly::try.

BTW, what happens if the number is too large (too many digits)?

Copy link
Collaborator

@rui-mo rui-mo Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use folly::to here.

what happens if the number is too large (too many digits)?

If it is too large to fit in a int128_t value, overflow occurs in parseHugeInt method. If it can be converted to a int128_t value, but cannot be reprensented with given precision and scale, overflow occurs in rescaleWithRoundUp method. Tests are at https://github.com/facebookincubator/velox/pull/5307/files#diff-f41a5cc1aa75bd9ae99c7ed93b7a6fe859f3796183fa45969032d5836b9f19e8R1830-R1888.

T& decimalValue) {
const auto decimalComponentsOpt = parseDecimalComponents(s.data(), s.size());
if (!decimalComponentsOpt.has_value()) {
return Status::UserError("Value is not a number.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps, include the value in the error message to help troubleshooting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full error message is like below, which contains the value. Do we need to add value here?

Cannot cast VARCHAR '1.23 ' to DECIMAL(38, 0). Value is not a number.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you for clarifying. Looks good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, how would a user understand why 1.23 is not a number?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this case, it is because of the trailing white-space. Do we need to add the detailed reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to make error message as clear as possible to avoid unnecessary support calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated. The error message now contains status message from parseDecimalComponent.


/// Multiples the output by required power of 10, and adds the parsed value of
/// input with the rescaled ouput.
bool shiftAndAdd(std::string_view input, int128_t& out) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called only twice. First time out is zero. When out is zero there is no need to multiple by 10 ^ len(input).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. I removed this function, and combined the two calls as parseHugeInt. The new function creates a huge int from decimal components.

@rui-mo rui-mo force-pushed the cast branch 2 times, most recently from e3d5d81 to 8b5bcd6 Compare January 23, 2024 13:09
@rui-mo rui-mo force-pushed the cast branch 2 times, most recently from 3ac4a7c to 8acee32 Compare January 24, 2024 02:23
@rui-mo
Copy link
Collaborator

rui-mo commented Jan 24, 2024

@mbasmanova Above comments were fixed. Could you review again? Thanks!

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor

@rui-mo Would you rebase?

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in 077fd73.

Copy link

Conbench analyzed the 1 benchmark run on commit 077fd735.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

@rui-mo
Copy link
Collaborator

rui-mo commented Jan 25, 2024

Thank you all for your great help!

@mbasmanova
Copy link
Contributor

@rui-mo Thank you for the contribution. It was quite a lot of work. Much appreciated.

@jinchengchenghh
Copy link
Contributor Author

Many thanks for your kind and warm help!@rui-mo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants