Add CAST(varchar as decimal) #5307

jinchengchenghh · 2023-06-19T08:46:18Z

Spark implementation: https://github.com/apache/spark/blob/branch-3.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L822

Derives from Arrow function DecimalFromString.
Arrow implementation: https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/decimal.cc#L637

netlify · 2023-06-19T08:46:23Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`8acee32`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/65b074afe07ef60008b36070

jinchengchenghh · 2023-06-20T01:02:46Z

Could you help take a look? @Yuhta

mbasmanova

@majetideepak Deepak, would you help review this PR?

rui-mo · 2023-07-25T13:26:29Z

Can we add Spark's implementation to the PR description, and ensure all cases are covered by unit tests?

majetideepak

@jinchengchenghh Is this emulating the Java implementation? I see a lot of string manipulations.
Can we avoid them and work with offsets for the dot and exponent on the original string?

jinchengchenghh · 2023-07-27T00:51:16Z

No, Java uses the JDK API to convert to decimal.

  private def stringToJavaBigDecimal(str: UTF8String): JavaBigDecimal = {
    // According the benchmark test,  `s.toString.trim` is much faster than `s.trim.toString`.
    // Please refer to https://github.com/apache/spark/pull/26640
    new JavaBigDecimal(str.toString.trim)
  }

Where does the requirement to convert numbers with an exponent come from? @rui-mo

rui-mo · 2023-07-27T00:56:52Z

Where does the requirement to convert numbers with an exponent come from? @rui-mo

@jinchengchenghh They are from Spark unit tests.

jinchengchenghh · 2023-07-27T02:47:36Z

@jinchengchenghh Is this emulating the Java implementation? I see a lot of string manipulations. Can we avoid them and work with offsets for the dot and exponent on the original string?

The conversion of numbers with an exponent comes from @rui-mo. I will try to optimize it.

I found that PostgreSQL has an implementation that converts a varchar with an exponent to numeric: https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L4515. I'm not sure if we can write it like this.
Or should we implement it ourselves? @majetideepak

majetideepak · 2023-07-27T20:30:16Z

@jinchengchenghh This seems to be the main implementation https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/numeric.c#L6942
This is very C-style. I don't think we need to be this minimal, but it is a good start. I don't have an answer here.
Maybe you could look at other C++ open-source implementations like Impala or DuckDB as well.

jinchengchenghh · 2023-08-14T02:22:26Z

Updated to follow the Arrow implementation. Could you help review again? Thanks! @majetideepak

velox/expression/CastExpr-inl.h

velox/expression/tests/CastExprTest.cpp

velox/type/DecimalUtil.h

velox/expression/tests/CastExprTest.cpp

velox/expression/CastExpr-inl.h

jinchengchenghh · 2023-08-21T00:23:21Z

Do you have further comments? @majetideepak

majetideepak

Thanks, @jinchengchenghh

jinchengchenghh · 2023-08-22T01:16:45Z

Could you help merge it? Thanks! @mbasmanova

mbasmanova

@majetideepak @jinchengchenghh Have we verified that this behavior matches Presto?

jinchengchenghh · 2023-08-22T01:20:40Z

I have verified it in Presto. @mbasmanova

jinchengchenghh · 2023-08-22T01:22:18Z

Presto throws an exception when the input is illegal, while Spark returns NULL. The behavior here also matches Presto.

mbasmanova · 2023-08-22T01:43:27Z

I have verified it in Presto.

Thanks. Would you update PR description to clarify?

Presto throws an exception when the input is illegal, while Spark returns NULL. The behavior here also matches Presto.

Sounds like there are differences in behavior for Presto and Spark. Is this so? Should there be some fork in the code then?

mbasmanova · 2023-08-22T01:44:49Z

velox/expression/CastExpr-inl.h

+  auto sourceVector = input.as<SimpleVector<StringView>>();
+  auto castResultRawBuffer =
+      castResult->asUnchecked<FlatVector<TOutput>>()->mutableRawValues();
+  const auto& toPrecisionScale = getDecimalPrecisionScale(*toType);


getDecimalPrecisionScale returns a temp value, hence, it should be captured by value, not by reference.
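
A small illustration of the point above, using a hypothetical stand-in `getPrecisionScale` (not the Velox helper; its return type here is an assumption). Binding a const reference to a returned temporary is well-defined in C++ because the temporary's lifetime is extended, but capturing by value states the intent more plainly:

```cpp
#include <cstdint>
#include <utility>

// Hypothetical stand-in for a helper that returns {precision, scale} by value.
std::pair<uint8_t, uint8_t> getPrecisionScale() {
  return {10, 2};
}

// Number of digits available to the left of the decimal point.
int integerDigits() {
  // Capture the returned temporary by value; no reference is needed.
  const auto precisionScale = getPrecisionScale();
  return precisionScale.first - precisionScale.second;
}
```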

rui-mo · 2024-01-19T14:15:30Z

Hi @mbasmanova @majetideepak, this PR has been updated. Could you help review it again? The changes include 1) using Status instead of throwing exceptions, and 2) using a cast hook to support the different white-space handling in Presto and Spark.
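
A minimal sketch of the two ideas in this update, under stated assumptions: `Status`, `WhitespaceHook`, `StrictHook`, `TrimmingHook`, and `checkDigits` are hypothetical names for illustration and are not the Velox classes. It shows an engine-specific hook deciding how white-space is handled, and a failure being reported through a return value rather than an exception.

```cpp
#include <string>
#include <string_view>

// Hypothetical minimal status type; the real Status class has a richer API.
struct Status {
  bool ok;
  std::string message;
};

// Hypothetical hook: each engine decides how to pre-process the input.
struct WhitespaceHook {
  virtual ~WhitespaceHook() = default;
  virtual std::string_view prepare(std::string_view input) const = 0;
};

// Presto-like: the input is used as-is, so surrounding spaces fail later.
struct StrictHook : WhitespaceHook {
  std::string_view prepare(std::string_view input) const override {
    return input;
  }
};

// Spark-like: leading and trailing white-space is trimmed first.
struct TrimmingHook : WhitespaceHook {
  std::string_view prepare(std::string_view input) const override {
    const auto first = input.find_first_not_of(" \t\r\n");
    if (first == std::string_view::npos) {
      return {};
    }
    const auto last = input.find_last_not_of(" \t\r\n");
    return input.substr(first, last - first + 1);
  }
};

// Checks that the prepared input consists of digits only and reports
// failure through the returned Status instead of throwing.
Status checkDigits(std::string_view input, const WhitespaceHook& hook) {
  const std::string_view s = hook.prepare(input);
  if (s.empty()) {
    return {false, "Input is empty."};
  }
  for (const char c : s) {
    if (c < '0' || c > '9') {
      return {false, "Value is not a number."};
    }
  }
  return {true, ""};
}
```

With this shape, `checkDigits(" 12 ", StrictHook{})` fails while `checkDigits(" 12 ", TrimmingHook{})` succeeds, and the caller decides whether a failed status becomes an error or a NULL.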

rui-mo · 2024-01-19T14:14:48Z

velox/functions/prestosql/tests/CastBaseTest.h

@@ -236,7 +236,7 @@ class CastBaseTest : public FunctionBaseTest {
    testCast(fromType, toType, inputVector, expectedVector);
  }

-  template <typename TFrom, typename TTo>
+  template <typename TFrom>


Remove typename TTo as it is not in use. Do I need to do this refactor in a separate PR?

This is a nice change. It would be great to extract it into a separate PR to make reviewing easier and faster. Thanks.

Extracted this change to #8464. Thanks.

mbasmanova · 2024-01-19T18:41:40Z

velox/type/DecimalUtil.h

@@ -146,40 +147,40 @@ class DecimalUtil {
  }

  template <typename TInput, typename TOutput>
-  inline static std::optional<TOutput> rescaleWithRoundUp(
+  inline static Status rescaleWithRoundUp(


This PR is a bit large and complicated. Would it be possible to extract this particular change into a separate PR to make review easier?

Extracted this change to #8465.

mbasmanova · 2024-01-19T18:44:18Z

velox/expression/tests/CastExprTest.cpp

+  testThrow<StringView>(
+      VARCHAR(),
+      DECIMAL(38, 0),
+      toStringViews({std::string(280, '9')}),


Instead of using toStringViews here, maybe add a testThrow<std::string>.

Notice that StringView doesn't own the string, hence, it is invalid to create StringView from a temporary std::string.

Thanks for the comment. Changed to use testThrow<std::string>.
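
The ownership issue can be sketched with `std::string_view`, used here only as a stand-in for Velox's `StringView` (the helper `toViews` is hypothetical). A view created from a temporary `std::string` dangles once the temporary is destroyed, so the owning strings are kept alive by the caller:

```cpp
#include <string>
#include <string_view>
#include <vector>

// Builds views over strings owned by the caller. The views are valid only
// while `owners` is alive and unmodified.
std::vector<std::string_view> toViews(const std::vector<std::string>& owners) {
  std::vector<std::string_view> views;
  views.reserve(owners.size());
  for (const auto& s : owners) {
    views.emplace_back(s.data(), s.size());
  }
  return views;
}

// Anti-pattern, shown only as a comment:
//   std::string_view v = std::string(280, '9');  // v dangles after this line
```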

mbasmanova

@rui-mo Rui, thank you for working on this change. I'll try to review it early next week. Please, ping me if you don't hear from me before Tue.

mbasmanova

@rui-mo Looks good to me % small comments.

mbasmanova · 2024-01-22T15:34:10Z

velox/docs/functions/spark/conversion.rst

+^^^^^^^^^^^^
+
+Casting varchar to a decimal of given precision and scale is allowed.
+Leading and trailing white-spaces in input varchars are allowed.


This description is much shorter than the one in presto directory. Is this because the behavior is just like in Presto with a single exception of allowing leading and trailing spaces? It would be nice to clarify.

That's the case, so I assume we don't need to add a duplicate description. Clarified in the description.

mbasmanova · 2024-01-22T15:35:22Z

velox/expression/CastExpr-inl.h

+/// | 1.5 | 1 | 5 | nullopt | 1 |
+/// | -1.5 | 1 | 5 | nullopt | -1 |
+/// | 31.523e-2 | 31 | 523 | -2 | 1 |
+struct DecimalComponents {


perhaps, put this into a 'detail' namespace

same for other implementation-only code

Moved them to detail namespace. Thanks.
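
For readers following the table in the doc comment above, here is a rough sketch of splitting an input into those components by offsets on the original string, without copying. The field names, `extractDigits`, and the exact acceptance rules are assumptions for illustration and may differ from the code in this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical mirror of the struct in the diff; field names are assumed.
struct DecimalComponents {
  std::string_view wholeDigits;
  std::string_view fractionalDigits;
  std::optional<int32_t> exponent;
  int8_t sign = 1;
};

// Returns the run of ASCII digits starting at pos and advances pos past it.
std::string_view extractDigits(std::string_view s, std::size_t& pos) {
  const std::size_t start = pos;
  while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') {
    ++pos;
  }
  return s.substr(start, pos - start);
}

std::optional<DecimalComponents> parseDecimalComponents(std::string_view s) {
  if (s.empty()) {
    return std::nullopt;
  }
  DecimalComponents out;
  std::size_t pos = 0;
  if (s[pos] == '+' || s[pos] == '-') {
    out.sign = (s[pos] == '-') ? -1 : 1;
    ++pos;
  }
  out.wholeDigits = extractDigits(s, pos);
  if (pos < s.size() && s[pos] == '.') {
    ++pos;
    out.fractionalDigits = extractDigits(s, pos);
  }
  if (out.wholeDigits.empty() && out.fractionalDigits.empty()) {
    return std::nullopt; // No digits at all.
  }
  if (pos == s.size()) {
    return out; // No exponent part.
  }
  if (s[pos] != 'e' && s[pos] != 'E') {
    return std::nullopt; // Unexpected character, e.g. trailing white-space.
  }
  ++pos;
  bool negativeExponent = false;
  if (pos < s.size() && (s[pos] == '+' || s[pos] == '-')) {
    negativeExponent = (s[pos] == '-');
    ++pos;
  }
  const std::string_view expDigits = extractDigits(s, pos);
  // Reject a missing exponent, trailing characters, or more than 9 digits
  // (9 digits always fit in int32_t).
  if (expDigits.empty() || pos != s.size() || expDigits.size() > 9) {
    return std::nullopt;
  }
  int32_t exponent = 0;
  for (const char c : expDigits) {
    exponent = exponent * 10 + (c - '0');
  }
  out.exponent = negativeExponent ? -exponent : exponent;
  return out;
}
```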

mbasmanova · 2024-01-22T15:36:42Z

velox/expression/CastExpr-inl.h

+      break;
+    }
+  }
+  out = std::string_view(s + start, pos - start);


Would it be cleaner to return std::string_view?

Updated, thanks.

mbasmanova · 2024-01-22T15:38:50Z

velox/expression/CastExpr-inl.h

+      // The exponent part only contains sign.
+      return std::nullopt;
+    }
+    // Make sure all chars after sign are digits, as folly does not prevent


as folly does not prevent leading and trailing whitespaces.

maybe, as folly::tryTo allows leading and trailing whitespaces.

mbasmanova · 2024-01-22T15:40:00Z

velox/expression/CastExpr-inl.h

+      }
+    }
+    const auto tryExp =
+        folly::tryTo<int32_t>(folly::StringPiece(s + pos, size - pos));


Since we know there cannot be any errors, perhaps, use folly::to.

BTW, what happens if the number is too large (too many digits)?

Changed to use folly::to here.

what happens if the number is too large (too many digits)?

If it is too large to fit in an int128_t value, overflow occurs in the parseHugeInt method. If it can be converted to an int128_t value but cannot be represented with the given precision and scale, overflow occurs in the rescaleWithRoundUp method. Tests are at https://github.com/facebookincubator/velox/pull/5307/files#diff-f41a5cc1aa75bd9ae99c7ed93b7a6fe859f3796183fa45969032d5836b9f19e8R1830-R1888.
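
The two overflow stages described above can be sketched as follows. This is an illustration only: `parseDigits` and `fitsPrecision` are hypothetical names, `__int128` is a GCC/Clang extension, and the second stage here only checks bounds, whereas the real rescaling also rounds.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

using int128_t = __int128; // Compiler extension (GCC/Clang), assumed here.

// Stage 1: accumulate digits, returning nullopt if the value would not fit
// in a signed 128-bit integer. An empty input yields 0.
std::optional<int128_t> parseDigits(std::string_view digits) {
  constexpr int128_t kMax =
      static_cast<int128_t>((static_cast<unsigned __int128>(1) << 127) - 1);
  int128_t value = 0;
  for (const char c : digits) {
    if (c < '0' || c > '9') {
      return std::nullopt;
    }
    const int digit = c - '0';
    // value * 10 + digit would exceed kMax.
    if (value > (kMax - digit) / 10) {
      return std::nullopt;
    }
    value = value * 10 + digit;
  }
  return value;
}

// Stage 2: check that the unscaled value has at most `precision` digits.
// Assumes 1 <= precision <= 38, so 10^precision fits in int128_t.
bool fitsPrecision(int128_t value, int precision) {
  int128_t bound = 1;
  for (int i = 0; i < precision; ++i) {
    bound *= 10;
  }
  return value > -bound && value < bound;
}
```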

mbasmanova · 2024-01-22T15:42:14Z

velox/expression/CastExpr-inl.h

+    T& decimalValue) {
+  const auto decimalComponentsOpt = parseDecimalComponents(s.data(), s.size());
+  if (!decimalComponentsOpt.has_value()) {
+    return Status::UserError("Value is not a number.");


perhaps, include the value in the error message to help troubleshooting.

The full error message is like below, which contains the value. Do we need to add value here?

Cannot cast VARCHAR '1.23 ' to DECIMAL(38, 0). Value is not a number.

Got it. Thank you for clarifying. Looks good.

Actually, how would a user understand why 1.23 is not a number?

For this case, it is because of the trailing white-space. Do we need to add the detailed reason?

It would be nice to make error message as clear as possible to avoid unnecessary support calls.

Thanks. Updated. The error message now contains the status message from parseDecimalComponents.

mbasmanova · 2024-01-22T15:50:37Z

velox/expression/CastExpr-inl.h

+
+/// Multiplies the output by the required power of 10, and adds the parsed value of
+/// input to the rescaled output.
+bool shiftAndAdd(std::string_view input, int128_t& out) {


This function is called only twice. The first time, out is zero. When out is zero, there is no need to multiply by 10 ^ len(input).

Thanks for the pointer. I removed this function and combined the two calls into parseHugeInt. The new function creates a huge int from the decimal components.
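
One way to fold the whole and fractional digit runs into a single unscaled integer without computing 10 ^ len(input) is to accumulate digit by digit. The helpers below (`appendDigits`, `combineDigits`) are hypothetical and not necessarily how parseHugeInt is written; overflow checks are omitted to keep the sketch short, and `__int128` is a GCC/Clang extension.

```cpp
#include <cstdint>
#include <string_view>

using int128_t = __int128; // Compiler extension (GCC/Clang), assumed here.

// Appends each digit to `out`. Assumes `digits` contains only '0'..'9'.
void appendDigits(std::string_view digits, int128_t& out) {
  for (const char c : digits) {
    out = out * 10 + (c - '0');
  }
}

// Combines "31" and "523" into the unscaled value 31523. The scale is
// tracked separately as the length of the fractional part.
int128_t combineDigits(std::string_view whole, std::string_view fractional) {
  int128_t out = 0;
  appendDigits(whole, out);
  appendDigits(fractional, out);
  return out;
}
```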

rui-mo · 2024-01-24T05:13:39Z

@mbasmanova The above comments have been addressed. Could you review again? Thanks!

mbasmanova

Thanks.

facebook-github-bot · 2024-01-24T12:06:43Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

mbasmanova · 2024-01-24T16:05:02Z

@rui-mo Would you rebase?

facebook-github-bot · 2024-01-24T21:10:47Z

@mbasmanova merged this pull request in 077fd73.

conbench-facebook · 2024-01-24T21:38:54Z

Conbench analyzed the 1 benchmark run on commit 077fd735.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

rui-mo · 2024-01-25T00:48:42Z

Thank you all for your great help!

mbasmanova · 2024-01-25T05:45:23Z

@rui-mo Thank you for the contribution. It was quite a lot of work. Much appreciated.

jinchengchenghh · 2024-01-26T09:22:41Z

Many thanks for your kind and warm help! @rui-mo

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 19, 2023

Yuhta self-requested a review June 20, 2023 14:12

jinchengchenghh force-pushed the cast branch 2 times, most recently from bf1821f to 8587a2b Compare June 30, 2023 02:53

jinchengchenghh mentioned this pull request Jul 24, 2023

Add CAST(double as decimal) #5767

Closed

mbasmanova reviewed Jul 24, 2023

mbasmanova requested review from majetideepak and bikramSingh91 July 24, 2023 12:22

majetideepak reviewed Jul 26, 2023

jinchengchenghh force-pushed the cast branch from 8587a2b to aa96fbe Compare August 3, 2023 06:26

jinchengchenghh force-pushed the cast branch from aa96fbe to 287a8a5 Compare August 11, 2023 06:46

jinchengchenghh changed the title ~~Add support to cast varchar to decimal~~ Add CAST(varchar as decimal) Aug 11, 2023

jinchengchenghh force-pushed the cast branch from d128bf1 to f056569 Compare August 11, 2023 08:00

majetideepak reviewed Aug 14, 2023

majetideepak approved these changes Aug 21, 2023

mbasmanova reviewed Aug 22, 2023

rui-mo force-pushed the cast branch 3 times, most recently from 01a454c to b2abad7 Compare January 19, 2024 13:11

rui-mo reviewed Jan 19, 2024

mbasmanova reviewed Jan 19, 2024

rui-mo force-pushed the cast branch from b2abad7 to 92e1c1c Compare January 22, 2024 04:41

mbasmanova approved these changes Jan 22, 2024

rui-mo force-pushed the cast branch 2 times, most recently from e3d5d81 to 8b5bcd6 Compare January 23, 2024 13:09

jinchengchenghh and others added 2 commits January 24, 2024 09:28

Support CAST(varchar as decimal)

4a9cecd

Fix comments

8acee32

rui-mo force-pushed the cast branch 2 times, most recently from 3ac4a7c to 8acee32 Compare January 24, 2024 02:23

mbasmanova approved these changes Jan 24, 2024

Yuhta approved these changes Jan 24, 2024

facebook-github-bot closed this in 077fd73 Jan 24, 2024

facebook-github-bot added the Merged label Jan 24, 2024

FelixYBW mentioned this pull request Jul 25, 2024

Spark sql avg agg function support decimal #6020

Closed
