Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-38225][SQL] Adjust input format of function to_binary #35533

Closed
wants to merge 7 commits into from

Conversation

xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Feb 15, 2022

What changes were proposed in this pull request?

Adjust input format of function to_binary:

  • gracefully fail for the non-string format parameter
  • remove arguable base2 format support

Why are the changes needed?

Currently, function to_binary doesn't deal with the non-string format parameter properly.
For example, spark.sql("select to_binary('abc', 1)") raises casting error, rather than hint that encoding format is unsupported.

In addition, the current result of the base2 format is incorrect compared to other DBMSes: ours use Cast(expr, BinaryType) which utilizes utf-8 format, whereas DBMSes (e.g. BigQuery)' convert a BASE2-encoded string to bytes.

So we remove the base2 support now and will follow up with that later.

Does this PR introduce any user-facing change?

Yes.

  • Better error messages for non-string format parameter. For example:

From:

scala> spark.sql("select to_binary('abc', 1)")
org.apache.spark.sql.AnalysisException: class java.lang.Integer cannot be cast to class org.apache.spark.unsafe.types.UTF8String (java.lang.Integer is in module java.base of loader 'bootstrap'; org.apache.spark.unsafe.types.UTF8String is in unnamed module of loader 'app'); line 1 pos 7

To:

scala> spark.sql("select to_binary('abc', 1)")
org.apache.spark.sql.AnalysisException: cannot resolve 'to_binary('abc', 1)' due to data type mismatch: Unsupported encoding format: Some(1). The format has to be a case-insensitive string literal of 'hex', 'utf-8', 'base2', or 'base64'; line 1 pos 7;
  • Removed base2 format support
scala> spark.sql("select to_binary('abc', 'base2')").show()
org.apache.spark.sql.AnalysisException: cannot resolve 'to_binary('abc', 'base2')' due to data type mismatch: Unsupported encoding format: Some(base2). The format has to be a case-insensitive string literal of 'hex', 'utf-8', or 'base64'; line 1 pos 7;

How was this patch tested?

Unit test.

@github-actions github-actions bot added the SQL label Feb 15, 2022
@@ -2565,15 +2565,15 @@ case class ToBinary(expr: Expression, format: Option[Expression], child: Express
case lit if lit.foldable =>
val value = lit.eval()
if (value == null) Literal(null, BinaryType)
else {
else if (value.isInstanceOf[UTF8String]) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for value to be an instance of java.lang.String?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it can only be UTF8String

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems we can remove this check now, as we have checked the data type ahead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense! Removed.

-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally failed query.

-- !query schema
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentionally failed query.

@xinrong-meng xinrong-meng changed the title [WIP] Complete input validation of function to_binary [SPARK-38225][SQL] Complete input validation of function to_binary Feb 16, 2022
@xinrong-meng xinrong-meng changed the title [SPARK-38225][SQL] Complete input validation of function to_binary [SPARK-38225][SQL] Adjust input format of function to_binary Feb 16, 2022
@xinrong-meng xinrong-meng marked this pull request as ready for review February 16, 2022 08:34
@xinrong-meng
Copy link
Member Author

Would you please review it when you are free? Thanks!
@HyukjinKwon @cloud-fan @entong @gengliangwang

@@ -2565,15 +2565,14 @@ case class ToBinary(expr: Expression, format: Option[Expression], child: Express
case lit if lit.foldable =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the string type check here, as to_binary(str_col, cast(null as int)) should fail instead of returning null.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

May I confirm that the case below is acceptable?

spark-sql> select to_binary(cast(null as int), 'utf-8');
NULL

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, the type coercion can cast any atomic type to string type implicitly.

@@ -2562,18 +2562,17 @@ case class ToBinary(expr: Expression, format: Option[Expression], child: Express

def this(expr: Expression, format: Expression) = this(expr, Option(format),
format match {
case lit if lit.foldable =>
case lit if (lit.foldable && Seq(StringType, NullType).contains(lit.dataType)) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, do other systems like snowflake support to_binary('abc', null)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snowflake doesn't specify that in their document, but BigQuery and Teradata support to_binary('abc', null):

Expressions, functions, and operators | BigQuery | Google Cloud "If an operand is NULL, the result is NULL, with the exception of the IS operator."

Teradata Online Documentation | Quick access to technical manuals "If either instring or in_encoding is NULL, the result is NULL."

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9fd9830 Feb 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants