Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8234][SQL] misc function: md5 #6779

Closed
wants to merge 9 commits into from
Closed

Conversation

qiansl127
Copy link
Contributor

No description provided.

@chenghao-intel
Copy link
Contributor

@rxin @liancheng can you trigger the unit test?

@rxin
Copy link
Contributor

rxin commented Jun 12, 2015

Jenkins, ok to test.

override def dataType: DataType = StringType

override def expectedChildTypes: Seq[DataType] =
if (child.dataType == BinaryType) Seq(BinaryType) else Seq(StringType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't make sense -- since it is using child's data type to check data type... you should remove ExpectsInputTypes and explicitly define the type check

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34756 has finished for PR 6779 at commit b7efa2a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)
    • case class SetInFilter[T <: Comparable[T]](

val s1 = 'a.string.at(0)
val s2 = 'a.binary.at(0)
checkEvaluation(Md5(s1), "902fbdd2b1df0c4f70b4a5d23525e932", create_row("ABC"))
checkEvaluation(Md5(s2), "6ac1e56bc78f031059be7be854522c4c",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test would be simpler/clearer if you just pass literals to the expression. There are other tests that make sure that we can extract values from rows.

Additionally It would be good to test some edge cases: an empty string, null, etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for edge cases.

@qiansl127 you can write code like

checkEvaluation(Md5(Literal("ABC"), "902fbdd2b1df0c4f70b4a5d23525e932")
checkEvaluation(Md5(Literal.create(null, BinaryType), "...")
checkEvaluation(Md5(Literal.create(null, StringType), "...")

@SparkQA
Copy link

SparkQA commented Jun 12, 2015

Test build #34765 has finished for PR 6779 at commit 4d66fd1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression) extends UnaryExpression
    • case class SetInFilter[T <: Comparable[T]](

//////////////////////////////////////////////////////////////////////////////////////////////

/**
* Calculates the MD5 digest and returns the value as a 32 character hex string.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I set a bad example in strlen, we'd better insert an empty line after the doc and right before the @ group

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #34907 has finished for PR 6779 at commit d77a806.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #34908 has finished for PR 6779 at commit efa8b93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression) extends UnaryExpression

@qiansl127
Copy link
Contributor Author

@rxin @chenghao-intel

@SparkQA
Copy link

SparkQA commented Jun 15, 2015

Test build #34922 has finished for PR 6779 at commit d9a28e4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)

@SparkQA
Copy link

SparkQA commented Jun 16, 2015

Test build #34964 has finished for PR 6779 at commit c75bb09.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)

@qiansl127
Copy link
Contributor Author

@rxin @chenghao-intel Any thing need to be improved?

@adrian-wang
Copy link
Contributor

add a test using sql(), and maybe change the filename of misc.scala to miscFunctions.scala

LGTM otherwise.


override def expectedChildTypes: Seq[DataType] = Seq(BinaryType)

override def checkInputDataTypes(): TypeCheckResult =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This override is not necessary now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean we can remove this function overriding, as we implemented the expectedChildTypes already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please see doc for ExpectsInputTypes. We don't need to do type check for it.

@SparkQA
Copy link

SparkQA commented Jun 17, 2015

Test build #35013 has finished for PR 6779 at commit da60eb3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)

@chenghao-intel
Copy link
Contributor

LGTM, cc @rxin

}
}

override def toString: String = s"MD5($child)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think you need the toString here ... since case class already handles it.

@rxin
Copy link
Contributor

rxin commented Jun 18, 2015

Looks good - but please add a codegen version of this.

override def genCode(ctx: CodeGenContext, ev: GeneratedExpressionCode): String = {
defineCodeGen(ctx, ev, c =>
s"""org.apache.spark.unsafe.types.UTF8String.fromString
|(org.apache.commons.codec.digest.DigestUtils.md5Hex($c))""".stripMargin)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a new line here? maybe just

"abcd" +
  "efgh"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably not a big problem, as we don't want to make the generated code more than 100 characters per line, right? Or @qiansl127 can update that in the follow up PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have called stripMargin, so it will still result in one line.
Anyway I think the current code is acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the generated code can be as long as we want.

in this case indenting this way is really weird. we should indent it this way

"org.apache.spark.unsafe.types.UTF8String.fromString" +
  s"(org.apache.commons.codec.digest.DigestUtils.md5Hex($c))"

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #35210 has finished for PR 6779 at commit 04bd27b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)

@chenghao-intel
Copy link
Contributor

@rxin I think this is ready to be merged.

@rxin
Copy link
Contributor

rxin commented Jun 19, 2015

Can you fix the indent? LGTM other than that.

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #35228 has finished for PR 6779 at commit 11fcdb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Md5(child: Expression)

@qiansl127
Copy link
Contributor Author

@rxin Done!

@rxin
Copy link
Contributor

rxin commented Jun 19, 2015

Thanks. Merging.

@asfgit asfgit closed this in 0c32fc1 Jun 19, 2015
@qiansl127 qiansl127 deleted the MD5 branch June 30, 2015 02:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants