Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32762][SQL][TEST] Enhance the verification of ExpressionsSchemaSuite to sql-expression-schema.md #29608

Closed
wants to merge 7 commits into from

Conversation

LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Sep 1, 2020

What changes were proposed in this pull request?

sql-expression-schema.md automatically generated by ExpressionsSchemaSuite, but only expressions entries are checked in ExpressionsSchemaSuite. So if we manually modify the contents of the file,  ExpressionsSchemaSuite does not necessarily guarantee the correctness of the it some times. For example, Spark-24884 added regexp_extract_all expression support, and manually modify the sql-expression-schema.md but not change the content of Number of queries cause file content inconsistency.

Some additional checks have been added to ExpressionsSchemaSuite to improve the correctness guarantee of sql-expression-schema.md as follow:

  • Number of queries should equals size of expressions entries in sql-expression-schema.md

  • Number of expressions that missing example should equals size of Expressions missing examples in sql-expression-schema.md

  • MissExamples from case should same as expectedMissingExamples from sql-expression-schema.md

Why are the changes needed?

Ensure the correctness of sql-expression-schema.md content.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Enhanced ExpressionsSchemaSuite

@maropu
Copy link
Member

maropu commented Sep 1, 2020

ok to test

@SparkQA
Copy link

SparkQA commented Sep 1, 2020

Test build #128142 has finished for PR 29608 at commit 6d36603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable

@@ -152,7 +152,7 @@ class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {

val outputSize = outputs.size
val headerSize = header.size
val expectedOutputs: Seq[QueryOutput] = {
val (expectedMissExamples, expectedOutputs): (Array[String], Seq[QueryOutput]) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No big deal, but the types can just be added to the two tuple elements, instead of declaring them separately as a type for the whole tuple

Copy link
Contributor Author

@LuciferYang LuciferYang Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First changed to val (expectedMissingExamples: Array[String], expectedOutputs: Seq[QueryOutput]), but I feel the type declaration here is redundant, so changed to val (expectedMissingExamples, expectedOutputs). Is this acceptable?

@@ -161,14 +161,28 @@ class ExpressionsSchemaSuite extends QueryTest with SharedSparkSession {
s"Expected $expectedSize blocks in result file but got " +
s"${outputSize + headerSize}. Try regenerate the result files.")

Seq.tabulate(outputSize) { i =>
val numberOfQueries = lines(2).split(":")(1).trim.toInt
val numberOfMissExample = lines(3).split(":")(1).trim.toInt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miss -> missed or missing, in most cases in this change

Copy link
Contributor Author

@LuciferYang LuciferYang Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ~

// Ensure consistency of the result file.
assert(numberOfQueries == expectedOutputs.size,
s"outputs size: ${expectedOutputs.size} not same as numberOfQueries: $numberOfQueries " +
"record in result file. Try regenerate the result files.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regenerate -> regenerating

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok ~

Copy link
Contributor Author

@LuciferYang LuciferYang Sep 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ~

@LuciferYang
Copy link
Contributor Author

LuciferYang commented Sep 2, 2020

@srowen 5c8e29e and bf0ceea the commets

@SparkQA
Copy link

SparkQA commented Sep 2, 2020

Test build #128176 has finished for PR 29608 at commit 5c8e29e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 2, 2020

Test build #128178 has finished for PR 29608 at commit bf0ceea.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks okay cc: @cloud-fan @beliefer


Seq.tabulate(outputSize) { i =>
val numberOfQueries = lines(2).split(":")(1).trim.toInt
val numberOfMissingExample = lines(3).split(":")(1).trim.toInt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Example -> Examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ~

@LuciferYang
Copy link
Contributor Author

The failed check cause by Install R linter dependencies and SparkR

val segments = lines(i + headerSize).split('|')
QueryOutput(
className = segments(1).trim,
funcName = segments(2).trim,
sql = segments(3).trim,
schema = segments(4).trim)
}

// Ensure consistency of the result file.
assert(numberOfQueries == expectedOutputs.size,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think should put these assert on line 163

Copy link
Contributor

@beliefer beliefer Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberOfQueries == outputSize
Ah, I got it. But if a new function here, expectedOutputs.size must not equal to numberOfQueries.
expectedOutputs.size == outputSize in fact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beliefer This assert place here to verify the consistency of the file contents to avoid inconsistency caused by manual modification, and I think line 189 assert expectedOutputs.size == outputSize achieves the same goal as numberOfQueries == outputSize. Is this acceptable?

Copy link
Contributor

@beliefer beliefer Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try to mock a new function with comments and test this suite.

Copy link
Contributor Author

@LuciferYang LuciferYang Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try to mock a new function with comments and test this suite.

You are right, Spark-24884 already trigger this problem cause by incomplete manual modification.

With the new function add scene, I think the right way is run this case with SPARK_GENERATE_GOLDEN_FILES = 1 to automatically regenerate the correct sql-expression-schema.md becasuse sql-expression-schema.md header said Automatically generated by ExpressionsSchemaSuite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Therefore, if the file is manually modified instead of automatically generated, I think the assertion failure caused by incorrect modification should be expected.

Do we need to tell users clearly with Try regenerating the result files with sys env SPARK_GENERATE_GOLDEN_FILES = 1?

Copy link
Contributor

@beliefer beliefer Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it. Thanks! Could you put these assert on line 163, so that it looks clear

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address 6e6489b reorder the assertions.

val segments = lines(i + headerSize).split('|')
QueryOutput(
className = segments(1).trim,
funcName = segments(2).trim,
sql = segments(3).trim,
schema = segments(4).trim)
}

// Ensure consistency of the result file.
assert(numberOfQueries == expectedOutputs.size,
Copy link
Contributor

@beliefer beliefer Sep 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numberOfQueries == outputSize
Ah, I got it. But if a new function here, expectedOutputs.size must not equal to numberOfQueries.
expectedOutputs.size == outputSize in fact.

Seq.tabulate(outputSize) { i =>
val numberOfQueries = lines(2).split(":")(1).trim.toInt
val numberOfMissingExamples = lines(3).split(":")(1).trim.toInt
val missingExamples = lines(4).split(":")(1).trim.split(",")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expectedMissingExamples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ~

s"numberOfMissingExamples: $numberOfMissingExamples " +
"record in result file. Try regenerating the result files.")

(missingExamples, expectedOutputs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done ~

@beliefer
Copy link
Contributor

beliefer commented Sep 3, 2020

Good catch! Except for some minor issues.

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128221 has finished for PR 29608 at commit cbb0fcd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128236 has finished for PR 29608 at commit 5a574c3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128222 has finished for PR 29608 at commit 85299b1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Copy link
Contributor

beliefer commented Sep 3, 2020

LGTM!

@LuciferYang
Copy link
Contributor Author

Thx for your review @maropu @srowen @beliefer ~

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128249 has finished for PR 29608 at commit 6e6489b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Copy link
Contributor Author

@maropu org.apache.spark.sql.hive.thriftserver.CliSuite.* failed... I think It doesn't seem to be caused by this pr, can you help trigger retest?

@maropu
Copy link
Member

maropu commented Sep 3, 2020

@maropu org.apache.spark.sql.hive.thriftserver.CliSuite.* failed... I think It doesn't seem to be caused by this pr, can you help trigger retest?

All the tests in GitHub Actions passed, so this PR looks fine.

@LuciferYang
Copy link
Contributor Author

Thx ~ @maropu

@maropu maropu closed this in 1de272f Sep 4, 2020
@maropu
Copy link
Member

maropu commented Sep 4, 2020

Thanks! Merged to master.

@LuciferYang LuciferYang deleted the sql-expression-schema branch June 6, 2022 03:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants