[SPARK-16472][SQL] Force user specified schema to the nullable one #14124
Conversation
Hi @gatorsmile and @marmbrus, I saw the discussion and it seems you are both involved with this one. Could you please review it?
```diff
@@ -879,6 +879,19 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes
       }
     }
   }
 
+  test("Check if fields in the schema are nullable") {
```
This code path forces the schema to be nullable but it had no tests, so tests were added.
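For context, the assertion being added might look like this minimal sketch (`spark` and `path` are assumed from a test fixture; this is not the exact test code from the PR):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Even when the user asks for a non-nullable field, the schema that the
// file-based read reports should come back nullable.
val userSchema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val readBack = spark.read.schema(userSchema).json(path)
assert(readBack.schema.forall(_.nullable))
```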
Test build #62049 has finished for PR 14124 at commit
Test build #62053 has finished for PR 14124 at commit
Test build #62054 has finished for PR 14124 at commit
```scala
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.show()
```

When the user-specified schema is not nullable and the data contains nulls, the null values in the result are silently converted.
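One way to see the mismatch is to compare the nullability the user asked for with the nullability the reader actually returns (a hedged illustration, reusing `schema` and `df` from the snippet above):

```scala
// The user-specified schema says nullable = false, but the returned data
// still has to absorb the nulls in the input somehow, hence the silent conversion.
schema.foreach(f => println(s"specified ${f.name}: nullable=${f.nullable}"))
df.schema.foreach(f => println(s"returned  ${f.name}: nullable=${f.nullable}"))
```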
Ah, yes, it seems like a bug to me. I thought it would throw an exception in that case, or work fine after this PR. I will test this before/after this PR. Thanks!
Oh, I see. Before this patch:

After this patch:
FYI, currently (before this patch) the code below:

```scala
val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : null}"))
val schema = StructType(StructField("a", StringType, nullable = false) :: Nil)
val df = spark.read.schema(schema).json(rdd)
df.show()
```

fails with the exception below:
It seems like unexpected behaviour anyway (at the least, I think we should fix the error message). I will submit a separate patch if this one is decided not to be worth adding. Thanks again, @gatorsmile!
@HyukjinKwon Whether or not this PR is merged, I still think we should fix the above issue. Silent conversion does not look good to me.
@gatorsmile I am a bit confused about whether we are allowed to read JSON (via a user-specified, non-nullable schema) in the first place. But, yes, I think I agree that it is a potential problem anyway (even if the case above is not allowed).
@HyukjinKwon Your patch solves this inconsistency by forcing the schema to be nullable in all cases. However, it looks like the Parquet case is for compatibility; is this the same for JSON? If not, why do we want to do this?
@viirya Thanks for your comment! Actually, that's what I want feedback about from @marmbrus. Forcing the schema to be nullable is already happening (see here) for all the data sources implementing that interface. So the purpose of this PR is to make all the read APIs consistent. The reason to make them consistent by forcing the schema to be nullable is what he said on the mailing list.
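To make "forcing the schema to be nullable" concrete: it means recursively marking every struct field, array element, and map value as nullable. Spark keeps an internal helper for this; the standalone sketch below is for illustration only, not the PR's code:

```scala
import org.apache.spark.sql.types._

// Recursively mark every struct field, array element, and map entry nullable.
def forceNullable(dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map(f =>
      f.copy(dataType = forceNullable(f.dataType), nullable = true)))
  case ArrayType(elementType, _) =>
    ArrayType(forceNullable(elementType), containsNull = true)
  case MapType(keyType, valueType, _) =>
    MapType(forceNullable(keyType), forceNullable(valueType), valueContainsNull = true)
  case other => other
}
```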
Actually, apparently, Parquet also reads and writes the schema with nullability correctly if we get rid of that forced conversion. @marmbrus, do you mind clarifying here, please? I think we might have to deal with this as a datasource-specific problem.
Could you take a look please, @marmbrus?
Gentle ping @marmbrus
What will happen if the given schema is wrong? It seems weird that we allow users to provide a schema while reading the data without validating it.
Thanks for the feedback, @cloud-fan! If the user-given schema is wrong, it is handled differently by each datasource.
Should we disallow specifying schemas for these (maybe ORC and Parquet)?
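As one concrete (hedged) instance of that per-source difference, reusing `rdd` from the snippets above: the JSON source does not fail when the schema names a column absent from the data; it just yields nulls:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// "no_such_field" does not exist in the JSON records, yet the read succeeds
// and the column is simply all null (behavior varies by data source).
val wrongSchema = StructType(StructField("no_such_field", StringType) :: Nil)
spark.read.schema(wrongSchema).json(rdd).show()
```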
BTW, this PR is not only about the user-given schema. Currently, batch reads always load data into a DataFrame with the schema forced to be nullable for these datasources. However, this does not happen when reading for streaming with the same datasources. So, this PR tries to make them all ignore the nullability in the schema, to be consistent.
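A hedged sketch of that batch/streaming inconsistency (`path` assumed; not code from the PR):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val userSchema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)

val batchDf  = spark.read.schema(userSchema).json(path)
val streamDf = spark.readStream.schema(userSchema).json(path)

// Before this PR, the batch read forces the schema nullable while the
// streaming read keeps the user-specified nullability, so these can differ.
println(batchDf.schema.map(_.nullable).mkString(", "))
println(streamDf.schema.map(_.nullable).mkString(", "))
```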
@cloud-fan If nullability should not be ignored, then I can fix this PR to make the read paths consistently not ignore it (and of course I will try to identify the related problems). In that case, I will work on what @gatorsmile pointed out in #14124 (comment) about JSON (and will check the other data sources as well). I will follow your decision. To cut all the comments above short (for other reviewers):
Test build #64066 has finished for PR 14124 at commit
Test build #64631 has finished for PR 14124 at commit
ffacb55 to bf56186 (force-pushed)
Test build #65360 has finished for PR 14124 at commit
f6be52b to 0bc06c6 (force-pushed)
Test build #65747 has finished for PR 14124 at commit
0bc06c6 to 3f153a3 (force-pushed)
Test build #67029 has finished for PR 14124 at commit
Sorry for the delay. After thinking about it again, I don't think it makes sense to allow users to specify nullability when reading a data source, as Spark SQL can't validate it. How about we turn the schema to nullable in the read path?
Thanks @cloud-fan, sure, that sounds great.
Oh wait, @cloud-fan, it seems that Parquet files, at least, could possibly be written with non-nullable fields. So, reading them without a user-specified schema might also cause an inconsistency between the schema read from structured streaming and the one read from the file sources. If you are not sure about this, I am fine with turning the schema into nullable in the read path only.
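To illustrate the Parquet point, a hedged, self-contained sketch (the temp path is made up): a DataFrame built with a non-nullable schema writes that field as required into the Parquet file, so the file itself can carry non-nullable metadata:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val strict = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1), Row(2))), strict)
df.write.mode("overwrite").parquet("/tmp/strict_parquet")

// What Spark reports when reading the file back without a user schema:
spark.read.parquet("/tmp/strict_parquet").printSchema()
```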
3f153a3 to 7306937 (force-pushed)
Actually, never mind. I think handling this in the read path is fine.
7306937 to d240c0d (force-pushed)
Test build #68474 has finished for PR 14124 at commit
Test build #68477 has finished for PR 14124 at commit
d240c0d to 1abaf1b (force-pushed)
Test build #70922 has finished for PR 14124 at commit
cc @liancheng, what do you think about the nullability change?
#17293 added data validation using schema information for the Parquet reader, as @gatorsmile suggested in https://www.mail-archive.com/[email protected]/msg39233.html.
Let me close this for a while. I will reopen it if it looks worthwhile. I think the same subject could be discussed in @kiszk's PR.
What changes were proposed in this pull request?
This PR proposes to force the user-specified schema to be nullable, as Spark SQL can't validate it.
How was this patch tested?
Unit tests added in FileStreamSourceSuite.scala and DataFrameReaderWriterSuite.scala.
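For reference, the streaming-side counterpart of the earlier batch check might resemble this minimal sketch (`spark` and `path` assumed from a test fixture; not the exact suite code):

```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// The streaming reader should also report nullable fields even when the
// user-specified schema says otherwise.
val userSchema = StructType(StructField("a", IntegerType, nullable = false) :: Nil)
val streamDf = spark.readStream.schema(userSchema).json(path)
assert(streamDf.schema.forall(_.nullable))
```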