
[SPARK-30091][SQL][Python] Document mergeSchema option directly in the PySpark Parquet APIs #26730

Closed
Wants to merge 2 commits.

Conversation

@nchammas (Contributor) commented Dec 2, 2019

What changes were proposed in this pull request?

This change properly documents the mergeSchema option directly in the Python APIs for reading Parquet data.

Why are the changes needed?

The docstring for DataFrameReader.parquet() mentions mergeSchema but doesn't show it in the API. It seems like a simple oversight.

Before this PR, you'd have to do this to use mergeSchema:

```python
spark.read.option('mergeSchema', True).parquet('test-parquet').show()
```

After this PR, you can use the option as (I believe) it was intended to be used:

```python
spark.read.parquet('test-parquet', mergeSchema=True).show()
```

Does this PR introduce any user-facing change?

Yes, this PR changes the signatures of DataFrameReader.parquet() and DataStreamReader.parquet() to match their docstrings.
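
For reference, here is a minimal sketch of the shape of the updated reader method, assuming the usual PySpark reader plumbing (`OptionUtils._set_opts`, `_to_seq` in `pyspark/sql/readwriter.py`); treat it as an approximation of the change rather than the exact patch:

```python
# Sketch of the post-PR signature; helper names follow other PySpark readers.
@since(1.4)
def parquet(self, *paths, **options):
    """Loads Parquet files, returning the result as a :class:`DataFrame`.

    :param mergeSchema: sets whether we should merge schemas collected from all
        Parquet part-files. This overrides ``spark.sql.parquet.mergeSchema``;
        the default is taken from that session configuration.
    """
    mergeSchema = options.get('mergeSchema', None)
    self._set_opts(mergeSchema=mergeSchema)
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
```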

How was this patch tested?

Testing the mergeSchema option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works.

I also confirmed that setting spark.sql.parquet.mergeSchema at the session level does not get overridden by leaving mergeSchema at its default when calling parquet():

```
>>> spark.conf.set('spark.sql.parquet.mergeSchema', True)
>>> spark.range(3).write.parquet('test-parquet/id')
>>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show()
+----+----+
|  id|name|
+----+----+
|null|   1|
|null|   2|
|null|   0|
|   1|null|
|   2|null|
|   0|null|
+----+----+
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show()
+----+
|  id|
+----+
|null|
|null|
|null|
|   1|
|   2|
|   0|
+----+
```
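
The manual check above could be automated roughly as follows. This is a hypothetical sketch for illustration only, not part of this patch (the session setup and assertions are mine):

```python
# Hypothetical automation of the manual mergeSchema check above.
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[2]').appName('mergeSchema-check').getOrCreate()
shutil.rmtree('test-parquet', ignore_errors=True)

spark.range(3).write.parquet('test-parquet/id')
spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')

# With mergeSchema=True the schemas of the two directories are unioned.
merged = spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=True)
assert set(merged.columns) == {'id', 'name'}

# With mergeSchema=False only one of the schemas is picked up.
unmerged = spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False)
assert len(unmerged.columns) == 1
```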

@nchammas changed the title from "Document mergeSchema option directly in the Python API" to "[SQL][Python] Document mergeSchema option directly in the Python API" on Dec 2, 2019

@SparkQA commented Dec 2, 2019

Test build #114693 has finished for PR 26730 at commit da50864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Seems fine, but would you mind filing a JIRA please, @nchammas?

@nchammas changed the title from "[SQL][Python] Document mergeSchema option directly in the Python API" to "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the Python API" on Dec 2, 2019
```
@@ -300,18 +300,20 @@ def table(self, tableName):
        return self._df(self._jreader.table(tableName))

    @since(1.4)
    def parquet(self, *paths):
```

@nchammas (Contributor, Author) commented:

Side question for you @HyukjinKwon: The *paths parameter bothers me a bit. None of the other load methods use this pattern, and the streaming version of parquet() doesn't use it either. How would you feel about a separate PR changing this to paths? I suppose the 3.0 release would be our chance to do it, since it changes the API.

@HyukjinKwon (Member) commented Dec 2, 2019:

I think `*paths` keeps this consistent with the Scala and Java side (`def parquet(paths: String*)`), so technically it's more correct to support `*paths`.

In the streaming case, it likewise matches the Scala / Java side, `def parquet(path: String)`.

Maybe we should introduce a keyword-only argument (as you said earlier somewhere) after completely dropping Python 2 in Spark 3.1 ... I am not sure about this yet.
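
For illustration only, a keyword-only option of the kind mentioned above would look roughly like this in Python 3 syntax; this is a hypothetical sketch, not something this PR (or Spark) implements:

```python
# Hypothetical sketch of a keyword-only reader option; not the PR's change.
def parquet(*paths, mergeSchema=None, **options):
    """Anything declared after *paths can only be passed by keyword."""
    print(paths, mergeSchema, options)

parquet('data/part1', 'data/part2', mergeSchema=True)  # OK
parquet('data/part1', True)  # True lands in `paths`, not in `mergeSchema`
```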

@HyukjinKwon (Member) commented:

Ah, a conflict ... can you resolve it please, @nchammas?

@HyukjinKwon (Member) left a review comment:

LGTM if the conflicts are resolved.

@SparkQA commented Dec 4, 2019

Test build #114812 has finished for PR 26730 at commit 33770b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas changed the title from "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the Python API" to "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the PySpark Parquet APIs" on Dec 4, 2019

@HyukjinKwon (Member) commented:

Merged to master.

@nchammas deleted the parquet-merge-schema branch on December 4, 2019 02:38
HyukjinKwon pushed a commit that referenced this pull request Dec 4, 2019
… APIs

### What changes were proposed in this pull request?

This PR is a follow-up to #24043 and a cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs.

### Why are the changes needed?

So the Python API matches the Scala API.

### Does this PR introduce any user-facing change?

Yes, it adds a new option directly in the ORC reader method signatures.

### How was this patch tested?

I tested this manually as follows:

```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```

Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…e PySpark Parquet APIs

Closes apache#26730 from nchammas/parquet-merge-schema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
… APIs

Closes apache#26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>