
[SPARK-30091][SQL][Python] Document mergeSchema option directly in the PySpark Parquet APIs #26730

Closed
Wants to merge 2 commits.

Conversation

@nchammas (Contributor) commented Dec 2, 2019

What changes were proposed in this pull request?

This change properly documents the mergeSchema option directly in the Python APIs for reading Parquet data.

Why are the changes needed?

The docstring for DataFrameReader.parquet() mentions mergeSchema but doesn't show it in the API. It seems like a simple oversight.

Before this PR, you'd have to do this to use mergeSchema:

```python
spark.read.option('mergeSchema', True).parquet('test-parquet').show()
```

After this PR, you can use the option as (I believe) it was intended to be used:

```python
spark.read.parquet('test-parquet', mergeSchema=True).show()
```

Does this PR introduce any user-facing change?

Yes, this PR changes the signatures of DataFrameReader.parquet() and DataStreamReader.parquet() to match their docstrings.
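
For reference, here is a minimal sketch of the shape of the updated reader method, assuming the usual PySpark reader plumbing (`OptionUtils._set_opts`, `_to_seq` in `pyspark/sql/readwriter.py`); treat it as an approximation of the change rather than the exact patch:

```python
# Sketch of the post-PR signature; helper names follow other PySpark readers.
@since(1.4)
def parquet(self, *paths, **options):
    """Loads Parquet files, returning the result as a :class:`DataFrame`.

    :param mergeSchema: sets whether we should merge schemas collected from all
        Parquet part-files. This overrides ``spark.sql.parquet.mergeSchema``;
        the default is taken from that session configuration.
    """
    mergeSchema = options.get('mergeSchema', None)
    self._set_opts(mergeSchema=mergeSchema)
    return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths)))
```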

How was this patch tested?

Testing the mergeSchema option directly seems to be left to the Scala side of the codebase. I tested my change manually to confirm the API works.

I also confirmed that setting spark.sql.parquet.mergeSchema at the session level does not get overridden by leaving mergeSchema at its default when calling parquet():

```
>>> spark.conf.set('spark.sql.parquet.mergeSchema', True)
>>> spark.range(3).write.parquet('test-parquet/id')
>>> spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet').show()
+----+----+
|  id|name|
+----+----+
|null|   1|
|null|   2|
|null|   0|
|   1|null|
|   2|null|
|   0|null|
+----+----+
>>> spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False).show()
+----+
|  id|
+----+
|null|
|null|
|null|
|   1|
|   2|
|   0|
+----+
```
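
The manual check above could be automated roughly as follows. This is a hypothetical sketch for illustration only, not part of this patch (the session setup and assertions are mine):

```python
# Hypothetical automation of the manual mergeSchema check above.
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[2]').appName('mergeSchema-check').getOrCreate()
shutil.rmtree('test-parquet', ignore_errors=True)

spark.range(3).write.parquet('test-parquet/id')
spark.range(3).withColumnRenamed('id', 'name').write.parquet('test-parquet/name')

# With mergeSchema=True the schemas of the two directories are unioned.
merged = spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=True)
assert set(merged.columns) == {'id', 'name'}

# With mergeSchema=False only one of the schemas is picked up.
unmerged = spark.read.option('recursiveFileLookup', True).parquet('test-parquet', mergeSchema=False)
assert len(unmerged.columns) == 1
```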

@nchammas changed the title from "Document mergeSchema option directly in the Python API" to "[SQL][Python] Document mergeSchema option directly in the Python API" on Dec 2, 2019

@SparkQA commented Dec 2, 2019

Test build #114693 has finished for PR 26730 at commit da50864.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

Seems fine, but would you mind filing a JIRA please, @nchammas?

@nchammas changed the title from "[SQL][Python] Document mergeSchema option directly in the Python API" to "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the Python API" on Dec 2, 2019
```
@@ -300,18 +300,20 @@ def table(self, tableName):
        return self._df(self._jreader.table(tableName))

    @since(1.4)
    def parquet(self, *paths):
```

@nchammas (Contributor, Author) commented:

Side question for you @HyukjinKwon: The *paths parameter bothers me a bit. None of the other load methods use this pattern, and the streaming version of parquet() doesn't use it either. How would you feel about a separate PR changing this to paths? I suppose the 3.0 release would be our chance to do it, since it changes the API.

@HyukjinKwon (Member) commented Dec 2, 2019:

I think `*paths` keeps this consistent with the Scala and Java side (`def parquet(paths: String*)`), so technically it's more correct to support `*paths`.

In the streaming case, it likewise matches the Scala / Java side, `def parquet(path: String)`.

Maybe we should introduce a keyword-only argument (as you said earlier somewhere) after completely dropping Python 2 in Spark 3.1 ... I am not sure about this yet.
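
For illustration only, a keyword-only option of the kind mentioned above would look roughly like this in Python 3 syntax; this is a hypothetical sketch, not something this PR (or Spark) implements:

```python
# Hypothetical sketch of a keyword-only reader option; not the PR's change.
def parquet(*paths, mergeSchema=None, **options):
    """Anything declared after *paths can only be passed by keyword."""
    print(paths, mergeSchema, options)

parquet('data/part1', 'data/part2', mergeSchema=True)  # OK
parquet('data/part1', True)  # True lands in `paths`, not in `mergeSchema`
```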

@HyukjinKwon (Member) commented:

Ah, a conflict ... can you resolve it please, @nchammas?

@HyukjinKwon (Member) left a review comment:

LGTM if the conflicts are resolved.

@SparkQA commented Dec 4, 2019

Test build #114812 has finished for PR 26730 at commit 33770b5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nchammas changed the title from "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the Python API" to "[SPARK-30091][SQL][Python] Document mergeSchema option directly in the PySpark Parquet APIs" on Dec 4, 2019

@HyukjinKwon (Member) commented:

Merged to master.

@nchammas deleted the parquet-merge-schema branch on December 4, 2019 02:38
HyukjinKwon pushed a commit that referenced this pull request Dec 4, 2019
… APIs

### What changes were proposed in this pull request?

This PR is a follow-up to #24043 and a cousin of #26730. It exposes the `mergeSchema` option directly in the ORC APIs.

### Why are the changes needed?

So the Python API matches the Scala API.

### Does this PR introduce any user-facing change?

Yes, it adds a new option directly in the ORC reader method signatures.

### How was this patch tested?

I tested this manually as follows:

```
>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
```

Closes #26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…e PySpark Parquet APIs

Closes apache#26730 from nchammas/parquet-merge-schema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
… APIs

Closes apache#26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>