[SPARK-30113][SQL][Python] Expose mergeSchema option in PySpark's ORC APIs #26755

nchammas · 2019-12-04T02:11:37Z

What changes were proposed in this pull request?

This PR is a follow-up to #24043 and a cousin of #26730. It exposes the `mergeSchema` option directly in PySpark's ORC APIs.

Why are the changes needed?

So the Python API matches the Scala API.

Does this PR introduce any user-facing change?

Yes, it adds a new option directly in the ORC reader method signatures.

How was this patch tested?

I tested this manually as follows:

>>> spark.range(3).write.orc('test-orc')
>>> spark.range(3).withColumnRenamed('id', 'name').write.orc('test-orc/nested')
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
>>> spark.conf.set('spark.sql.orc.mergeSchema', True)
>>> spark.read.orc('test-orc', recursiveFileLookup=True)
DataFrame[id: bigint, name: bigint]
>>> spark.read.orc('test-orc', recursiveFileLookup=True, mergeSchema=False)
DataFrame[id: bigint]
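
The session above suggests a simple precedence rule: a `mergeSchema` value passed to the reader overrides `spark.sql.orc.mergeSchema`, and leaving it unset falls back to the session conf. A minimal pure-Python sketch of that rule follows. It is not PySpark source: `collect_options` and `effective_merge_schema` are hypothetical names, the conf is modeled as a plain dict, and the `false` fallback when the conf key is absent is an assumption.

```python
# Illustrative stand-in for the observed behavior; not PySpark code.
# collect_options and effective_merge_schema are hypothetical helpers.
from typing import Dict, Optional

# Session conf key used in the manual test above.
CONF_KEY = "spark.sql.orc.mergeSchema"


def collect_options(**options: Optional[bool]) -> Dict[str, str]:
    """Keep only the options the caller actually passed (drop None values)."""
    return {
        key: str(value).lower()
        for key, value in options.items()
        if value is not None
    }


def effective_merge_schema(
    reader_options: Dict[str, str], session_conf: Dict[str, str]
) -> bool:
    """An explicit reader option wins; otherwise use the session conf."""
    if "mergeSchema" in reader_options:
        return reader_options["mergeSchema"] == "true"
    # Assumption: treat a missing conf entry as "false".
    return session_conf.get(CONF_KEY, "false") == "true"


if __name__ == "__main__":
    conf = {CONF_KEY: "true"}
    # No explicit option: the session conf decides.
    print(effective_merge_schema(collect_options(mergeSchema=None), conf))
    # Explicit False overrides a conf of true.
    print(effective_merge_schema(collect_options(mergeSchema=False), conf))
```

Defaulting the keyword to `None` rather than `False` is what lets "not passed" be told apart from "explicitly disabled", which is the distinction the last two reads in the session exercise.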

SparkQA · 2019-12-04T02:41:21Z

Test build #114816 has finished for PR 26755 at commit 5e324af.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-12-04T02:44:05Z

Merged to master.

Closes apache#26755 from nchammas/SPARK-30113-ORC-mergeSchema.

Authored-by: Nicholas Chammas <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

expose mergeSchema in Python ORC APIs

5e324af

HyukjinKwon approved these changes Dec 4, 2019

HyukjinKwon closed this in c8922d9 Dec 4, 2019

nchammas deleted the SPARK-30113-ORC-mergeSchema branch December 4, 2019 02:45

nchammas mentioned this pull request Dec 20, 2019

[SPARK-30128][DOCS][PYTHON][SQL] Document/promote 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC #26958

Closed

zero323 mentioned this pull request Jan 7, 2020

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

Closed

47 tasks

dongjoon-hyun added the SQL label Feb 5, 2020
