[SPARK-6037][SQL] Avoiding duplicate Parquet schema merging #4786

Closed
viirya wants to merge 1 commit

viirya
Member

@viirya viirya commented Feb 26, 2015

FilteringParquetRowInputFormat manually merges Parquet schemas before computing splits. This is redundant, because the schemas are already merged in ParquetRelation2, so we don't need to re-merge them in the InputFormat.

@SparkQA

SparkQA commented Feb 26, 2015

Test build #27996 has finished for PR 4786 at commit ef78a5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

The reason we needed a separate schema merge in FilteringParquetRowInputFormat was explained in #4768. I'm not sure why removing it doesn't break the test right now. I'll investigate this tomorrow; I guess #4775 made the difference.

@viirya
Member Author

viirya commented Feb 27, 2015

@liancheng #4768 only explains why the merging is needed. The problem is that the different schemas are already merged in ParquetRelation2 before the reading task is launched, so FilteringParquetRowInputFormat just performs the same merge again. We only need to get the already-merged schema from the configuration and use it.
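
To illustrate the idea (not the code in this patch), here is a minimal Java sketch. A plain `Map` stands in for the Hadoop configuration, and a schema is modeled as a comma-separated list of column names. The class name, the configuration key, and the toy merge are all hypothetical and chosen only for this example; the real key and schema types in Spark differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MergedSchemaSketch {
    // Hypothetical configuration key, chosen for this example only.
    static final String MERGED_SCHEMA_KEY = "example.parquet.mergedSchema";

    // Counts merge runs so the example can show that the merge happens once.
    static int mergeCount = 0;

    // Toy merge: a schema is a comma-separated list of column names, and
    // merging is the union of all columns in first-seen order.
    static String mergeSchemas(List<String> fileSchemas) {
        mergeCount++;
        Set<String> columns = new LinkedHashSet<>();
        for (String schema : fileSchemas) {
            columns.addAll(Arrays.asList(schema.split(",")));
        }
        return String.join(",", columns);
    }

    // Stand-in for the relation: merge once and publish the result
    // through the configuration.
    static void planRelation(List<String> fileSchemas, Map<String, String> conf) {
        conf.put(MERGED_SCHEMA_KEY, mergeSchemas(fileSchemas));
    }

    // Stand-in for the input format: read the already-merged schema
    // from the configuration instead of merging again.
    static String schemaForSplits(Map<String, String> conf) {
        String merged = conf.get(MERGED_SCHEMA_KEY);
        if (merged == null) {
            throw new IllegalStateException("merged schema missing from configuration");
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        planRelation(Arrays.asList("id,name", "id,age"), conf);
        System.out.println(schemaForSplits(conf));
        System.out.println("merges performed: " + mergeCount);
    }
}
```

The point of the sketch is only that the split-computing side reads a value and never calls the merge itself, so the merge cost is paid once per query rather than twice.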

@liancheng
Contributor

Oh I see, you didn't cancel the change, you just reused the merged schema. Makes sense, thanks! Merging to master and branch-1.3.

asfgit pushed a commit that referenced this pull request Feb 27, 2015
`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, it is duplicate because the schemas are already merged in `ParquetRelation2`. We don't need to re-merge them at `InputFormat`.

Author: Liang-Chi Hsieh <[email protected]>

Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:

ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.

(cherry picked from commit 4ad5153)
Signed-off-by: Cheng Lian <[email protected]>
@asfgit asfgit closed this in 4ad5153 Feb 27, 2015
@viirya viirya deleted the dup_parquet_schemas_merge branch December 27, 2023 18:17