
[SPARK-14056] Appends s3 specific configurations and spark.hadoop configurations to hive configuration #11876

Closed
wants to merge 1 commit

Conversation

sitalkedia

What changes were proposed in this pull request?

Appends S3-specific configurations and spark.hadoop.* configurations to the Hive configuration.

How was this patch tested?

Tested by running a job on a cluster.



/**
* Appends s3 specific configurations and spark.hadoop configurations to hadoop
Member

Nits: "S3-specific", "a Hadoop". It doesn't need to return a Configuration.

Author

done.


/**
* Appends S3-specific configurations and spark.hadoop.* configurations to a Hadoop
Contributor

Maybe also mention the buffer keys that are added but don't fit into either of those categories.

Author

Good point! Changed it accordingly.


/**
* Appends S3-specific, spark.hadoop.*, and spark.spark.buffer.size configurations to a Hadoop
Member

Extra "spark." here in "spark.spark.buffer.size".
This might be a dumb question, but it feels kind of funny that we set these things when a Configuration is created everywhere except one place. Is it because the HiveConf necessarily comes from somewhere else that Spark isn't initializing? Just trying to rationalize treating it specially in this one case.

Author

Thanks, removed the extra "spark.".

You are right, HiveConf is being initialized in a separate code path that Spark isn't initializing properly. I am not familiar enough with the Hive side of things to comment on why it was done that way, but the TODO in TableReader.scala suggests that this is the right place to initialize the HiveConf.

Member

This is looking reasonable. If we're able to get a comment from @marmbrus about this particular aspect of the change (he put in the TODO in 9aadcff), maybe that would confirm that this does need special treatment and so this change makes sense.

Contributor

That is a very old TODO, but if I remember correctly, it was a result of me omitting code as I copied logic from Shark. It would be good to understand what the job that was tested on a cluster is doing, and why it needs this info in the Hive conf, just because long term we are moving further and further away from using Hive code.

No objections though if this is fixing something.

Author

@marmbrus - the job runs a simple Hive query to create a table. From what I understand of the code, the HiveConf is being initialized without the spark.hadoop.* configurations, and that HiveConf is then used to initialize the HadoopRDD. So the HadoopRDD does not contain any spark.hadoop.* configurations. This fix is meant to resolve that issue.
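[Editor's note] The general shape of the change discussed in this thread can be sketched as follows. This is an illustrative Python sketch, not Spark's actual Scala implementation: plain dicts stand in for SparkConf and HiveConf, and the specific S3 and buffer key names are assumptions for illustration.

```python
def append_spark_hadoop_confs(spark_conf, hadoop_conf):
    """Illustrative sketch: propagate S3 credentials, spark.hadoop.*
    entries, and a buffer-size key from a Spark conf into a
    Hadoop/Hive conf. Dicts stand in for SparkConf and Configuration.
    """
    # Propagate S3 credential keys if present (key names assumed).
    for key in ("fs.s3.awsAccessKeyId", "fs.s3.awsSecretAccessKey"):
        if key in spark_conf:
            hadoop_conf[key] = spark_conf[key]

    # Copy "spark.hadoop.foo" = "bar" into the Hadoop conf as "foo" = "bar".
    prefix = "spark.hadoop."
    for key, value in spark_conf.items():
        if key.startswith(prefix):
            hadoop_conf[key[len(prefix):]] = value

    # The buffer key mentioned in the review: map spark.buffer.size onto
    # Hadoop's io.file.buffer.size (mapping assumed for illustration).
    if "spark.buffer.size" in spark_conf:
        hadoop_conf["io.file.buffer.size"] = spark_conf["spark.buffer.size"]

    return hadoop_conf
```

The point of the fix is that the HiveConf is built on a code path that bypasses this kind of propagation, so the HadoopRDD built from it never sees the spark.hadoop.* settings unless they are appended explicitly.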

@srowen
Member

srowen commented Mar 29, 2016

Jenkins test this please

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54447 has finished for PR 11876 at commit 018eea6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 30, 2016

Jenkins retest this please

@SparkQA

SparkQA commented Mar 30, 2016

Test build #54515 has finished for PR 11876 at commit 018eea6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sitalkedia
Author

Thanks @srowen, not sure why the test failed, will take a look.

@sitalkedia
Author

@srowen - Thanks for taking a look, updated the diff to fix the test case.

@srowen
Member

srowen commented Apr 2, 2016

Jenkins retest this please

@SparkQA

SparkQA commented Apr 3, 2016

Test build #54784 has finished for PR 11876 at commit 98eee85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Apr 3, 2016

Merged to master

@asfgit closed this in 1cf7018 on Apr 3, 2016