[SPARK-6869][PySpark] Add pyspark archives path to PYTHONPATH #5580
Conversation
Test build #30559 has started for PR 5580 at commit
Test build #30560 has started for PR 5580 at commit
Test build #30559 has finished for PR 5580 at commit
Test PASSed.
Test build #30560 has finished for PR 5580 at commit
Test PASSed.
// In yarn mode for a python app, if PYSPARK_ARCHIVES_PATH is in the user environment
// add pyspark archives to files that can be distributed with the job
if (args.isPython && clusterManager == YARN) {
  sys.env.get("PYSPARK_ARCHIVES_PATH").map { archives =>
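The hunk is truncated at the map; here is a minimal standalone sketch of what the body presumably does, namely merging the archives into the comma-separated list of files distributed with the job (the helper name and signature are hypothetical, not the actual patch):

// Hypothetical helper: fold the archives named by PYSPARK_ARCHIVES_PATH into
// an existing comma-separated file list, as the truncated hunk above suggests.
def distributeWithJob(currentFiles: Option[String]): Option[String] =
  sys.env.get("PYSPARK_ARCHIVES_PATH") match {
    // If the admin exported PYSPARK_ARCHIVES_PATH, append its comma-separated
    // archives to whatever files were already queued for distribution.
    case Some(archives) =>
      Some(currentFiles.map(_ + "," + archives).getOrElse(archives))
    case None => currentFiles
  }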
Does the user set this himself? What does he set it to?
The Spark administrator can set PYSPARK_ARCHIVES_PATH in spark-env.sh to the local or HDFS path of the pyspark zip, and then users can run Python applications as before.
@lianhuiwang Two high-level comments. First, why not just put it on --py-files?
@andrewor14 On the first question (why not just put it on --py-files): if we don't set PYTHONPATH for the executor in Client, the executor will throw "/usr/bin/python: No module named pyspark", because the executor cannot use --py-files. So we also need to set the PYTHONPATH environment variable in Client so that the executor can run with the pyspark module.
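To make that concrete, here is a rough sketch (the helper is hypothetical, not the actual Client.scala change). Since YARN localizes each distributed archive into the container's working directory, the executor's PYTHONPATH only needs the archives' base names:

import java.io.File

// Hypothetical helper: build the PYTHONPATH value for executors from the
// archive paths, keeping only base names because YARN localizes the archives
// into the container's working directory.
def executorPythonPath(archives: Seq[String]): String =
  archives.map(a => new File(a).getName).mkString(File.pathSeparator)

// executorPythonPath(Seq("/user/spark/pyspark.zip", "py4j-0.8.2.1-src.zip"))
// yields "pyspark.zip:py4j-0.8.2.1-src.zip" on Linux.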
@andrewor14 For the second question, I added two things: one, we zip the pyspark archives into pyspark/lib when we build the Spark jar; two, at submit time, if PYSPARK_ARCHIVES_PATH is not set and pyspark.zip does not exist, we zip the archives into pyspark/lib.
@tgravescs I think this PR would be useful for you; you can try it.
for (sparkHome <- sys.env.get("SPARK_HOME")) {
  val pyLibPath = Seq(sparkHome, "python", "lib").mkString(File.separator)
  val pyArchivesFile = new File(pyLibPath, "pyspark.zip")
  if (!pyArchivesFile.exists()) {
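The PR's commit log mentions "use ZipEntry" for this path; what follows is an illustrative, self-contained sketch of zipping a directory that way, not the PR's exact code:

import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

// Illustrative sketch only: recursively zip the contents of `src` into `out`,
// creating one ZipEntry per regular file.
def zipDir(src: File, out: File): Unit = {
  val zos = new ZipOutputStream(new FileOutputStream(out))
  try {
    def add(f: File, base: String): Unit = {
      if (f.isDirectory) {
        f.listFiles().foreach(child => add(child, base + f.getName + "/"))
      } else {
        zos.putNextEntry(new ZipEntry(base + f.getName))
        val in = new FileInputStream(f)
        try {
          val buf = new Array[Byte](8192)
          var n = in.read(buf)
          while (n > 0) { zos.write(buf, 0, n); n = in.read(buf) }
        } finally {
          in.close()
        }
        zos.closeEntry()
      }
    }
    src.listFiles().foreach(child => add(child, ""))
  } finally {
    zos.close()
  }
}

A call such as zipDir(new File(sparkHome, "python" + File.separator + "pyspark"), pyArchivesFile) would then produce the archive the hunk above checks for (the source directory here is an assumption).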
I think if we just make sure the zip is built during the build, then we don't need to do the zip in the code. Just require that it's already there.
I don't think that covers every case. Sometimes we upgrade just by copying spark.jar over, so zipping in code is good for that situation.
I'm not sure I follow: if you just upgrade spark.jar, then there are no changes to the Python scripts, so you don't need to put a new pyspark.zip there. If there are changes, then you either need to copy over the new Python scripts or put a new pyspark.zip on there. It seems putting a new pyspark.zip on there would be easier. Although I guess you need the Python scripts there anyway for client mode, so you probably need both.
In many cases I wouldn't expect a user to have write permissions on the python/lib directory. I would expect that to be a privileged operation. In that case the zip would fail.
Yes, I agree with you. Thanks.
@tgravescs Yes, I agree with your comments and have updated the PR. Can you review it again? Thanks.
Test build #31000 has started for PR 5580 at commit
If the user doesn't use make-distribution.sh and just compiles Spark with Maven or sbt, there is no pyspark.zip. So do we really not need to do the zip in the code?
@lianhuiwang As mentioned, can we have the zip happen during the package phase rather than in make-distribution.sh?
if (args.isPython && clusterManager == YARN) {
  var pyArchives: String = null
  if (sys.env.contains("PYSPARK_ARCHIVES_PATH")) {
    pyArchives = sys.env.get("PYSPARK_ARCHIVES_PATH").get
sys.env("PYSPARK_ARCHIVES_PATH")
? Or even:
sys.env.get("PYSPARK_ARCHIVES_PATH").getOrElse(
// code to figure out where the archives are
)
I have updated it to use Option, because I thought that in getOrElse(default), default must be a value and cannot be an expression.
Actually that's not true; getOrElse takes a closure:

def getOrElse[B >: A](default: ⇒ B): B

which is why you can do getOrElse(throw SomeException()) (look for it in Spark's code base).
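In other words, getOrElse evaluates its default by name, only when the Option (or key) is absent, so the fallback can be an arbitrary block. A small sketch; the fallback path here is hypothetical, not the PR's actual lookup logic:

// sys.env.getOrElse evaluates its second argument lazily, so the fallback can
// be a block of code, or even a throw.
def pysparkArchives(): String =
  sys.env.getOrElse("PYSPARK_ARCHIVES_PATH", {
    // Hypothetical fallback: derive the archive location from SPARK_HOME.
    val sparkHome = sys.env.getOrElse("SPARK_HOME",
      throw new IllegalStateException("SPARK_HOME is not set"))
    Seq(sparkHome, "python", "lib", "pyspark.zip").mkString(java.io.File.separator)
  })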
I tested this out and it is working fine for me with jdk7 for both cluster and client mode. Code looks good. My only comment is that I think we can now stop packaging the Python stuff in the assembly jar. @lianhuiwang have you looked at removing those? If it's not much work, it would be best to do it here. If it looks like quite a bit, we can file another jira for this. I would like to get this into 1.4 based on the discussions of ending support for jdk6.
Having heard nothing from @lianhuiwang, I'm going to commit this and file a follow-up to remove the Python files from the assembly jar.
Based on #5478, which provides a PYSPARK_ARCHIVES_PATH env variable: with this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR. @andrewor14 @sryza @Sephiroth-Lin Can you take a look at this? Thanks.

Author: Lianhui Wang <[email protected]>

Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:
66ffa43 [Lianhui Wang] Update Client.scala
c2ad0f9 [Lianhui Wang] Update Client.scala
1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
20402cd [Lianhui Wang] use ZipEntry
9d87c3f [Lianhui Wang] update scala style
e7bd971 [Lianhui Wang] address vanzin's comments
4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
e6b573b [Lianhui Wang] address vanzin's comments
f11f84a [Lianhui Wang] zip pyspark archives
5192cca [Lianhui Wang] update import path
3b1e4c8 [Lianhui Wang] address tgravescs's comments
9396346 [Lianhui Wang] put zip to make-distribution.sh
0d2baf7 [Lianhui Wang] update import paths
e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
31e8e06 [Lianhui Wang] update code style
9f31dac [Lianhui Wang] update code and add comments
f72987c [Lianhui Wang] add archives path to PYTHONPATH

(cherry picked from commit ebff732)
Signed-off-by: Thomas Graves <[email protected]>
Add `python/lib/pyspark.zip` to `.gitignore`. After merging #5580, `python/lib/pyspark.zip` will be generated when building Spark.

Author: zsxwing <[email protected]>

Closes #6017 from zsxwing/gitignore and squashes the following commits:
39b10c4 [zsxwing] Ignore python/lib/pyspark.zip
@tgravescs Sorry for my late reply. I think #6022 works for SPARK-7485.
Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark. Since PR #5580 created pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, update this to keep it consistent.

Author: linweizhong <[email protected]>

Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:
8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build
[ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node
- Spark supports pyspark on yarn cluster without deploying python libraries from Spark 1.4
- https://issues.apache.org/jira/browse/SPARK-6869
- apache/spark#5580, apache/spark#5478

Author: Jongyoul Lee <[email protected]>

Closes #118 from jongyoul/ZEPPELIN-18 and squashes the following commits:
a47e27c [Jongyoul Lee] - Fixed test script for spark 1.4.0
72a65fd [Jongyoul Lee] - Fixed test script for spark 1.4.0
ee6d100 [Jongyoul Lee] - Cleanup codes
47fd9c9 [Jongyoul Lee] - Cleanup codes
248e330 [Jongyoul Lee] - Cleanup codes
4cd10b5 [Jongyoul Lee] - Removed meaningless codes comments
c9cda29 [Jongyoul Lee] - Removed setting SPARK_HOME - Changed the location of pyspark's directory into interpreter/spark
ef240f5 [Jongyoul Lee] - Fixed typo
06002fd [Jongyoul Lee] - Fixed typo
4b35c8d [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Dummy for trigger
682986e [Jongyoul Lee] rebased
8a7bf47 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
ad610fb [Jongyoul Lee] rebased
94bdf30 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Fixed checkstyle
929333d [Jongyoul Lee] rebased
64b8195 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
0a2d90e [Jongyoul Lee] rebased
b05ae6e [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Excludes python/** from apache-rat
71e2a92 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Removed verbose setting
0ddb436 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - Followed spark's way to support pyspark - https://issues.apache.org/jira/browse/SPARK-6869 - apache/spark#5580 - https://github.com/apache/spark/pull/5478/files
1b192f6 [Jongyoul Lee] [ZEPPELIN-18] Remove setting SPARK_HOME for PySpark - Removed redundant dependency setting
32fd9e1 [Jongyoul Lee] [ZEPPELIN-18] Running pyspark without deploying python libraries to every yarn node - rebasing
Based on #5478, which provides a PYSPARK_ARCHIVES_PATH env variable: with this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran a Python application successfully on yarn-client and yarn-cluster with this PR.
@andrewor14 @sryza @Sephiroth-Lin Can you take a look at this? Thanks.