[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

HyukjinKwon · 2017-03-21T09:53:06Z

What changes were proposed in this pull request?

This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x.

How was this patch tested?

Manually, via

./run-tests --python-executables=python3.6

Finished test(python3.6): pyspark.conf (5s)
Finished test(python3.6): pyspark.broadcast (7s)
Finished test(python3.6): pyspark.accumulators (9s)
Finished test(python3.6): pyspark.rdd (16s)
Finished test(python3.6): pyspark.shuffle (0s)
Finished test(python3.6): pyspark.serializers (11s)
Finished test(python3.6): pyspark.profiler (5s)
Finished test(python3.6): pyspark.context (21s)
Finished test(python3.6): pyspark.ml.clustering (12s)
Finished test(python3.6): pyspark.ml.feature (16s)
Finished test(python3.6): pyspark.ml.classification (16s)
Finished test(python3.6): pyspark.ml.recommendation (16s)
Finished test(python3.6): pyspark.ml.tuning (14s)
Finished test(python3.6): pyspark.ml.regression (16s)
Finished test(python3.6): pyspark.ml.evaluation (12s)
Finished test(python3.6): pyspark.ml.tests (17s)
Finished test(python3.6): pyspark.mllib.classification (18s)
Finished test(python3.6): pyspark.mllib.evaluation (12s)
Finished test(python3.6): pyspark.mllib.feature (19s)
Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
Finished test(python3.6): pyspark.mllib.fpm (12s)
Finished test(python3.6): pyspark.mllib.clustering (31s)
Finished test(python3.6): pyspark.mllib.random (8s)
Finished test(python3.6): pyspark.mllib.linalg.distributed (17s)
Finished test(python3.6): pyspark.mllib.recommendation (23s)
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
Finished test(python3.6): pyspark.mllib.stat._statistics (13s)
Finished test(python3.6): pyspark.mllib.regression (22s)
Finished test(python3.6): pyspark.mllib.util (9s)
Finished test(python3.6): pyspark.mllib.tree (14s)
Finished test(python3.6): pyspark.sql.types (9s)
Finished test(python3.6): pyspark.sql.context (16s)
Finished test(python3.6): pyspark.sql.column (14s)
Finished test(python3.6): pyspark.sql.group (16s)
Finished test(python3.6): pyspark.sql.dataframe (25s)
Finished test(python3.6): pyspark.tests (164s)
Finished test(python3.6): pyspark.sql.window (6s)
Finished test(python3.6): pyspark.sql.functions (19s)
Finished test(python3.6): pyspark.streaming.util (0s)
Finished test(python3.6): pyspark.sql.readwriter (24s)
Finished test(python3.6): pyspark.sql.tests (38s)
Finished test(python3.6): pyspark.mllib.tests (133s)
Finished test(python3.6): pyspark.streaming.tests (189s)
Tests passed in 380 seconds

… cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? Currently, PySpark does not work with Python 3.6.0. Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all: ``` Traceback (most recent call last): File ".../spark/python/pyspark/shell.py", line 30, in <module> import pyspark File ".../spark/python/pyspark/__init__.py", line 46, in <module> from pyspark.context import SparkContext File ".../spark/python/pyspark/context.py", line 36, in <module> from pyspark.java_gateway import launch_gateway File ".../spark/python/pyspark/java_gateway.py", line 31, in <module> from py4j.java_gateway import java_import, JavaGateway, GatewayClient File "<frozen importlib._bootstrap>", line 961, in _find_and_load File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked File "<frozen importlib._bootstrap>", line 646, in _load_unlocked File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module> File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module> import pkgutil File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module> ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple cls = _old_namedtuple(*args, **kwargs) TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module' ``` The root cause seems because some arguments of `namedtuple` are now completely keyword-only arguments from Python 3.6.0 (See https://bugs.python.org/issue25628). We currently copy this function via `types.FunctionType` which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`) and this seems causing internally missing values in the function (non-bound arguments). This PR proposes to work around this by manually setting it via `kwargs` as `types.FunctionType` seems not supporting to set this. Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0. ## How was this patch tested? Manually tested with Python 2.7.6 and Python 3.6.0. ``` ./bin/pyspsark ``` , manual creation of `namedtuple` both in local and rdd with Python 3.6.0, and Jenkins tests for other Python versions. Also, ``` ./run-tests --python-executables=python3.6 ``` ``` Will test against the following Python executables: ['python3.6'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Finished test(python3.6): pyspark.sql.tests (192s) Finished test(python3.6): pyspark.accumulators (3s) Finished test(python3.6): pyspark.mllib.tests (198s) Finished test(python3.6): pyspark.broadcast (3s) Finished test(python3.6): pyspark.conf (2s) Finished test(python3.6): pyspark.context (14s) Finished test(python3.6): pyspark.ml.classification (21s) Finished test(python3.6): pyspark.ml.evaluation (11s) Finished test(python3.6): pyspark.ml.clustering (20s) Finished test(python3.6): pyspark.ml.linalg.__init__ (0s) Finished test(python3.6): pyspark.streaming.tests (240s) Finished test(python3.6): pyspark.tests (240s) Finished test(python3.6): pyspark.ml.recommendation (19s) Finished test(python3.6): pyspark.ml.feature (36s) Finished test(python3.6): pyspark.ml.regression (37s) Finished test(python3.6): pyspark.ml.tuning (28s) Finished test(python3.6): pyspark.mllib.classification (26s) Finished test(python3.6): pyspark.mllib.evaluation (18s) Finished test(python3.6): pyspark.mllib.clustering (44s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.feature (26s) Finished test(python3.6): pyspark.mllib.fpm (23s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.ml.tests (92s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.linalg.distributed (25s) Finished test(python3.6): pyspark.mllib.stat._statistics (15s) Finished test(python3.6): pyspark.mllib.recommendation (24s) Finished test(python3.6): pyspark.mllib.regression (26s) Finished test(python3.6): pyspark.profiler (9s) Finished test(python3.6): pyspark.mllib.tree (16s) Finished test(python3.6): pyspark.shuffle (1s) Finished test(python3.6): pyspark.mllib.util (18s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.rdd (20s) Finished test(python3.6): pyspark.sql.conf (8s) Finished test(python3.6): pyspark.sql.catalog (17s) Finished test(python3.6): pyspark.sql.column (18s) Finished test(python3.6): pyspark.sql.context (18s) Finished test(python3.6): pyspark.sql.group (27s) Finished test(python3.6): pyspark.sql.dataframe (33s) Finished test(python3.6): pyspark.sql.functions (35s) Finished test(python3.6): pyspark.sql.types (6s) Finished test(python3.6): pyspark.sql.streaming (13s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.session (16s) Finished test(python3.6): pyspark.sql.window (4s) Finished test(python3.6): pyspark.sql.readwriter (35s) Tests passed in 433 seconds ``` Author: hyukjinkwon <[email protected]> Closes apache#16429 from HyukjinKwon/SPARK-19019.

SparkQA · 2017-03-21T10:28:07Z

Test build #74972 has finished for PR 17375 at commit 11e5cd7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2017-03-21T14:41:44Z

cc @davies

davies · 2017-03-22T00:51:37Z

lgtm

HyukjinKwon · 2017-03-23T00:22:03Z

(I will close as soon as it gets merged and the one against branch-2.0 too)

HyukjinKwon · 2017-03-27T05:06:17Z

gentle ping @davies

HyukjinKwon · 2017-03-28T01:23:47Z

cc @holdenk too.

holdenk · 2017-03-30T01:27:43Z

tentative looks good, my only question is if someone wants to use Python 3.6 (first released December 2016) are they likely to want to use it with Spark 1.6 (first released January 2016)?

HyukjinKwon · 2017-03-30T01:35:45Z

Yea, it might be less important but I guess still it is a valid backport.

zero323 · 2017-03-30T07:05:48Z

@holdenk At least some users have more control over Python environment than a whole cluster setup, and with Anaconda defaulting now to 3.6, it is an annoyance. Assuming there will be 1.6.4, it makes more sense to patch than document maximum supported version.

HyukjinKwon · 2017-04-02T09:17:55Z

(gentle ping @holdenk and @davies)

holdenk · 2017-04-03T22:25:30Z

Anaconda default to 3.6 definitely makes this make more sense, thanks @zero323 I had forgotten that. I'll give @davies until next week to say anything about this but otherwise I think the set of backports for this issue make sense.

HyukjinKwon · 2017-04-10T08:13:33Z

@holdenk Could this be merged maybe?

HyukjinKwon · 2017-04-12T03:56:20Z

@JoshRosen, do you mind if I ask a quick look here? I know you know PySpark well. I think this backport got a sign-off and a positive comment from both committers.

HyukjinKwon · 2017-04-17T02:28:59Z

gentle ping ...

holdenk · 2017-04-17T16:56:20Z

LGTM I'll merge this.

…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports #16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.conf (5s) Finished test(python3.6): pyspark.broadcast (7s) Finished test(python3.6): pyspark.accumulators (9s) Finished test(python3.6): pyspark.rdd (16s) Finished test(python3.6): pyspark.shuffle (0s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.profiler (5s) Finished test(python3.6): pyspark.context (21s) Finished test(python3.6): pyspark.ml.clustering (12s) Finished test(python3.6): pyspark.ml.feature (16s) Finished test(python3.6): pyspark.ml.classification (16s) Finished test(python3.6): pyspark.ml.recommendation (16s) Finished test(python3.6): pyspark.ml.tuning (14s) Finished test(python3.6): pyspark.ml.regression (16s) Finished test(python3.6): pyspark.ml.evaluation (12s) Finished test(python3.6): pyspark.ml.tests (17s) Finished test(python3.6): pyspark.mllib.classification (18s) Finished test(python3.6): pyspark.mllib.evaluation (12s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (12s) Finished test(python3.6): pyspark.mllib.clustering (31s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.mllib.linalg.distributed (17s) Finished test(python3.6): pyspark.mllib.recommendation (23s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.stat._statistics (13s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.util (9s) Finished test(python3.6): pyspark.mllib.tree (14s) Finished test(python3.6): pyspark.sql.types (9s) Finished test(python3.6): pyspark.sql.context (16s) Finished test(python3.6): pyspark.sql.column (14s) Finished test(python3.6): pyspark.sql.group (16s) Finished test(python3.6): pyspark.sql.dataframe (25s) Finished test(python3.6): pyspark.tests (164s) Finished test(python3.6): pyspark.sql.window (6s) Finished test(python3.6): pyspark.sql.functions (19s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.readwriter (24s) Finished test(python3.6): pyspark.sql.tests (38s) Finished test(python3.6): pyspark.mllib.tests (133s) Finished test(python3.6): pyspark.streaming.tests (189s) Tests passed in 380 seconds ``` Author: hyukjinkwon <[email protected]> Closes #17375 from HyukjinKwon/SPARK-19019-backport-1.6.

holdenk · 2017-04-17T17:00:39Z

Merged into 1.6

…e` and port cloudpickle changes for PySpark to work with Python 3.6.0 ## What changes were proposed in this pull request? This PR proposes to backports apache#16429 to branch-1.6 so that Python 3.6.0 works with Spark 1.6.x. ## How was this patch tested? Manually, via ``` ./run-tests --python-executables=python3.6 ``` ``` Finished test(python3.6): pyspark.conf (5s) Finished test(python3.6): pyspark.broadcast (7s) Finished test(python3.6): pyspark.accumulators (9s) Finished test(python3.6): pyspark.rdd (16s) Finished test(python3.6): pyspark.shuffle (0s) Finished test(python3.6): pyspark.serializers (11s) Finished test(python3.6): pyspark.profiler (5s) Finished test(python3.6): pyspark.context (21s) Finished test(python3.6): pyspark.ml.clustering (12s) Finished test(python3.6): pyspark.ml.feature (16s) Finished test(python3.6): pyspark.ml.classification (16s) Finished test(python3.6): pyspark.ml.recommendation (16s) Finished test(python3.6): pyspark.ml.tuning (14s) Finished test(python3.6): pyspark.ml.regression (16s) Finished test(python3.6): pyspark.ml.evaluation (12s) Finished test(python3.6): pyspark.ml.tests (17s) Finished test(python3.6): pyspark.mllib.classification (18s) Finished test(python3.6): pyspark.mllib.evaluation (12s) Finished test(python3.6): pyspark.mllib.feature (19s) Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s) Finished test(python3.6): pyspark.mllib.fpm (12s) Finished test(python3.6): pyspark.mllib.clustering (31s) Finished test(python3.6): pyspark.mllib.random (8s) Finished test(python3.6): pyspark.mllib.linalg.distributed (17s) Finished test(python3.6): pyspark.mllib.recommendation (23s) Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s) Finished test(python3.6): pyspark.mllib.stat._statistics (13s) Finished test(python3.6): pyspark.mllib.regression (22s) Finished test(python3.6): pyspark.mllib.util (9s) Finished test(python3.6): pyspark.mllib.tree (14s) Finished test(python3.6): pyspark.sql.types (9s) Finished test(python3.6): pyspark.sql.context (16s) Finished test(python3.6): pyspark.sql.column (14s) Finished test(python3.6): pyspark.sql.group (16s) Finished test(python3.6): pyspark.sql.dataframe (25s) Finished test(python3.6): pyspark.tests (164s) Finished test(python3.6): pyspark.sql.window (6s) Finished test(python3.6): pyspark.sql.functions (19s) Finished test(python3.6): pyspark.streaming.util (0s) Finished test(python3.6): pyspark.sql.readwriter (24s) Finished test(python3.6): pyspark.sql.tests (38s) Finished test(python3.6): pyspark.mllib.tests (133s) Finished test(python3.6): pyspark.streaming.tests (189s) Tests passed in 380 seconds ``` Author: hyukjinkwon <[email protected]> Closes apache#17375 from HyukjinKwon/SPARK-19019-backport-1.6. (cherry picked from commit 6b315f3)

HyukjinKwon closed this Apr 17, 2017

HyukjinKwon deleted the SPARK-19019-backport-1.6 branch April 17, 2017 20:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

HyukjinKwon commented Mar 21, 2017

SparkQA commented Mar 21, 2017

HyukjinKwon commented Mar 21, 2017

davies commented Mar 22, 2017

HyukjinKwon commented Mar 23, 2017

HyukjinKwon commented Mar 27, 2017

HyukjinKwon commented Mar 28, 2017

holdenk commented Mar 30, 2017

HyukjinKwon commented Mar 30, 2017

zero323 commented Mar 30, 2017

HyukjinKwon commented Apr 2, 2017

holdenk commented Apr 3, 2017

HyukjinKwon commented Apr 10, 2017

HyukjinKwon commented Apr 12, 2017

HyukjinKwon commented Apr 17, 2017

holdenk commented Apr 17, 2017

holdenk commented Apr 17, 2017

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked collections.namedtuple and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

Conversation

HyukjinKwon commented Mar 21, 2017

What changes were proposed in this pull request?

How was this patch tested?

SparkQA commented Mar 21, 2017

HyukjinKwon commented Mar 21, 2017

davies commented Mar 22, 2017

HyukjinKwon commented Mar 23, 2017

HyukjinKwon commented Mar 27, 2017

HyukjinKwon commented Mar 28, 2017

holdenk commented Mar 30, 2017

HyukjinKwon commented Mar 30, 2017

zero323 commented Mar 30, 2017

HyukjinKwon commented Apr 2, 2017

holdenk commented Apr 3, 2017

HyukjinKwon commented Apr 10, 2017

HyukjinKwon commented Apr 12, 2017

HyukjinKwon commented Apr 17, 2017

holdenk commented Apr 17, 2017

holdenk commented Apr 17, 2017

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375

[SPARK-19019][PYTHON][BRANCH-1.6] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0 #17375