[SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests #20473

Closed

Conversation

@HyukjinKwon (Member) commented Feb 1, 2018:

What changes were proposed in this pull request?

This PR proposes to log whether PyArrow and Pandas are installed, so we can tell whether the related tests are going to be skipped.
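
For context, a minimal sketch of the kind of per-interpreter probe this adds to the test runner (the wrapper function and its name are hypothetical; the subprocess-with-devnull pattern follows the fragments quoted later in this conversation):

```python
import os
import subprocess

def is_installed(python_exec, module):
    # Probe the target Python executable in a subprocess; stderr goes
    # to devnull so an ImportError traceback does not pollute the log.
    try:
        subprocess.check_output(
            [python_exec, "-c", "import %s" % module],
            stderr=open(os.devnull, 'w'))
        return True
    except subprocess.CalledProcessError:
        return False
```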

How was this patch tested?

Manually tested:

I don't have PyArrow installed in PyPy.

```
$ ./run-tests --python-executables=python3
...
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python3' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3' in 'pyspark-sql' module.
Starting test(python3): pyspark.mllib.tests
Starting test(python3): pyspark.sql.tests
Starting test(python3): pyspark.streaming.tests
Starting test(python3): pyspark.tests
```

```
$ ./run-tests --modules=pyspark-streaming
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-streaming']
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.streaming.util
Starting test(python2.7): pyspark.streaming.tests
Starting test(python2.7): pyspark.streaming.util
```

```
$ ./run-tests
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```
$ ./run-tests --modules=pyspark-sql --python-executables=pypy
...
Will test against the following Python executables: ['pypy']
Will test the following Python modules: ['pyspark-sql']
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.sql.catalog
Starting test(pypy): pyspark.sql.column
Starting test(pypy): pyspark.sql.conf
```

After some modification to produce other cases:

```
$ ./run-tests
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow 0.8.0 was found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.20.2 was found.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.22.0 was found.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```
$ ./run-tests-with-coverage
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Coverage is not installed in Python executable 'pypy' but 'COVERAGE_PROCESS_START' environment variable is set, exiting.
```

@HyukjinKwon (Member, Author):

@ueshin, @cloud-fan, @yhuai, @felixcheung and @BryanCutler, I tried to log it here. Could you take a look and see if it makes sense to you?

```python
try:
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
```
@HyukjinKwon (Member, Author):

Otherwise, it prints out the exception too, for example:

```
Will test the following Python modules: ['pyspark-sql']
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named foo
PyArrow is not installed in Python executable 'python2.7', skipping related tests in 'pyspark-sql'.
```

@SparkQA commented Feb 1, 2018:

Test build #86929 has finished for PR 20473 at commit 0261045.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Current Jenkins output was:

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-sql', 'pyspark-streaming', 'pyspark-mllib', 'pyspark-ml']
PyArrow is not installed in Python executable 'python2.7', skipping related tests in 'pyspark-sql'.
PyArrow is not installed in Python executable 'pypy', skipping related tests in 'pyspark-sql'.
Pandas is not installed in Python executable 'pypy', skipping related tests in 'pyspark-sql'.
Starting test(pypy): pyspark.sql.tests
Starting test(python2.7): pyspark.mllib.tests
```

@SparkQA commented Feb 1, 2018:

Test build #86930 has finished for PR 20473 at commit e7d752f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

LGTM.
As a future enhancement, perhaps we should also check the version loaded, e.g. the pyarrow version?

```python
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
except:
```
Contributor:

How about we also explicitly mention that pyarrow/pandas related tests will run if they are installed?

Contributor:

Actually, since we are here, is it possible to do the same thing as

```python
_have_pandas = False
_have_old_pandas = False
try:
    import pandas
    try:
        from pyspark.sql.utils import require_minimum_pandas_version
        require_minimum_pandas_version()
        _have_pandas = True
    except:
        _have_old_pandas = True
except:
    # No Pandas, but that's okay, we'll skip those tests
    pass
```

and

```python
_have_arrow = False
try:
    import pyarrow
    _have_arrow = True
except:
    # No Arrow, but that's okay, we'll skip those tests
    pass
```

?

It would be nice to use the same logic. Otherwise, even if we do not print the warning here, tests may still get skipped because of a version issue.

@HyukjinKwon (Member, Author):

Ah, hm. I believe we don't have access to our main pyspark here. Let me check if I can address your concern today (or late tonight KST).

@HyukjinKwon (Member, Author):

#20473 (comment) is easy, but I think #20473 (comment) makes things complicated.

Let me try it and show how it looks.

Contributor:

Thank you. Appreciate it.

```python
pyarrow_version = subprocess_check_output(
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
```
@HyukjinKwon (Member, Author):

I think I can't easily reuse require_minimum_pandas_version or require_minimum_pyarrow_version since it looks like we aren't guaranteed access to our main pyspark here. I tried to address the comments as best I could, and also updated the PR description. Please check the logs above.

@ueshin (Member) left a comment:

Thanks for working on this! I like this idea.
LGTM except for some comments.

```python
pyarrow_version = subprocess_check_output(
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
```
@ueshin (Member):

Let's have 0.8.0 as a variable in this file, or hopefully somewhere global if possible.

@HyukjinKwon (Member, Author):

yup.

@HyukjinKwon (Member, Author):

Ah, hm .. I'm not sure of a good place to put them as globals, so let me just make a variable here. I'll leave a comment there too.

```python
pandas_version = subprocess_check_output(
    [python_exec, "-c", "import pandas; print(pandas.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pandas_version) >= LooseVersion('0.19.2'):
```
@ueshin (Member):

ditto.

@SparkQA commented Feb 2, 2018:

Test build #86965 has finished for PR 20473 at commit 014612a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

The message now seems much clearer :D.

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-sql', 'pyspark-streaming', 'pyspark-mllib', 'pyspark-ml']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas 0.16.0 was found.
Will test PyArrow related features against Python executable 'python3.4' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3.4' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas was not found.
```

@HyukjinKwon (Member, Author):

BTW @ueshin,

I think we should use require_minimum_pyarrow_version here:

```python
_have_arrow = False
try:
    import pyarrow
    _have_arrow = True
except:
    # No Arrow, but that's okay, we'll skip those tests
    pass
```

because we declared >= 0.8.0 as you already know:

```python
'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
```

Would you like me to double check other files and fix it separately, or would you be willing to fix it?
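
For illustration, the suggested change could look like this minimal sketch (assuming require_minimum_pyarrow_version raises when PyArrow is missing or older than the declared minimum; this is not the actual follow-up patch):

```python
_have_arrow = False
try:
    from pyspark.sql.utils import require_minimum_pyarrow_version
    # Raises when pyarrow is absent or below the declared minimum
    # (0.8.0), so both cases are handled by a single check.
    require_minimum_pyarrow_version()
    _have_arrow = True
except Exception:
    # No suitable Arrow; the related tests will simply be skipped.
    pass
```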

```python
# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and
# explicitly prints out. See SPARK-23300.
if pyspark_sql in modules_to_test:
    # Hyukjin: I think here is not the best place to leave versions for extra dependencies.
```
Member:

that, I'd agree...

@HyukjinKwon (Member, Author):

Not sure .. I was thinking of putting this in ./dev/sparktestsupport/modules.py too, but I believe this should be done separately. We should replace these too:

```python
if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):

if LooseVersion(pyarrow.__version__) < LooseVersion('0.8.0'):
```

@SparkQA commented Feb 2, 2018:

Test build #86972 has finished for PR 20473 at commit b726330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) commented Feb 2, 2018:

@HyukjinKwon Good catch! Yeah, we should use it there. Could you fix it please? Thanks!

@HyukjinKwon (Member, Author):

Will double check and open a PR tonight ..

@BryanCutler (Member) left a comment:

The print-outs look great! I'm not sure if there is a better way to get the required version numbers, but this is probably good for now, thanks @HyukjinKwon!

@HyukjinKwon (Member, Author):

retest this please

@SparkQA commented Feb 4, 2018:

Test build #87046 has finished for PR 20473 at commit b726330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```python
if pyspark_sql in modules_to_test:
    # TODO(HyukjinKwon): Relocate and deduplicate these version specifications.
    minimum_pyarrow_version = '0.8.0'
    minimum_pandas_version = '0.19.2'
```
@HyukjinKwon (Member, Author):

In the last commit,
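
For illustration only, a minimal sketch of how these minimum-version variables could drive the messages seen in the logs above (the variable names match the diff fragment; the print wiring itself is an assumption, not the exact patch):

```python
from distutils.version import LooseVersion

# pyarrow_version is the string probed from the target interpreter, or
# None when the import failed; see the subprocess fragments above.
if pyarrow_version is None:
    print("Will skip PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module. PyArrow >= %s is required; "
          "however, PyArrow was not found."
          % (python_exec, minimum_pyarrow_version))
elif LooseVersion(pyarrow_version) >= LooseVersion(minimum_pyarrow_version):
    print("Will test PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module." % python_exec)
else:
    print("Will skip PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module. PyArrow >= %s is required; "
          "however, PyArrow %s was found."
          % (python_exec, minimum_pyarrow_version, pyarrow_version))
```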

@SparkQA commented Feb 5, 2018:

Test build #87055 has finished for PR 20473 at commit fe2943e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 5, 2018:

Test build #87056 has finished for PR 20473 at commit 78f5879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Will merge this one if there are no more comments in a few days.

@ueshin (Member) commented Feb 5, 2018:

LGTM.

@HyukjinKwon (Member, Author):

Merged to master.

@HyukjinKwon (Member, Author):

Thank you @felixcheung, @yhuai, @ueshin and @BryanCutler for reviewing this.

@asfgit closed this in 8141c3e on Feb 6, 2018
@HyukjinKwon (Member, Author):

Let me actually backport this to branch-2.3. I think there isn't any downside or harm in backporting it.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Feb 7, 2018
…r not in PySpark SQL tests

Author: hyukjinkwon <[email protected]>

Closes apache#20473 from HyukjinKwon/SPARK-23300.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
… installed or not in PySpark SQL tests

This PR backports #20473 to branch-2.3.

Author: hyukjinkwon <[email protected]>

Closes #20533 from HyukjinKwon/backport-20473.
@HyukjinKwon deleted the SPARK-23300 branch on October 16, 2018 at 12:44