[SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests #20473

Closed

Conversation

@HyukjinKwon (Member) commented Feb 1, 2018:

What changes were proposed in this pull request?

This PR proposes to log whether PyArrow and Pandas are installed, so we can tell whether the related tests are going to be skipped.
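
For context, a minimal sketch of the kind of per-interpreter probe this adds to the test runner (the wrapper function and its name are hypothetical; the subprocess-with-devnull pattern follows the fragments quoted later in this conversation):

```python
import os
import subprocess

def is_installed(python_exec, module):
    # Probe the target Python executable in a subprocess; stderr goes
    # to devnull so an ImportError traceback does not pollute the log.
    try:
        subprocess.check_output(
            [python_exec, "-c", "import %s" % module],
            stderr=open(os.devnull, 'w'))
        return True
    except subprocess.CalledProcessError:
        return False
```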

How was this patch tested?

Manually tested:

I don't have PyArrow installed in PyPy.

```
$ ./run-tests --python-executables=python3
...
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python3' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3' in 'pyspark-sql' module.
Starting test(python3): pyspark.mllib.tests
Starting test(python3): pyspark.sql.tests
Starting test(python3): pyspark.streaming.tests
Starting test(python3): pyspark.tests
```

```
$ ./run-tests --modules=pyspark-streaming
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-streaming']
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.streaming.util
Starting test(python2.7): pyspark.streaming.tests
Starting test(python2.7): pyspark.streaming.util
```

```
$ ./run-tests
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```
$ ./run-tests --modules=pyspark-sql --python-executables=pypy
...
Will test against the following Python executables: ['pypy']
Will test the following Python modules: ['pyspark-sql']
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.sql.catalog
Starting test(pypy): pyspark.sql.column
Starting test(pypy): pyspark.sql.conf
```

After some modification to produce other cases:

```
$ ./run-tests
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow 0.8.0 was found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.20.2 was found.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.22.0 was found.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```
$ ./run-tests-with-coverage
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Coverage is not installed in Python executable 'pypy' but 'COVERAGE_PROCESS_START' environment variable is set, exiting.
```

@HyukjinKwon (Member, Author):

@ueshin, @cloud-fan, @yhuai, @felixcheung and @BryanCutler, I tried to log it here. Could you take a look and see if it makes sense to you?

```python
try:
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
```
@HyukjinKwon (Member, Author):

Otherwise, it prints out the exception too, for example:

```
Will test the following Python modules: ['pyspark-sql']
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named foo
PyArrow is not installed in Python executable 'python2.7', skipping related tests in 'pyspark-sql'.
```

@SparkQA commented Feb 1, 2018:

Test build #86929 has finished for PR 20473 at commit 0261045.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Current Jenkins output was:

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-sql', 'pyspark-streaming', 'pyspark-mllib', 'pyspark-ml']
PyArrow is not installed in Python executable 'python2.7', skipping related tests in 'pyspark-sql'.
PyArrow is not installed in Python executable 'pypy', skipping related tests in 'pyspark-sql'.
Pandas is not installed in Python executable 'pypy', skipping related tests in 'pyspark-sql'.
Starting test(pypy): pyspark.sql.tests
Starting test(python2.7): pyspark.mllib.tests
```

@SparkQA commented Feb 1, 2018:

Test build #86930 has finished for PR 20473 at commit e7d752f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment:

LGTM.
As a future enhancement, perhaps we should also check the version loaded, e.g. the pyarrow version?

```python
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
except:
```
Contributor:

How about we also explicitly mention that pyarrow/pandas related tests will run if they are installed?

Contributor:

Actually, since we are here, is it possible to do the same thing as

```python
_have_pandas = False
_have_old_pandas = False
try:
    import pandas
    try:
        from pyspark.sql.utils import require_minimum_pandas_version
        require_minimum_pandas_version()
        _have_pandas = True
    except:
        _have_old_pandas = True
except:
    # No Pandas, but that's okay, we'll skip those tests
    pass
```

and

```python
_have_arrow = False
try:
    import pyarrow
    _have_arrow = True
except:
    # No Arrow, but that's okay, we'll skip those tests
    pass
```

?

It would be nice to use the same logic. Otherwise, even if we do not print the warning here, tests may still get skipped because of a version issue.

@HyukjinKwon (Member, Author):

Ah, hm. I believe we don't have access to our main pyspark here. Let me check if I can address your concern today (or late tonight KST).

@HyukjinKwon (Member, Author):

#20473 (comment) is easy, but I think #20473 (comment) makes things complicated.

Let me try it and show how it looks.

Contributor:

Thank you. Appreciate it.

```python
pyarrow_version = subprocess_check_output(
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
```
@HyukjinKwon (Member, Author):

I think I can't easily reuse require_minimum_pandas_version or require_minimum_pyarrow_version since it looks like we aren't guaranteed access to our main pyspark here. I tried to address the comments as best I could, and also updated the PR description. Please check the logs above.

@ueshin (Member) left a comment:

Thanks for working on this! I like this idea.
LGTM except for some comments.

```python
pyarrow_version = subprocess_check_output(
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
```
@ueshin (Member):

Let's have 0.8.0 as a variable in this file, or hopefully somewhere global if possible.

@HyukjinKwon (Member, Author):

yup.

@HyukjinKwon (Member, Author):

Ah, hm .. I'm not sure of a good place to put them as globals, so let me just make a variable here. I'll leave a comment there too.

```python
pandas_version = subprocess_check_output(
    [python_exec, "-c", "import pandas; print(pandas.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pandas_version) >= LooseVersion('0.19.2'):
```
@ueshin (Member):

ditto.

@SparkQA commented Feb 2, 2018:

Test build #86965 has finished for PR 20473 at commit 014612a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

The message now seems much clearer :D.

```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log
Will test against the following Python executables: ['python2.7', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-sql', 'pyspark-streaming', 'pyspark-mllib', 'pyspark-ml']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas 0.16.0 was found.
Will test PyArrow related features against Python executable 'python3.4' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3.4' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 0.19.2 is required; however, Pandas was not found.
```

@HyukjinKwon (Member, Author):

BTW @ueshin,

I think we should use require_minimum_pyarrow_version here:

```python
_have_arrow = False
try:
    import pyarrow
    _have_arrow = True
except:
    # No Arrow, but that's okay, we'll skip those tests
    pass
```

because we declared >= 0.8.0 as you already know:

```python
'sql': ['pandas>=0.19.2', 'pyarrow>=0.8.0']
```

Would you like me to double check other files and fix it separately, or would you be willing to fix it?
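
For illustration, the suggested change could look like this minimal sketch (assuming require_minimum_pyarrow_version raises when PyArrow is missing or older than the declared minimum; this is not the actual follow-up patch):

```python
_have_arrow = False
try:
    from pyspark.sql.utils import require_minimum_pyarrow_version
    # Raises when pyarrow is absent or below the declared minimum
    # (0.8.0), so both cases are handled by a single check.
    require_minimum_pyarrow_version()
    _have_arrow = True
except Exception:
    # No suitable Arrow; the related tests will simply be skipped.
    pass
```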

```python
# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and
# explicitly prints out. See SPARK-23300.
if pyspark_sql in modules_to_test:
    # Hyukjin: I think here is not the best place to leave versions for extra dependencies.
```
Member:

that, I'd agree...

@HyukjinKwon (Member, Author):

Not sure .. I was thinking of putting this in ./dev/sparktestsupport/modules.py too, but I believe this should be done separately. We should replace these too:

```python
if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):

if LooseVersion(pyarrow.__version__) < LooseVersion('0.8.0'):
```

@SparkQA commented Feb 2, 2018:

Test build #86972 has finished for PR 20473 at commit b726330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin (Member) commented Feb 2, 2018:

@HyukjinKwon Good catch! Yeah, we should use it there. Could you fix it please? Thanks!

@HyukjinKwon (Member, Author):

Will double check and open a PR tonight ..

@BryanCutler (Member) left a comment:

The print-outs look great! I'm not sure if there is a better way to get the required version numbers, but this is probably good for now, thanks @HyukjinKwon!

@HyukjinKwon (Member, Author):

retest this please

@SparkQA commented Feb 4, 2018:

Test build #87046 has finished for PR 20473 at commit b726330.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```python
if pyspark_sql in modules_to_test:
    # TODO(HyukjinKwon): Relocate and deduplicate these version specifications.
    minimum_pyarrow_version = '0.8.0'
    minimum_pandas_version = '0.19.2'
```
@HyukjinKwon (Member, Author):

In the last commit,
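
For illustration only, a minimal sketch of how these minimum-version variables could drive the messages seen in the logs above (the variable names match the diff fragment; the print wiring itself is an assumption, not the exact patch):

```python
from distutils.version import LooseVersion

# pyarrow_version is the string probed from the target interpreter, or
# None when the import failed; see the subprocess fragments above.
if pyarrow_version is None:
    print("Will skip PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module. PyArrow >= %s is required; "
          "however, PyArrow was not found."
          % (python_exec, minimum_pyarrow_version))
elif LooseVersion(pyarrow_version) >= LooseVersion(minimum_pyarrow_version):
    print("Will test PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module." % python_exec)
else:
    print("Will skip PyArrow related features against Python executable "
          "'%s' in 'pyspark-sql' module. PyArrow >= %s is required; "
          "however, PyArrow %s was found."
          % (python_exec, minimum_pyarrow_version, pyarrow_version))
```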

@SparkQA commented Feb 5, 2018:

Test build #87055 has finished for PR 20473 at commit fe2943e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Feb 5, 2018:

Test build #87056 has finished for PR 20473 at commit 78f5879.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author):

Will merge this one if there are no more comments in a few days.

@ueshin (Member) commented Feb 5, 2018:

LGTM.

@HyukjinKwon (Member, Author):

Merged to master.

@HyukjinKwon (Member, Author):

Thank you @felixcheung, @yhuai, @ueshin and @BryanCutler for reviewing this.

@asfgit closed this in 8141c3e on Feb 6, 2018
@HyukjinKwon (Member, Author):

Let me actually backport this to branch-2.3. I think there isn't any downside or harm in backporting it.

HyukjinKwon added a commit to HyukjinKwon/spark that referenced this pull request Feb 7, 2018
…r not in PySpark SQL tests

Author: hyukjinkwon <[email protected]>

Closes apache#20473 from HyukjinKwon/SPARK-23300.
asfgit pushed a commit that referenced this pull request Feb 8, 2018
… installed or not in PySpark SQL tests

This PR backports #20473 to branch-2.3.

Author: hyukjinkwon <[email protected]>

Closes #20533 from HyukjinKwon/backport-20473.
@HyukjinKwon deleted the SPARK-23300 branch on October 16, 2018 at 12:44