[SPARK-23300][TESTS] Prints out if Pandas and PyArrow are installed or not in PySpark SQL tests #20473
Conversation
@ueshin, @cloud-fan, @yhuai, @felixcheung and @BryanCutler, I tried to log it here. Could you take a look and see if it makes sense to you?
Force-pushed from 0261045 to e7d752f.
python/run-tests.py (Outdated)
try:
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
Otherwise, it prints out the exception too, for example:
Will test the following Python modules: ['pyspark-sql']
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named foo
PyArrow is not installed in Python executable 'python2.7', skipping related tests in 'pyspark-sql'.
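The stderr redirection above is what keeps that traceback out of the console. Here is a self-contained sketch of the probe under discussion; the helper name is mine, not Spark's, and the real script wraps subprocess.check_output as subprocess_check_output:

```python
# A minimal sketch (not Spark's actual helper) of probing whether an optional
# module is importable in a given Python executable, while suppressing the
# ImportError traceback that the child process would otherwise print.
import os
import subprocess
import sys

def module_is_importable(python_exec, module):
    """Return True if `module` imports cleanly under `python_exec`."""
    with open(os.devnull, 'w') as devnull:
        try:
            # Any ImportError traceback goes to /dev/null instead of the console.
            subprocess.check_output(
                [python_exec, "-c", "import %s" % module],
                stderr=devnull)
            return True
        except subprocess.CalledProcessError:
            return False

print(module_is_importable(sys.executable, "os"))                    # True
print(module_is_importable(sys.executable, "no_such_module_xyz"))    # False
```

Without `stderr=devnull`, the child's traceback would interleave with the test runner's own output, which is exactly the noise shown above.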
Test build #86929 has finished for PR 20473 at commit
Current Jenkins output was:
Test build #86930 has finished for PR 20473 at commit
LGTM.
As a future enhancement, perhaps we should also check the version loaded, e.g. the pyarrow version.
python/run-tests.py
Outdated
    subprocess_check_output(
        [python_exec, "-c", "import pyarrow"],
        stderr=open(os.devnull, 'w'))
except:
How about we also explicitly mention that pyarrow/pandas related tests will run if they are installed?
Actually, since we are here, is it possible to do the same thing as
spark/python/pyspark/sql/tests.py
Lines 51 to 63 in ec63e2d
_have_pandas = False
_have_old_pandas = False
try:
    import pandas
    try:
        from pyspark.sql.utils import require_minimum_pandas_version
        require_minimum_pandas_version()
        _have_pandas = True
    except:
        _have_old_pandas = True
except:
    # No Pandas, but that's okay, we'll skip those tests
    pass
spark/python/pyspark/sql/tests.py
Lines 78 to 84 in ec63e2d
_have_arrow = False
try:
    import pyarrow
    _have_arrow = True
except:
    # No Arrow, but that's okay, we'll skip those tests
    pass
It would be nice to use the same logic. Otherwise, even if we do not print the warning here, tests may still get skipped because of a version issue.
Ah, hm. I believe we don't have access to our main pyspark here. Let me check if I can address your concern today (or late tonight KST).
#20473 (comment) is easy but I think #20473 (comment) makes things complicated.
Let me try it and show how it looks.
Thank you. Appreciate it.
python/run-tests.py
Outdated
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
I think I can't easily reuse require_minimum_pandas_version or require_minimum_pyarrow_version since it looks like access to our main pyspark isn't guaranteed here. I tried to address the comments as best I could. Also updated the PR description. Please check the logs above.
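For illustration, the subprocess-based version probe being discussed can be sketched standalone. The helper names below are mine, not Spark's; the real run-tests.py compares versions with distutils.version.LooseVersion, which the simple numeric-tuple compare here only approximates, and 'json' stands in for pyarrow so the sketch runs anywhere:

```python
# A hedged sketch of probing a package's __version__ in a target Python
# executable without importing pyspark itself, then comparing it against a
# pinned minimum version.
import os
import subprocess
import sys

MINIMUM_PYARROW_VERSION = '0.8.0'  # duplicated by hand; see the TODO in this PR

def probe_version(python_exec, module):
    """Return module.__version__ under `python_exec`, or None if the import fails."""
    with open(os.devnull, 'w') as devnull:
        try:
            return subprocess.check_output(
                [python_exec, "-c",
                 "import %s; print(%s.__version__)" % (module, module)],
                universal_newlines=True,
                stderr=devnull).strip()
        except subprocess.CalledProcessError:
            return None

def at_least(version, minimum):
    """Approximate LooseVersion comparison via numeric tuples."""
    parse = lambda v: tuple(int(p) for p in v.split('.') if p.isdigit())
    return parse(version) >= parse(minimum)

# 'json' has a __version__ in CPython, so it works as a stand-in for pyarrow.
version = probe_version(sys.executable, "json")
print(version, at_least(version or '0', MINIMUM_PYARROW_VERSION))
```

Running the probe in a subprocess is what makes it work against executables other than the one running run-tests.py itself, which is why the require_minimum_* helpers can't simply be imported here.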
Thanks for working on this! I like this idea.
LGTM except for some comments.
python/run-tests.py
Outdated
    [python_exec, "-c", "import pyarrow; print(pyarrow.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pyarrow_version) >= LooseVersion('0.8.0'):
Let's have 0.8.0 as a variable in this file, or hopefully somewhere global if possible.
yup.
Ah, hm .. I'm not sure of a good place to put them as globals. Let me just make a variable here, and leave a comment there too.
python/run-tests.py
Outdated
    [python_exec, "-c", "import pandas; print(pandas.__version__)"],
    universal_newlines=True,
    stderr=open(os.devnull, 'w')).strip()
if LooseVersion(pandas_version) >= LooseVersion('0.19.2'):
ditto.
Test build #86965 has finished for PR 20473 at commit
The message seems much clearer now :D
BTW @ueshin, I think we should use spark/python/pyspark/sql/tests.py lines 78 to 84 in ec63e2d, because we declared >= 0.8.0 as you already know: line 204 in b8bfce5.
Would you like me to double-check other files and fix them separately, or would you be willing to fix them?
python/run-tests.py
Outdated
# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and
# explicitly prints out. See SPARK-23300.
if pyspark_sql in modules_to_test:
    # Hyukjin: I think here is not the best place to leave versions for extra dependencies.
that, I'd agree...
could we grep this https://github.com/apache/spark/blob/master/python/setup.py#L204
Not sure .. I was thinking of putting this in ./dev/sparktestsupport/modules.py too, but I believe this should be done separately. We should replace these too:
spark/python/pyspark/sql/utils.py
Line 120 in 12d20dd
if LooseVersion(pandas.__version__) < LooseVersion('0.19.2'):
spark/python/pyspark/sql/utils.py
Line 130 in 12d20dd
if LooseVersion(pyarrow.__version__) < LooseVersion('0.8.0'):
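Those utils.py checks follow a simple require_minimum_* guard pattern. A hedged, standalone sketch of it (the function name and message wording here are simplified versions of mine, and a numeric-tuple compare stands in for LooseVersion):

```python
# A simplified, standalone sketch of the require_minimum_* guard pattern in
# pyspark/sql/utils.py: compare an installed version against a pinned minimum
# and raise ImportError with a clear message if it is too old.
def require_minimum_version(name, installed, minimum):
    parse = lambda v: tuple(int(p) for p in v.split('.') if p.isdigit())
    if parse(installed) < parse(minimum):
        raise ImportError(
            "%s >= %s must be installed; however, version %s was found."
            % (name, minimum, installed))

require_minimum_version("Pandas", "0.20.2", "0.19.2")  # passes silently
try:
    require_minimum_version("PyArrow", "0.7.1", "0.8.0")
except ImportError as error:
    print(error)
```

Centralizing the pinned versions behind guards like this is what the follow-up discussed here eventually does, so the numbers live in one place instead of being repeated per call site.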
Test build #86972 has finished for PR 20473 at commit
@HyukjinKwon Good catch! Yeah, we should use it there. Could you fix it please? Thanks!
Will double-check and open a PR tonight ..
The printouts look great! I'm not sure if there is a better way to get the required version numbers, but this is probably good for now, thanks @HyukjinKwon!
retest this please
Test build #87046 has finished for PR 20473 at commit
Force-pushed from 4f77fc3 to fe2943e.
if pyspark_sql in modules_to_test:
    # TODO(HyukjinKwon): Relocate and deduplicate these version specifications.
    minimum_pyarrow_version = '0.8.0'
    minimum_pandas_version = '0.19.2'
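For context, these variables drive the "will test"/"will skip" messages seen in the Jenkins logs. A rough sketch of that wiring (the describe() helper and its plumbing are hypothetical; only the message wording comes from this PR's log output, and a numeric-tuple compare stands in for LooseVersion):

```python
# Hypothetical sketch of how the minimum-version variables feed the printed
# messages. describe() is my name, not Spark's; the message text mirrors the
# log output shown in this PR.
minimum_pyarrow_version = '0.8.0'
minimum_pandas_version = '0.19.2'

def describe(name, minimum, found, python_exec, module='pyspark-sql'):
    parse = lambda v: tuple(int(p) for p in v.split('.') if p.isdigit())
    if found is not None and parse(found) >= parse(minimum):
        return ("Will test %s related features against Python executable "
                "'%s' in '%s' module." % (name, python_exec, module))
    detail = ("%s was not found." % name if found is None
              else "%s %s was found." % (name, found))
    return ("Will skip %s related features against Python executable "
            "'%s' in '%s' module. %s >= %s is required; however, %s"
            % (name, python_exec, module, name, minimum, detail))

print(describe('PyArrow', minimum_pyarrow_version, '0.8.0', 'python2.7'))
print(describe('PyArrow', minimum_pyarrow_version, None, 'pypy'))
print(describe('Pandas', minimum_pandas_version, '0.13.0', 'python2.7'))
```

The three calls cover the three cases visible in the logs: new enough, not installed, and installed but too old.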
In the last commit:
- I only replaced minimal_pyarrow_version -> minimum_pyarrow_version and minimal_pandas_version -> minimum_pandas_version.
- Replaced the comment with # TODO(HyukjinKwon): Relocate and deduplicate these version specifications., to match [SPARK-23319][TESTS] Explicitly specify Pandas and PyArrow versions in PySpark tests (to skip or test) #20487.
Test build #87055 has finished for PR 20473 at commit
Test build #87056 has finished for PR 20473 at commit
Will merge this one if there are no more comments in a few days.
LGTM.
Merged to master.
Thank you @felixcheung, @yhuai, @ueshin and @BryanCutler for reviewing this.
Let me actually backport this to branch-2.3. I think there isn't any downside or harm in backporting it.
…r not in PySpark SQL tests

This PR proposes to log if PyArrow and Pandas are installed or not so we can check if related tests are going to be skipped or not.

Manually tested: I don't have PyArrow installed in PyPy.

```bash
$ ./run-tests --python-executables=python3
```

```
...
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python3' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python3' in 'pyspark-sql' module.
Starting test(python3): pyspark.mllib.tests
Starting test(python3): pyspark.sql.tests
Starting test(python3): pyspark.streaming.tests
Starting test(python3): pyspark.tests
```

```bash
$ ./run-tests --modules=pyspark-streaming
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-streaming']
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.streaming.util
Starting test(python2.7): pyspark.streaming.tests
Starting test(python2.7): pyspark.streaming.util
```

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
$ ./run-tests --modules=pyspark-sql --python-executables=pypy
```

```
...
Will test against the following Python executables: ['pypy']
Will test the following Python modules: ['pyspark-sql']
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 0.8.0 is required; however, PyArrow was not found.
Will test Pandas related features against Python executable 'pypy' in 'pyspark-sql' module.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.sql.catalog
Starting test(pypy): pyspark.sql.column
Starting test(pypy): pyspark.sql.conf
```

After some modification to produce other cases:

```bash
$ ./run-tests
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will skip PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow 0.8.0 was found.
Will skip Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.20.2 was found.
Will skip PyArrow related features against Python executable 'pypy' in 'pyspark-sql' module. PyArrow >= 20.0.0 is required; however, PyArrow was not found.
Will skip Pandas related features against Python executable 'pypy' in 'pyspark-sql' module. Pandas >= 20.0.0 is required; however, Pandas 0.22.0 was found.
Starting test(pypy): pyspark.sql.tests
Starting test(pypy): pyspark.streaming.tests
Starting test(pypy): pyspark.tests
Starting test(python2.7): pyspark.mllib.tests
```

```bash
./run-tests-with-coverage
```

```
...
Will test against the following Python executables: ['python2.7', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Will test PyArrow related features against Python executable 'python2.7' in 'pyspark-sql' module.
Will test Pandas related features against Python executable 'python2.7' in 'pyspark-sql' module.
Coverage is not installed in Python executable 'pypy' but 'COVERAGE_PROCESS_START' environment variable is set, exiting.
```

Author: hyukjinkwon <[email protected]>

Closes apache#20473 from HyukjinKwon/SPARK-23300.
… installed or not in PySpark SQL tests

This PR backports #20473 to branch-2.3.

Author: hyukjinkwon <[email protected]>

Closes #20533 from HyukjinKwon/backport-20473.
What changes were proposed in this pull request?
This PR proposes to log if PyArrow and Pandas are installed or not so we can check if related tests are going to be skipped or not.
How was this patch tested?
Manually tested:
I don't have PyArrow installed in PyPy.
After some modification to produce other cases: