[SPARK-7721][PYTHON][TESTS] Adds PySpark coverage generation script #20204

Closed
wants to merge 3 commits

Conversation

@HyukjinKwon (Member) commented Jan 9, 2018

What changes were proposed in this pull request?

Note that this PR was made on top of #20151, so it leaves the main code almost intact.

This PR proposes to add a script to prepare for automatic PySpark coverage generation. Currently, it is difficult to check the actual coverage for PySpark. This script lets us run tests the same way we do via the run-tests script. The usage is exactly the same as the run-tests script, which this basically wraps.

This script and this PR alone should also be useful. I was asked how to run this before, and it seems some reviewers (including me) need this. It would also be useful to run it manually.

This usually requires only a small diff in normal Python projects, but the PySpark case is a bit different because we are unable to track coverage after the worker process is forked. So, here, I made a custom worker that forces coverage collection, built on top of #20151.

I made a simple demo. Please take a look - https://spark-test.github.io/pyspark-coverage-site.

To show the structure, this PR adds the files below:

python
├── .coveragerc  # Runtime configuration when we run the script.
├── run-tests-with-coverage  # The script that has coverage support and wraps run-tests script.
└── test_coverage  # Directory containing the files required when running coverage.
    ├── conf
    │   └── spark-defaults.conf  # Contains the configuration 'spark.python.daemon.module'.
    ├── coverage_daemon.py  # A daemon with a custom fix that wraps our daemon.py
    └── sitecustomize.py  # Initiate coverage with COVERAGE_PROCESS_START
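
A rough sketch of how these pieces cooperate, assuming the standard coverage.py subprocess-measurement mechanism (the paths below are placeholders, not the exact values used by the script):

```python
# Hedged sketch of the standard coverage.py subprocess mechanism; paths are placeholders.
import os
import subprocess

env = dict(os.environ)
# Point coverage at its runtime configuration.
env["COVERAGE_PROCESS_START"] = "/path/to/spark/python/.coveragerc"
# Make the directory containing sitecustomize.py importable in child processes.
env["PYTHONPATH"] = "/path/to/spark/python/test_coverage" + os.pathsep + env.get("PYTHONPATH", "")

# Any Python process launched with this environment imports sitecustomize at startup;
# its coverage.process_startup() call then starts measurement because
# COVERAGE_PROCESS_START is set.
subprocess.check_call(["python3", "-c", "print('this process would be measured')"], env=env)
```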

Note that this PR has a minor nit:

This scope in daemon.py is not included in the coverage results because I produce the coverage results in worker.py separately and then merge them. I believe it's not a big deal.

In a follow-up, I might set up a site that hosts a single up-to-date PySpark coverage report from the master branch as the fallback/default, or a site that hosts multiple PySpark coverage reports, with the site link left on each pull request.

How was this patch tested?

Manually tested. Usage is the same as the existing Python test script, ./python/run-tests. For example:

sh run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql

Running this will generate HTMLs under ./python/test_coverage/htmlcov.
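
If you want to browse the generated HTMLs locally, one option (plain Python standard library, not part of this PR) is to serve the directory over HTTP:

```python
# Serve the generated coverage HTML report locally; the directory path assumes
# this is run from the Spark repository root.
import functools
import http.server

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler,
    directory="python/test_coverage/htmlcov",
)
http.server.HTTPServer(("localhost", 8000), handler).serve_forever()
```

Then open http://localhost:8000, or simply open htmlcov/index.html directly in a browser.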

Console output example:

sh run-tests-with-coverage --python-executables=python3,python --modules=pyspark-core
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3', 'python']
Will test the following Python modules: ['pyspark-core']
Starting test(python): pyspark.tests
Starting test(python3): pyspark.tests
...
Tests passed in 231 seconds
Combining collected coverage data under /.../spark/python/test_coverage/coverage_data
Reporting the coverage data at /...spark/python/test_coverage/coverage_data/coverage
Name                         Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------
pyspark/__init__.py             41      0      8      2    96%
...
pyspark/profiler.py             74     11     22      5    83%
pyspark/rdd.py                 871     40    303     32    93%
pyspark/rddsampler.py           68     10     32      2    82%
...
--------------------------------------------------------------
TOTAL                         8521   3077   2748    191    59%
Generating HTML files for PySpark coverage under /.../spark/python/test_coverage/htmlcov

@@ -175,6 +175,9 @@ def main():

    task_queue = Queue.PriorityQueue()
    for python_exec in python_execs:
        if "COVERAGE_PROCESS_START" in os.environ:
            # Make sure coverage is installed.
            run_cmd([python_exec, "-c", "import coverage"])

@HyukjinKwon (Member Author):

Manually tested:

./run-tests-with-coverage --python-executables=python3
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'foo'
[error] running python3 -c import foo ; received return code 1

@HyukjinKwon (Member Author)

cc @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Here, I made it based on #20151. Could you take a look at this one too? I can make this PR separate from #20151 by manually adding the custom fix into worker.py if requested.

@HyukjinKwon (Member Author)

cc @icexelloss too. I think we had a small talk about this before.

@icexelloss (Contributor)

This is awesome. Thanks @HyukjinKwon!

@HyukjinKwon HyukjinKwon force-pushed the python-coverage branch 2 times, most recently from 3c3c3cb to dadcae4 Compare January 9, 2018 18:05
@HyukjinKwon (Member Author) commented Jan 9, 2018

Sorry for the noise. I did some cleanup and just rebased.

@SparkQA commented Jan 9, 2018

Test build #85856 has finished for PR 20204 at commit 9f2c400.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85855 has finished for PR 20204 at commit a3179d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85858 has finished for PR 20204 at commit 3c3c3cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85862 has finished for PR 20204 at commit dadcae4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss (Contributor)

I usually run tests like this:

SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ...

Does it make sense to expose a similar usage for enabling coverage? Something like:

SPARK_TESTING_WITH_COVERAGE=1 bin/pyspark pyspark.sql.tests ...
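
For illustration only, the gating idea above could look roughly like this; SPARK_TESTING_WITH_COVERAGE is the hypothetical flag from this comment, not an existing Spark setting, and this is not what the PR implements:

```python
# Hypothetical opt-in flag sketching the suggestion above.
import atexit
import os

if os.environ.get("SPARK_TESTING_WITH_COVERAGE") == "1":
    import coverage

    # Start collecting coverage only when explicitly requested, so the usual
    # SPARK_TESTING workflow is unaffected; save the data when the process exits.
    cov = coverage.Coverage()
    cov.start()
    atexit.register(lambda: (cov.stop(), cov.save()))
```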

@felixcheung (Member) left a comment

Looks reasonable to me.
How are we going to track changes to the coverage numbers?

@HyukjinKwon (Member Author) commented Jan 10, 2018

If anyone is worried about the elapsed time, it seems generally fine:

Before:

sh run-tests --python-executables=python3 --modules=pyspark-sql
...
Finished test(python3): pyspark.sql.readwriter (34s)
Finished test(python3): pyspark.sql.tests (241s)
Tests passed in 241 seconds

After:

sh run-tests-with-coverage  --python-executables=python3 --modules=pyspark-sql
...
Finished test(python3): pyspark.sql.conf (288s)
Finished test(python3): pyspark.sql.tests (288s)
Tests passed in 288 seconds

Given that a full test run usually ends with "Tests passed in 1103 seconds", I think it will only add about three more minutes even if we run this with all Python executables.
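
As a back-of-the-envelope check of that estimate, using only the numbers quoted in this thread:

```python
# Rough arithmetic behind the "about three more minutes" estimate; these are the
# numbers quoted above, not new measurements.
without_coverage = 241   # seconds: pyspark-sql via run-tests
with_coverage = 288      # seconds: pyspark-sql via run-tests-with-coverage
overhead_ratio = (with_coverage - without_coverage) / without_coverage

full_run = 1103          # seconds: a typical full run-tests pass
extra_minutes = full_run * overhead_ratio / 60
print(f"Overhead: {overhead_ratio:.0%}; extrapolated extra time: ~{extra_minutes:.1f} min")
# Prints roughly: Overhead: 20%; extrapolated extra time: ~3.6 min
```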

@HyukjinKwon (Member Author)

@icexelloss, for #20204 (comment), yes, that's the way I usually use it too. My worry, though, is whether this is a proper official way to do it, because I have been thinking this way is rather meant to be internal.

For this reason, I left a comment in this script for now:

+# If you'd like to run a specific unittest class, you could do such as
+# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
+./run-tests $@

If I remember correctly, I think I also had a short talk about this with @nchammas and @BryanCutler before (I think it was about more detailed control, though).

I think we'd better fix the run-tests script to accept a unittest class as an option. I took a look at this before, and I guess it won't be too difficult to introduce another option. Once we fix that, it will also be available in this script too, because the script here wraps run-tests.

I will probably take another look and open a separate PR (maybe within the next week?).

@HyukjinKwon (Member Author)

Looks reasonable to me.
How are we going to track changes to the coverage numbers?

Yeah, that's a good point. I would like to do that too because other tools allow it. I think this is another area I should investigate more when I actually take a shot at integrating it with Jenkins. I will cc you on the JIRA, share my investigation, and try to do this in the follow-up I mentioned in the PR description.

import imp


# This is a hack to always refer to the main code rather than the built zip.

@HyukjinKwon (Member Author):

So, we will always track the coverage in the main code under python/pyspark, not in the built zip of the Python library.
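
A hedged sketch of what such a hack could look like; the layout and names below are placeholders rather than the exact contents of coverage_daemon.py:

```python
# Load pyspark/daemon.py straight from the source tree so coverage is reported
# against python/pyspark, not against the built zip on sys.path. Placeholder paths.
import imp  # deprecated in newer Pythons, but matches the import shown above
import os

# Assume this file lives two directory levels below spark/python/.
main_code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
daemon = imp.load_source("daemon", os.path.join(main_code_dir, "pyspark", "daemon.py"))

if __name__ == "__main__":
    daemon.manager()  # hand off to the real daemon loop (entry-point name assumed)
```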

@BryanCutler (Member) left a comment

This is great, thanks for working on it!

#

import coverage
coverage.process_startup()

Member:

would you mind explaining how this file is called and used? I can't quite figure it out..

@HyukjinKwon (Member Author):

Yup, will add some comments.
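
For what it's worth, the mechanism is standard coverage.py plus CPython behavior rather than anything PySpark-specific; a commented sketch of those same two lines, under that assumption:

```python
# Python's `site` module imports a module named `sitecustomize` automatically at
# interpreter startup whenever one is importable (for example via PYTHONPATH).
# coverage.process_startup() is a no-op unless the COVERAGE_PROCESS_START
# environment variable points at a coverage configuration file, in which case it
# starts measuring the current process.
import coverage

coverage.process_startup()
```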

@icexelloss (Contributor)

@icexelloss, for #20204 (comment), yes, that's the way I usually use it too. My worry, though, is whether this is a proper official way to do it, because I have been thinking this way is rather meant to be internal.

Sounds good to me. Thanks!

@SparkQA commented Jan 11, 2018

Test build #85969 has finished for PR 20204 at commit e8e7112.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Jan 12, 2018

Test build #85994 has finished for PR 20204 at commit e8e7112.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.


# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests $@

Member:

nit: "$@" instead of $@, just in case.

@HyukjinKwon (Member Author):

Yeap, will update.

@SparkQA commented Jan 13, 2018

Test build #86096 has finished for PR 20204 at commit df9af13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

Hi @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Could you take another look please when you guys have some time?

@yhuai, I believe this is nice to have to prevent mistakes like the previous Kafka and Flume one.

@ueshin (Member) left a comment

LGTM except for one comment.

# and generate HTMLs, for example, 'coverage3' in Python 3.
COV_EXEC="${COV_EXEC:-coverage}"
FWDIR="$(cd "`dirname $0`"; pwd)"
pushd "$FWDIR" > /dev/null

Member:

Do we need to use pushd and its corresponding popd at the end of this file? I guess we can simply use cd here.

Member:

My 2c: I think it's ok, I'd prefer it; it might be useful in the future when more cd calls are added.

Member:

I see, no problem at all. I just wanted to confirm. Thanks!

@felixcheung (Member) left a comment

LGTM
Perhaps it would be great to follow up with some steps on how to run it, what to look for, and how to diff changes :)
But this is a very dev-centric thing and you have a lot of comments inline, so I'd say we are good here.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Jan 18, 2018

Test build #86310 has finished for PR 20204 at commit df9af13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests "$@"

@HyukjinKwon (Member Author):

Another tip: if we use ../bin/pyspark here, run some simple tests, and then exit, it still appears to produce the coverage correctly.

@HyukjinKwon (Member Author)

Will merge this one if there are no more comments in a few days.

@HyukjinKwon (Member Author)

Merged to master.

@asfgit asfgit closed this in 87ffe7a Jan 22, 2018
@HyukjinKwon HyukjinKwon deleted the python-coverage branch October 16, 2018 12:44
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… via Jenkins

## What changes were proposed in this pull request?

### Background

For the current status: the test script that generates coverage information has been merged into Spark (apache#20204).

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like `run-tests` script in `./python`.

### Proposed change

The next step is to host this coverage report via `github.io`, updated automatically by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).

In that specific job, it will clone the site and rebuild the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to gh-pages.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So, a single up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft HyukjinKwon
- [x] `pip install coverage` for all Python implementations (pypy, python2, python3) in Jenkins workers - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature
 This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp
- [x] Make PR builder's test passed HyukjinKwon
- [x] Fix flaky test related with coverage HyukjinKwon
  -  6 consecutive passes out of 7 runs

This PR will be co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes apache#23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: shane knapp <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jan 23, 2024
…ge script

### What changes were proposed in this pull request?

This PR cleans up the obsolete code in PySpark coverage script

### Why are the changes needed?

We used to use `coverage_daemon.py` for Python workers to track coverage on the Python worker side (e.g., coverage within Python UDFs), added in #20204. However, it seems it does not work anymore; in fact, it stopped working multiple years ago. The approach of replacing the Python worker itself was a bit of a hacky workaround. We should just get rid of it first, and find a proper way.

This should also deflake the scheduled jobs, and speed up the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
./run-tests-with-coverage --python-executables=python3 --testname="pyspark.sql.functions.builtin"
```

```
...
Finished test(python3): pyspark.sql.tests.test_functions (87s)
Tests passed in 87 seconds
Combining collected coverage data under
...
Creating XML report file at python/coverage.xml
Wrote XML report to coverage.xml
Reporting the coverage data at /.../spark/python/test_coverage/coverage_data/coverage
Name                                    Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------
pyspark/__init__.py                        48      7     10      3    76%
pyspark/_globals.py                        16      3      4      2    75%
pyspark/accumulators.py                   123     38     26      5    66%
pyspark/broadcast.py                      121     79     40      3    33%
pyspark/conf.py                            99     33     50      5    64%
pyspark/context.py                        451    216    151     26    51%
pyspark/errors/__init__.py                  3      0      0      0   100%
pyspark/errors/error_classes.py             3      0      0      0   100%
pyspark/errors/exceptions/__init__.py       0      0      0      0   100%
pyspark/errors/exceptions/base.py          91     15     24      4    83%
pyspark/errors/exceptions/captured.py     168     81     57     17    48%
pyspark/errors/utils.py                    34      8      6      2    70%
pyspark/files.py                           34     15     12      3    57%
pyspark/find_spark_home.py                 30     24     12      2    19%
pyspark/java_gateway.py                   114     31     30     12    69%
pyspark/join.py                            66     58     58      0     6%
pyspark/profiler.py                       244    182     92      3    22%
pyspark/rdd.py                           1064    741    378      9    27%
pyspark/rddsampler.py                      68     50     32      0    18%
pyspark/resource/__init__.py                5      0      0      0   100%
pyspark/resource/information.py            11      4      4      0    73%
pyspark/resource/profile.py               110     82     58      1    27%
pyspark/resource/requests.py              139     90     70      0    35%
pyspark/resultiterable.py                  14      6      2      1    56%
pyspark/serializers.py                    349    185     90     13    43%
pyspark/shuffle.py                        397    322    180      1    13%
pyspark/sql/__init__.py                    14      0      0      0   100%
pyspark/sql/catalog.py                    203    127     66      2    30%
pyspark/sql/column.py                     268     78     64     12    67%
pyspark/sql/conf.py                        40     16     10      3    58%
pyspark/sql/context.py                    170     95     58      2    47%
pyspark/sql/dataframe.py                  900    475    459     40    45%
pyspark/sql/functions/__init__.py           3      0      0      0   100%
pyspark/sql/functions/builtin.py         1741    542   1126     26    76%
pyspark/sql/functions/partitioning.py      41     19     18      3    59%
pyspark/sql/group.py                       81     30     32      3    65%
pyspark/sql/observation.py                 54     37     22      1    26%
pyspark/sql/pandas/__init__.py              1      0      0      0   100%
pyspark/sql/pandas/conversion.py          277    249    156      2     8%
pyspark/sql/pandas/functions.py            67     49     34      0    18%
pyspark/sql/pandas/group_ops.py            89     65     22      2    25%
pyspark/sql/pandas/map_ops.py              37     27     10      2    26%
pyspark/sql/pandas/serializers.py         381    323    172      0    10%
pyspark/sql/pandas/typehints.py            41     32     26      1    15%
pyspark/sql/pandas/types.py               407    383    326      1     3%
pyspark/sql/pandas/utils.py                29     11     10      5    59%
pyspark/sql/profiler.py                    80     47     54      1    39%
pyspark/sql/readwriter.py                 362    253    146      7    27%
pyspark/sql/session.py                    469    206    228     22    56%
pyspark/sql/sql_formatter.py               41     26     16      1    28%
pyspark/sql/streaming/__init__.py           4      0      0      0   100%
pyspark/sql/streaming/listener.py         400    200    186      1    61%
pyspark/sql/streaming/query.py            102     63     40      1    39%
pyspark/sql/streaming/readwriter.py       268    207    118      2    21%
pyspark/sql/streaming/state.py            100     68     44      0    29%
pyspark/sql/tests/__init__.py               0      0      0      0   100%
pyspark/sql/tests/test_functions.py       646      2    244      7    99%
pyspark/sql/types.py                     1013    355    528     74    62%
pyspark/sql/udf.py                        240    132     90     20    42%
pyspark/sql/udtf.py                       152     98     52      2    33%
pyspark/sql/utils.py                      160     83     54     10    45%
pyspark/sql/window.py                      89     23     56      5    77%
pyspark/statcounter.py                     79     58     20      0    21%
pyspark/status.py                          36     13      6      0    55%
pyspark/storagelevel.py                    41      9      0      0    78%
pyspark/taskcontext.py                    111     63     40      1    40%
pyspark/testing/__init__.py                 2      0      0      0   100%
pyspark/testing/sqlutils.py               149     44     52      1    75%
pyspark/testing/utils.py                  312    238    162      2    17%
pyspark/traceback_utils.py                 38      4     14      6    81%
pyspark/util.py                           153    120     56      2    18%
pyspark/version.py                          1      0      0      0   100%
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44842 from HyukjinKwon/SPARK-46802.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>