[SPARK-7721][PYTHON][TESTS] Adds PySpark coverage generation script #20204
Conversation
@@ -175,6 +175,9 @@ def main():

    task_queue = Queue.PriorityQueue()
    for python_exec in python_execs:
        if "COVERAGE_PROCESS_START" in os.environ:
            # Make sure if coverage is installed.
            run_cmd([python_exec, "-c", "import coverage"])
Manually tested:
./run-tests-with-coverage --python-executables=python3
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'foo'
[error] running python3 -c import foo ; received return code 1
cc @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Here, I made it based on #20151. Could you take a look at this one too? I can make this PR separate from #20151 by manually adding the custom fix into
cc @icexelloss too. I think we had a short talk about this before.
This is awesome. Thanks @HyukjinKwon!
Force-pushed 3c3c3cb to dadcae4
Sorry for the noise. I made some cleanup and just rebased.
Test build #85856 has finished for PR 20204 at commit

Test build #85855 has finished for PR 20204 at commit

Test build #85858 has finished for PR 20204 at commit

Test build #85862 has finished for PR 20204 at commit
I usually run tests like this:

Does it make sense to expose similar usage for enabling coverage? Something like:
Looks reasonable to me. How are we going to track changes to the coverage number?
If anyone is worried about the elapsed time, I think it seems generally fine.

Before:

After:

Given that a full test run usually takes
@icexelloss, for #20204 (comment), yes, that's the way I usually use too. My worry, though, is whether this is the proper official way to do it, because I have been thinking this way is rather meant to be internal. For this reason, I left a comment in this script for now:

+# If you'd like to run a specific unittest class, you could do such as
+# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
+./run-tests $@

If I remember correctly, I think I also had a short talk about it with @nchammas and @BryanCutler before (I think it was about more detailed control, though). I think we should better fix

Will probably try to take a look again and open another PR separately (maybe within the following week?)
Yea, that's a good point. I would like to do it too because other tools allow it. I think this is another place where I should investigate more when I actually give integrating it with Jenkins a shot. Will cc you in the JIRA, share my investigation, and try to do this in the followup I mentioned in the PR description.
import imp


# This is a hack to always refer the main code rather than built zip.
So, we will always track the coverage in the main code under python/pyspark, not in the built zip Python library.
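For readers following along, a minimal sketch of what such a hack could look like (illustrative only; the `main_code_dir` resolution and the `daemon.manager()` entry point are assumptions on my side, not necessarily the exact contents of the PR's `coverage_daemon.py`):

```python
import imp
import os

# Resolve the python/ source directory relative to this file so that the daemon
# module is loaded from python/pyspark/daemon.py rather than the packaged zip,
# letting coverage attribute hits to the main source tree.
main_code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
daemon = imp.load_source("daemon", os.path.join(main_code_dir, "pyspark", "daemon.py"))

if __name__ == "__main__":
    # Assumes pyspark/daemon.py exposes a manager() entry point, as when it is
    # normally run via `python -m pyspark.daemon`.
    daemon.manager()
```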
This is great, thanks for working on it!
#

import coverage
coverage.process_startup()
Would you mind explaining how this file is called and used? I can't quite figure it out.
Yup, will add some comments.
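To sketch the mechanism being asked about (a hedged explanation, not the exact comments later added in the PR): Python automatically imports a module named `sitecustomize` at interpreter startup if one is found on `sys.path`, so putting this file on the workers' `PYTHONPATH` makes every Python process run it. `coverage.process_startup()` then starts measurement in that process only when the `COVERAGE_PROCESS_START` environment variable points to a coverage configuration file, so ordinary runs are unaffected.

```python
# sitecustomize.py (illustrative): imported automatically by Python at startup
# when the directory containing it is on sys.path / PYTHONPATH.
import coverage

# No-op unless COVERAGE_PROCESS_START is set to the path of a coverage config
# file, in which case coverage measurement begins in this (sub)process.
coverage.process_startup()
```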
Sounds good to me. Thanks!
Force-pushed dadcae4 to e8e7112
Test build #85969 has finished for PR 20204 at commit

retest this please

Test build #85994 has finished for PR 20204 at commit
python/run-tests-with-coverage
Outdated

# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests $@
nit: "$@" instead of $@, just in case.
Yeap, will update.
Test build #86096 has finished for PR 20204 at commit
Hi @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Could you take another look when you have some time? @yhuai, I believe this is nice to have to prevent previous mistakes like the Kafka and Flume ones.
LGTM except for one comment.
# and generate HTMLs, for example, 'coverage3' in Python 3.
COV_EXEC="${COV_EXEC:-coverage}"
FWDIR="$(cd "`dirname $0`"; pwd)"
pushd "$FWDIR" > /dev/null
Do we need to use pushd and its corresponding popd at the end of this file? I guess we can simply use cd here.
My 2c: I think it's OK, and I'd prefer it; it might be useful in the future when more cd calls are added.
I see, no problem at all. I just wanted to confirm. Thanks!
LGTM

Perhaps it would be great to follow up with some steps on how to run it, what to look for, and how to diff changes :)

But this is a very dev-centric thing and you have a lot of comments in-line, so I'd say we are good here.
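Along those lines, here is one hypothetical way to do the "what to look for / how to diff changes" part with coverage.py's Python API. This is only a sketch, under the assumption that the combined data file ends up at `python/test_coverage/coverage_data/coverage` (the path shown in the console output quoted later in this thread):

```python
import coverage

# Load the combined coverage data produced by run-tests-with-coverage
# (the data-file path is an assumption, not something this PR documents).
cov = coverage.Coverage(data_file="test_coverage/coverage_data/coverage")
cov.load()

# report() prints the per-module table and returns the total percentage,
# which can be stored and compared between runs to track coverage changes.
total = cov.report()
print("Total PySpark coverage: %.2f%%" % total)

# Optionally regenerate the HTML site (run-tests-with-coverage already does this).
cov.html_report(directory="test_coverage/htmlcov")
```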
retest this please
Test build #86310 has finished for PR 20204 at commit
# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests "$@"
Another tip: if we use ../bin/pyspark here, do some simple tests, and then exit, it still seems to produce the coverage correctly.
Will merge this one if there are no more comments in a few days.
Merged to master.
… via Jenkins

## What changes were proposed in this pull request?

### Background

As of the current status, the test script that generates coverage information was merged into Spark in apache#20204, so we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like the `run-tests` script in `./python`.

### Proposed change

The next step is to host this coverage report via `github.io` automatically by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/). In that specific job, it will clone the page and rebase the up-to-date PySpark test coverage from the latest commit. For instance, as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to master.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So, a single up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft (HyukjinKwon)
- [x] `pip install coverage` for all Python implementations (pypy, python2, python3) in Jenkins workers (shaneknapp)
- [x] Set the hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature. This should be set in both the PR builder and `spark-master-test-sbt-hadoop-2.7` so that other PRs can later test and fix the bugs (shaneknapp)
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site (shaneknapp)
- [x] Make the PR builder's tests pass (HyukjinKwon)
- [x] Fix the flaky test related to coverage (HyukjinKwon) - 6 consecutive passes out of 7 runs

This PR will be co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes apache#23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: shane knapp <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…ge script

### What changes were proposed in this pull request?

This PR cleans up the obsolete code in the PySpark coverage script.

### Why are the changes needed?

We used to use `coverage_daemon.py` for Python workers to track the coverage of the Python worker side (e.g., the coverage within Python UDFs), added in #20204. However, it seems it does not work anymore. In fact, it has been multiple years since it stopped working.

The approach of replacing the Python worker itself was a bit of a hacky workaround. We should just get rid of it first, and find a proper way. This should also deflake the scheduled jobs and speed up the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via:

```
./run-tests-with-coverage --python-executables=python3 --testname="pyspark.sql.functions.builtin"
```

```
...
Finished test(python3): pyspark.sql.tests.test_functions (87s)
Tests passed in 87 seconds
Combining collected coverage data under ...
Creating XML report file at python/coverage.xml
Wrote XML report to coverage.xml
Reporting the coverage data at /.../spark/python/test_coverage/coverage_data/coverage
Name                                     Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------------------
pyspark/__init__.py                         48      7     10      3    76%
pyspark/_globals.py                         16      3      4      2    75%
pyspark/accumulators.py                    123     38     26      5    66%
pyspark/broadcast.py                       121     79     40      3    33%
pyspark/conf.py                             99     33     50      5    64%
pyspark/context.py                         451    216    151     26    51%
pyspark/errors/__init__.py                   3      0      0      0   100%
pyspark/errors/error_classes.py              3      0      0      0   100%
pyspark/errors/exceptions/__init__.py        0      0      0      0   100%
pyspark/errors/exceptions/base.py           91     15     24      4    83%
pyspark/errors/exceptions/captured.py      168     81     57     17    48%
pyspark/errors/utils.py                     34      8      6      2    70%
pyspark/files.py                            34     15     12      3    57%
pyspark/find_spark_home.py                  30     24     12      2    19%
pyspark/java_gateway.py                    114     31     30     12    69%
pyspark/join.py                             66     58     58      0     6%
pyspark/profiler.py                        244    182     92      3    22%
pyspark/rdd.py                            1064    741    378      9    27%
pyspark/rddsampler.py                       68     50     32      0    18%
pyspark/resource/__init__.py                 5      0      0      0   100%
pyspark/resource/information.py             11      4      4      0    73%
pyspark/resource/profile.py                110     82     58      1    27%
pyspark/resource/requests.py               139     90     70      0    35%
pyspark/resultiterable.py                   14      6      2      1    56%
pyspark/serializers.py                     349    185     90     13    43%
pyspark/shuffle.py                         397    322    180      1    13%
pyspark/sql/__init__.py                     14      0      0      0   100%
pyspark/sql/catalog.py                     203    127     66      2    30%
pyspark/sql/column.py                      268     78     64     12    67%
pyspark/sql/conf.py                         40     16     10      3    58%
pyspark/sql/context.py                     170     95     58      2    47%
pyspark/sql/dataframe.py                   900    475    459     40    45%
pyspark/sql/functions/__init__.py            3      0      0      0   100%
pyspark/sql/functions/builtin.py          1741    542   1126     26    76%
pyspark/sql/functions/partitioning.py       41     19     18      3    59%
pyspark/sql/group.py                        81     30     32      3    65%
pyspark/sql/observation.py                  54     37     22      1    26%
pyspark/sql/pandas/__init__.py               1      0      0      0   100%
pyspark/sql/pandas/conversion.py           277    249    156      2     8%
pyspark/sql/pandas/functions.py             67     49     34      0    18%
pyspark/sql/pandas/group_ops.py             89     65     22      2    25%
pyspark/sql/pandas/map_ops.py               37     27     10      2    26%
pyspark/sql/pandas/serializers.py          381    323    172      0    10%
pyspark/sql/pandas/typehints.py             41     32     26      1    15%
pyspark/sql/pandas/types.py                407    383    326      1     3%
pyspark/sql/pandas/utils.py                 29     11     10      5    59%
pyspark/sql/profiler.py                     80     47     54      1    39%
pyspark/sql/readwriter.py                  362    253    146      7    27%
pyspark/sql/session.py                     469    206    228     22    56%
pyspark/sql/sql_formatter.py                41     26     16      1    28%
pyspark/sql/streaming/__init__.py            4      0      0      0   100%
pyspark/sql/streaming/listener.py          400    200    186      1    61%
pyspark/sql/streaming/query.py             102     63     40      1    39%
pyspark/sql/streaming/readwriter.py        268    207    118      2    21%
pyspark/sql/streaming/state.py             100     68     44      0    29%
pyspark/sql/tests/__init__.py                0      0      0      0   100%
pyspark/sql/tests/test_functions.py        646      2    244      7    99%
pyspark/sql/types.py                      1013    355    528     74    62%
pyspark/sql/udf.py                         240    132     90     20    42%
pyspark/sql/udtf.py                        152     98     52      2    33%
pyspark/sql/utils.py                       160     83     54     10    45%
pyspark/sql/window.py                       89     23     56      5    77%
pyspark/statcounter.py                      79     58     20      0    21%
pyspark/status.py                           36     13      6      0    55%
pyspark/storagelevel.py                     41      9      0      0    78%
pyspark/taskcontext.py                     111     63     40      1    40%
pyspark/testing/__init__.py                  2      0      0      0   100%
pyspark/testing/sqlutils.py                149     44     52      1    75%
pyspark/testing/utils.py                   312    238    162      2    17%
pyspark/traceback_utils.py                  38      4     14      6    81%
pyspark/util.py                            153    120     56      2    18%
pyspark/version.py                           1      0      0      0   100%
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44842 from HyukjinKwon/SPARK-46802.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Note that this PR was made on top of #20151, so it leaves the main code almost intact.
This PR proposes to add a script in preparation for automatic PySpark coverage generation. Currently, it is difficult to check the actual coverage in the case of PySpark. This script allows running tests the way we did via the run-tests script before. The usage is exactly the same as the run-tests script, as this basically wraps it.

This script and PR alone should also be useful. I was asked about how to run this before, and it seems some reviewers (including me) need this. It would also be useful to run it manually.
It usually requires only a small diff in normal Python projects, but the PySpark case is a bit different because apparently we are unable to track the coverage after the process is forked. So, here, I made a custom worker that forces the coverage, based on top of #20151.
I made a simple demo. Please take a look - https://spark-test.github.io/pyspark-coverage-site.
To show the structure, this PR adds the files as below:
Note that this PR has a minor nit: this scope in daemon.py is not in the coverage results, as basically I am producing the coverage results in worker.py separately and then merging them. I believe it's not a big deal.

In a followup, I might have a site that has a single up-to-date PySpark coverage from the master branch as the fallback/default, or have a site that has multiple PySpark coverages where the site link is left on each pull request.
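For the "producing the coverage results in worker.py separately and then merging" step, here is a hedged sketch of how such a merge can be expressed with coverage.py's `combine` API. It is illustrative only; the script in this PR may use the `coverage combine` command line instead, and the data-file path is an assumption:

```python
import coverage

# The workers write parallel data files that share this data-file prefix
# (coverage appends machine/process suffixes such as ".hostname.pid.random").
cov = coverage.Coverage(data_file="test_coverage/coverage_data/coverage")

# Merge all matching parallel data files into a single data set, then render
# the HTML report that run-tests-with-coverage publishes under htmlcov/.
cov.combine()
cov.html_report(directory="test_coverage/htmlcov")
```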
How was this patch tested?
Manually tested. Usage is the same as the existing Python test script, ./python/run-tests. For example, running this will generate HTMLs under ./python/test_coverage/htmlcov.

Console output example: