[SPARK-7721][PYTHON][TESTS] Adds PySpark coverage generation script #20204

Closed
wants to merge 3 commits

Conversation

@HyukjinKwon (Member) commented Jan 9, 2018

What changes were proposed in this pull request?

Note that this PR was made on top of #20151, so it leaves the main code almost intact.

This PR proposes to add a script to prepare for automatic PySpark coverage generation. Currently, it is difficult to check the actual coverage for PySpark. This script lets us run tests the same way we do via the run-tests script. The usage is exactly the same as the run-tests script, which this basically wraps.

This script and this PR alone should also be useful. I was asked how to run this before, and it seems some reviewers (including me) need this. It would also be useful to run it manually.

This usually requires only a small diff in normal Python projects, but the PySpark case is a bit different because we are unable to track coverage after the worker process is forked. So, here, I made a custom worker that forces coverage collection, built on top of #20151.

I made a simple demo. Please take a look - https://spark-test.github.io/pyspark-coverage-site.

To show the structure, this PR adds the files below:

python
├── .coveragerc  # Runtime configuration when we run the script.
├── run-tests-with-coverage  # The script that has coverage support and wraps run-tests script.
└── test_coverage  # Directory containing the files required when running coverage.
    ├── conf
    │   └── spark-defaults.conf  # Contains the configuration 'spark.python.daemon.module'.
    ├── coverage_daemon.py  # A daemon with a custom fix that wraps our daemon.py
    └── sitecustomize.py  # Initiate coverage with COVERAGE_PROCESS_START
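
A rough sketch of how these pieces cooperate, assuming the standard coverage.py subprocess-measurement mechanism (the paths below are placeholders, not the exact values used by the script):

```python
# Hedged sketch of the standard coverage.py subprocess mechanism; paths are placeholders.
import os
import subprocess

env = dict(os.environ)
# Point coverage at its runtime configuration.
env["COVERAGE_PROCESS_START"] = "/path/to/spark/python/.coveragerc"
# Make the directory containing sitecustomize.py importable in child processes.
env["PYTHONPATH"] = "/path/to/spark/python/test_coverage" + os.pathsep + env.get("PYTHONPATH", "")

# Any Python process launched with this environment imports sitecustomize at startup;
# its coverage.process_startup() call then starts measurement because
# COVERAGE_PROCESS_START is set.
subprocess.check_call(["python3", "-c", "print('this process would be measured')"], env=env)
```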

Note that this PR has a minor nit:

This scope in daemon.py is not included in the coverage results because I produce the coverage results in worker.py separately and then merge them. I believe it's not a big deal.

In a follow-up, I might set up a site that hosts a single up-to-date PySpark coverage report from the master branch as the fallback/default, or a site that hosts multiple PySpark coverage reports, with the site link left on each pull request.

How was this patch tested?

Manually tested. Usage is the same as the existing Python test script, ./python/run-tests. For example:

sh run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql

Running this will generate HTMLs under ./python/test_coverage/htmlcov.
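
If you want to browse the generated HTMLs locally, one option (plain Python standard library, not part of this PR) is to serve the directory over HTTP:

```python
# Serve the generated coverage HTML report locally; the directory path assumes
# this is run from the Spark repository root.
import functools
import http.server

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler,
    directory="python/test_coverage/htmlcov",
)
http.server.HTTPServer(("localhost", 8000), handler).serve_forever()
```

Then open http://localhost:8000, or simply open htmlcov/index.html directly in a browser.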

Console output example:

sh run-tests-with-coverage --python-executables=python3,python --modules=pyspark-core
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3', 'python']
Will test the following Python modules: ['pyspark-core']
Starting test(python): pyspark.tests
Starting test(python3): pyspark.tests
...
Tests passed in 231 seconds
Combining collected coverage data under /.../spark/python/test_coverage/coverage_data
Reporting the coverage data at /...spark/python/test_coverage/coverage_data/coverage
Name                         Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------
pyspark/__init__.py             41      0      8      2    96%
...
pyspark/profiler.py             74     11     22      5    83%
pyspark/rdd.py                 871     40    303     32    93%
pyspark/rddsampler.py           68     10     32      2    82%
...
--------------------------------------------------------------
TOTAL                         8521   3077   2748    191    59%
Generating HTML files for PySpark coverage under /.../spark/python/test_coverage/htmlcov

@@ -175,6 +175,9 @@ def main():

    task_queue = Queue.PriorityQueue()
    for python_exec in python_execs:
        if "COVERAGE_PROCESS_START" in os.environ:
            # Make sure coverage is installed.
            run_cmd([python_exec, "-c", "import coverage"])

@HyukjinKwon (Member Author):

Manually tested:

./run-tests-with-coverage --python-executables=python3
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'foo'
[error] running python3 -c import foo ; received return code 1

@HyukjinKwon (Member Author)

cc @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Here, I made it based on #20151. Could you take a look at this one too? I can make this PR separate from #20151 by manually adding the custom fix into worker.py if requested.

@HyukjinKwon (Member Author)

cc @icexelloss too. I think we had a small talk about this before.

@icexelloss (Contributor)

This is awesome. Thanks @HyukjinKwon!

@HyukjinKwon HyukjinKwon force-pushed the python-coverage branch 2 times, most recently from 3c3c3cb to dadcae4 Compare January 9, 2018 18:05
@HyukjinKwon (Member Author) commented Jan 9, 2018

Sorry for the noise. I did some cleanup and just rebased.

@SparkQA commented Jan 9, 2018

Test build #85856 has finished for PR 20204 at commit 9f2c400.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85855 has finished for PR 20204 at commit a3179d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85858 has finished for PR 20204 at commit 3c3c3cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2018

Test build #85862 has finished for PR 20204 at commit dadcae4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@icexelloss (Contributor)

I usually run tests like this:

SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ...

Does it make sense to expose a similar usage for enabling coverage? Something like:

SPARK_TESTING_WITH_COVERAGE=1 bin/pyspark pyspark.sql.tests ...
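
For illustration only, the gating idea above could look roughly like this; SPARK_TESTING_WITH_COVERAGE is the hypothetical flag from this comment, not an existing Spark setting, and this is not what the PR implements:

```python
# Hypothetical opt-in flag sketching the suggestion above.
import atexit
import os

if os.environ.get("SPARK_TESTING_WITH_COVERAGE") == "1":
    import coverage

    # Start collecting coverage only when explicitly requested, so the usual
    # SPARK_TESTING workflow is unaffected; save the data when the process exits.
    cov = coverage.Coverage()
    cov.start()
    atexit.register(lambda: (cov.stop(), cov.save()))
```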

@felixcheung (Member) left a comment

Looks reasonable to me.
How are we going to track changes to the coverage numbers?

@HyukjinKwon (Member Author) commented Jan 10, 2018

If anyone is worried about the elapsed time, it seems generally fine:

Before:

sh run-tests --python-executables=python3 --modules=pyspark-sql
...
Finished test(python3): pyspark.sql.readwriter (34s)
Finished test(python3): pyspark.sql.tests (241s)
Tests passed in 241 seconds

After:

sh run-tests-with-coverage  --python-executables=python3 --modules=pyspark-sql
...
Finished test(python3): pyspark.sql.conf (288s)
Finished test(python3): pyspark.sql.tests (288s)
Tests passed in 288 seconds

Given that a full test run usually ends with "Tests passed in 1103 seconds", I think it will only add about three more minutes even if we run this with all Python executables.
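
As a back-of-the-envelope check of that estimate, using only the numbers quoted in this thread:

```python
# Rough arithmetic behind the "about three more minutes" estimate; these are the
# numbers quoted above, not new measurements.
without_coverage = 241   # seconds: pyspark-sql via run-tests
with_coverage = 288      # seconds: pyspark-sql via run-tests-with-coverage
overhead_ratio = (with_coverage - without_coverage) / without_coverage

full_run = 1103          # seconds: a typical full run-tests pass
extra_minutes = full_run * overhead_ratio / 60
print(f"Overhead: {overhead_ratio:.0%}; extrapolated extra time: ~{extra_minutes:.1f} min")
# Prints roughly: Overhead: 20%; extrapolated extra time: ~3.6 min
```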

@HyukjinKwon (Member Author)

@icexelloss, for #20204 (comment), yes, that's the way I usually use it too. My worry, though, is whether this is a proper official way to do it, because I have been thinking this way is rather meant to be internal.

For this reason, I left a comment in this script for now:

+# If you'd like to run a specific unittest class, you could do such as
+# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
+./run-tests $@

If I remember correctly, I think I also had a short talk about this with @nchammas and @BryanCutler before (I think it was about more detailed control, though).

I think we'd better fix the run-tests script to accept a unittest class as an option. I took a look at this before, and I guess it won't be too difficult to introduce another option. Once we fix that, it will also be available in this script too, because the script here wraps run-tests.

I will probably take another look and open a separate PR (maybe within the next week?).

@HyukjinKwon (Member Author)

Looks reasonable to me.
How are we going to track changes to the coverage numbers?

Yeah, that's a good point. I would like to do that too because other tools allow it. I think this is another area I should investigate more when I actually take a shot at integrating it with Jenkins. I will cc you on the JIRA, share my investigation, and try to do this in the follow-up I mentioned in the PR description.

import imp


# This is a hack to always refer to the main code rather than the built zip.

@HyukjinKwon (Member Author):

So, we will always track the coverage in the main code under python/pyspark, not in the built zip of the Python library.
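
A hedged sketch of what such a hack could look like; the layout and names below are placeholders rather than the exact contents of coverage_daemon.py:

```python
# Load pyspark/daemon.py straight from the source tree so coverage is reported
# against python/pyspark, not against the built zip on sys.path. Placeholder paths.
import imp  # deprecated in newer Pythons, but matches the import shown above
import os

# Assume this file lives two directory levels below spark/python/.
main_code_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
daemon = imp.load_source("daemon", os.path.join(main_code_dir, "pyspark", "daemon.py"))

if __name__ == "__main__":
    daemon.manager()  # hand off to the real daemon loop (entry-point name assumed)
```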

@BryanCutler (Member) left a comment

This is great, thanks for working on it!

#

import coverage
coverage.process_startup()

Member:

would you mind explaining how this file is called and used? I can't quite figure it out..

@HyukjinKwon (Member Author):

Yup, will add some comments.
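
For what it's worth, the mechanism is standard coverage.py plus CPython behavior rather than anything PySpark-specific; a commented sketch of those same two lines, under that assumption:

```python
# Python's `site` module imports a module named `sitecustomize` automatically at
# interpreter startup whenever one is importable (for example via PYTHONPATH).
# coverage.process_startup() is a no-op unless the COVERAGE_PROCESS_START
# environment variable points at a coverage configuration file, in which case it
# starts measuring the current process.
import coverage

coverage.process_startup()
```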

@icexelloss (Contributor)

@icexelloss, for #20204 (comment), yes, that's the way I usually use it too. My worry, though, is whether this is a proper official way to do it, because I have been thinking this way is rather meant to be internal.

Sounds good to me. Thanks!

@SparkQA commented Jan 11, 2018

Test build #85969 has finished for PR 20204 at commit e8e7112.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Jan 12, 2018

Test build #85994 has finished for PR 20204 at commit e8e7112.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.


# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests $@

Member:

nit: "$@" instead of $@, just in case.

@HyukjinKwon (Member Author):

Yeap, will update.

@SparkQA commented Jan 13, 2018

Test build #86096 has finished for PR 20204 at commit df9af13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

Hi @rxin, @felixcheung, @ueshin, @holdenk and @JoshRosen. Could you take another look please when you guys have some time?

@yhuai, I believe this is nice to have to prevent mistakes like the previous Kafka and Flume one.

@ueshin (Member) left a comment

LGTM except for one comment.

# and generate HTMLs, for example, 'coverage3' in Python 3.
COV_EXEC="${COV_EXEC:-coverage}"
FWDIR="$(cd "`dirname $0`"; pwd)"
pushd "$FWDIR" > /dev/null

Member:

Do we need to use pushd and its corresponding popd at the end of this file? I guess we can simply use cd here.

Member:

My 2c: I think it's ok, I'd prefer it; it might be useful in the future when more cd calls are added.

Member:

I see, no problem at all. I just wanted to confirm. Thanks!

@felixcheung (Member) left a comment

LGTM
Perhaps it would be great to follow up with some steps on how to run it, what to look for, and how to diff changes :)
But this is a very dev-centric thing and you have a lot of comments inline, so I'd say we are good here.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented Jan 18, 2018

Test build #86310 has finished for PR 20204 at commit df9af13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


# If you'd like to run a specific unittest class, you could do such as
# SPARK_TESTING=1 ../bin/pyspark pyspark.sql.tests VectorizedUDFTests
./run-tests "$@"

@HyukjinKwon (Member Author):

Another tip: if we use ../bin/pyspark here, run some simple tests, and then exit, it still appears to produce the coverage correctly.

@HyukjinKwon (Member Author)

Will merge this one if there are no more comments in a few days.

@HyukjinKwon (Member Author)

Merged to master.

@asfgit asfgit closed this in 87ffe7a Jan 22, 2018
@HyukjinKwon HyukjinKwon deleted the python-coverage branch October 16, 2018 12:44
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
… via Jenkins

## What changes were proposed in this pull request?

### Background

For the current status: the test script that generates coverage information has been merged into Spark (apache#20204).

So, we can generate the coverage report and site by, for example:

```
run-tests-with-coverage --python-executables=python3 --modules=pyspark-sql
```

like `run-tests` script in `./python`.

### Proposed change

The next step is to host this coverage report via `github.io`, updated automatically by Jenkins (see https://spark-test.github.io/pyspark-coverage-site/).

This uses my testing account for Spark, spark-test, which was shared with Felix and Shivaram a long time ago for testing purposes, including AppVeyor.

To cut this short, this PR targets running the coverage in [spark-master-test-sbt-hadoop-2.7](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/).

In that specific job, it will clone the site and rebuild the up-to-date PySpark test coverage from the latest commit, for instance as below:

```bash
# Clone PySpark coverage site.
git clone https://github.com/spark-test/pyspark-coverage-site.git

# Remove existing HTMLs.
rm -fr pyspark-coverage-site/*

# Copy generated coverage HTMLs.
cp -r .../python/test_coverage/htmlcov/* pyspark-coverage-site/

# Check out to a temporary branch.
git symbolic-ref HEAD refs/heads/latest_branch

# Add all the files.
git add -A

# Commit current HTMLs.
git commit -am "Coverage report at latest commit in Apache Spark"

# Delete the old branch.
git branch -D gh-pages

# Rename the temporary branch to gh-pages.
git branch -m gh-pages

# Finally, force update to our repository.
git push -f origin gh-pages
```

So, a single up-to-date coverage report can be shown on the `github.io` page. The commands above were manually tested.

### TODOs

- [x] Write a draft HyukjinKwon
- [x] `pip install coverage` for all Python implementations (pypy, python2, python3) in Jenkins workers - shaneknapp
- [x] Set hidden `SPARK_TEST_KEY` for spark-test's password in Jenkins via Jenkins's feature
 This should be set in both PR builder and `spark-master-test-sbt-hadoop-2.7` so that later other PRs can test and fix the bugs - shaneknapp
- [x] Set an environment variable that indicates `spark-master-test-sbt-hadoop-2.7` so that that specific build can report and update the coverage site - shaneknapp
- [x] Make PR builder's test passed HyukjinKwon
- [x] Fix flaky test related with coverage HyukjinKwon
  -  6 consecutive passes out of 7 runs

This PR will be co-authored by me and shaneknapp.

## How was this patch tested?

It will be tested via Jenkins.

Closes apache#23117 from HyukjinKwon/SPARK-7721.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: hyukjinkwon <[email protected]>
Co-authored-by: shane knapp <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Jan 23, 2024
…ge script

### What changes were proposed in this pull request?

This PR cleans up the obsolete code in PySpark coverage script

### Why are the changes needed?

We used to use `coverage_daemon.py` for Python workers to track coverage on the Python worker side (e.g., coverage within Python UDFs), added in #20204. However, it seems it does not work anymore; in fact, it stopped working multiple years ago. The approach of replacing the Python worker itself was a bit of a hacky workaround. We should just get rid of it first, and find a proper way.

This should also deflake the scheduled jobs, and speed up the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested via:

```bash
./run-tests-with-coverage --python-executables=python3 --testname="pyspark.sql.functions.builtin"
```

```
...
Finished test(python3): pyspark.sql.tests.test_functions (87s)
Tests passed in 87 seconds
Combining collected coverage data under
...
Creating XML report file at python/coverage.xml
Wrote XML report to coverage.xml
Reporting the coverage data at /.../spark/python/test_coverage/coverage_data/coverage
Name                                    Stmts   Miss Branch BrPart  Cover
-------------------------------------------------------------------------
pyspark/__init__.py                        48      7     10      3    76%
pyspark/_globals.py                        16      3      4      2    75%
pyspark/accumulators.py                   123     38     26      5    66%
pyspark/broadcast.py                      121     79     40      3    33%
pyspark/conf.py                            99     33     50      5    64%
pyspark/context.py                        451    216    151     26    51%
pyspark/errors/__init__.py                  3      0      0      0   100%
pyspark/errors/error_classes.py             3      0      0      0   100%
pyspark/errors/exceptions/__init__.py       0      0      0      0   100%
pyspark/errors/exceptions/base.py          91     15     24      4    83%
pyspark/errors/exceptions/captured.py     168     81     57     17    48%
pyspark/errors/utils.py                    34      8      6      2    70%
pyspark/files.py                           34     15     12      3    57%
pyspark/find_spark_home.py                 30     24     12      2    19%
pyspark/java_gateway.py                   114     31     30     12    69%
pyspark/join.py                            66     58     58      0     6%
pyspark/profiler.py                       244    182     92      3    22%
pyspark/rdd.py                           1064    741    378      9    27%
pyspark/rddsampler.py                      68     50     32      0    18%
pyspark/resource/__init__.py                5      0      0      0   100%
pyspark/resource/information.py            11      4      4      0    73%
pyspark/resource/profile.py               110     82     58      1    27%
pyspark/resource/requests.py              139     90     70      0    35%
pyspark/resultiterable.py                  14      6      2      1    56%
pyspark/serializers.py                    349    185     90     13    43%
pyspark/shuffle.py                        397    322    180      1    13%
pyspark/sql/__init__.py                    14      0      0      0   100%
pyspark/sql/catalog.py                    203    127     66      2    30%
pyspark/sql/column.py                     268     78     64     12    67%
pyspark/sql/conf.py                        40     16     10      3    58%
pyspark/sql/context.py                    170     95     58      2    47%
pyspark/sql/dataframe.py                  900    475    459     40    45%
pyspark/sql/functions/__init__.py           3      0      0      0   100%
pyspark/sql/functions/builtin.py         1741    542   1126     26    76%
pyspark/sql/functions/partitioning.py      41     19     18      3    59%
pyspark/sql/group.py                       81     30     32      3    65%
pyspark/sql/observation.py                 54     37     22      1    26%
pyspark/sql/pandas/__init__.py              1      0      0      0   100%
pyspark/sql/pandas/conversion.py          277    249    156      2     8%
pyspark/sql/pandas/functions.py            67     49     34      0    18%
pyspark/sql/pandas/group_ops.py            89     65     22      2    25%
pyspark/sql/pandas/map_ops.py              37     27     10      2    26%
pyspark/sql/pandas/serializers.py         381    323    172      0    10%
pyspark/sql/pandas/typehints.py            41     32     26      1    15%
pyspark/sql/pandas/types.py               407    383    326      1     3%
pyspark/sql/pandas/utils.py                29     11     10      5    59%
pyspark/sql/profiler.py                    80     47     54      1    39%
pyspark/sql/readwriter.py                 362    253    146      7    27%
pyspark/sql/session.py                    469    206    228     22    56%
pyspark/sql/sql_formatter.py               41     26     16      1    28%
pyspark/sql/streaming/__init__.py           4      0      0      0   100%
pyspark/sql/streaming/listener.py         400    200    186      1    61%
pyspark/sql/streaming/query.py            102     63     40      1    39%
pyspark/sql/streaming/readwriter.py       268    207    118      2    21%
pyspark/sql/streaming/state.py            100     68     44      0    29%
pyspark/sql/tests/__init__.py               0      0      0      0   100%
pyspark/sql/tests/test_functions.py       646      2    244      7    99%
pyspark/sql/types.py                     1013    355    528     74    62%
pyspark/sql/udf.py                        240    132     90     20    42%
pyspark/sql/udtf.py                       152     98     52      2    33%
pyspark/sql/utils.py                      160     83     54     10    45%
pyspark/sql/window.py                      89     23     56      5    77%
pyspark/statcounter.py                     79     58     20      0    21%
pyspark/status.py                          36     13      6      0    55%
pyspark/storagelevel.py                    41      9      0      0    78%
pyspark/taskcontext.py                    111     63     40      1    40%
pyspark/testing/__init__.py                 2      0      0      0   100%
pyspark/testing/sqlutils.py               149     44     52      1    75%
pyspark/testing/utils.py                  312    238    162      2    17%
pyspark/traceback_utils.py                 38      4     14      6    81%
pyspark/util.py                           153    120     56      2    18%
pyspark/version.py                          1      0      0      0   100%
...
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44842 from HyukjinKwon/SPARK-46802.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>