Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-enable coverage #16913

Merged
merged 7 commits into from
Mar 23, 2023
Merged

Conversation

tokenrove
Copy link
Contributor

@tokenrove tokenrove commented Jan 3, 2023

This has been languishing for a bit, and the "hard part" is figuring out exactly what llvm-cov invocations we want to best present the information, what tests should run in nightly to get useful coverage information, and so on. So I'm going to punt on these bits and aim to get the rest merged, which enables using mzcompose with --coverage and does demonstrate adding the fast sqllogictests with coverage to nightly, which is useful as a starting point to experiment with that coverage data. We can iterate on the rest.

Motivation

  • This PR adds a known-desirable feature, see MaterializeInc/database-issues#4781.

Checklist

@tokenrove tokenrove force-pushed the julian/reenable-coverage branch 3 times, most recently from b4fb7f6 to 843e26d Compare January 11, 2023 16:43
@tokenrove tokenrove force-pushed the julian/reenable-coverage branch from 843e26d to 93210e5 Compare January 16, 2023 15:28
@tokenrove tokenrove force-pushed the julian/reenable-coverage branch 7 times, most recently from 38b1b3a to 0e3ca6a Compare February 13, 2023 13:11
[
"rust-cov",
"export",
"--format=lcov",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to export as lcov?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in the middle of experimenting with a bunch of ways of presenting the output. Originally I exported the JSON, which no tools seem to consume at all... so at least lcov output can be consumed by lcov's genhtml. Which I know is more useful than llvm-cov show's own HTML format which is completely unusable because it produces a giant, multi-hundred-megabyte HTML file. Suggestions appreciated though!

"rust-profdata",
"merge",
"-sparse",
*Path("test/sqllogictest/coverage").glob("*.profraw"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure each profraw file has a unique name? environmentd and clusterd might be colliding with same name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the problem with the environment variable set in mzcompose; FYI, reviewing this right now is probably not very useful since I am working on it. Or you could take it over as it is if you want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nono, just wrote down my thoughts as I read it out of interest, not a real review.

@def-
Copy link
Contributor

def- commented Feb 13, 2023

I have a whole bunch of issues open regarding llvm-cov, but that was with C++, I'd be interested to see a coverage report and if things work well with Rust. I basically reimplemented llvm-cov show last time around based on the json export to show coverage as inline reviews on a PR.

@@ -29,6 +29,7 @@ junit_*.xml
perf.data*
**/*.rej
**/*.orig
**/*.profraw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can also add .profdata

@tokenrove tokenrove force-pushed the julian/reenable-coverage branch from 0e3ca6a to 9591786 Compare February 13, 2023 13:55
@def-
Copy link
Contributor

def- commented Mar 22, 2023

@tokenrove Any update on this? I might try working on PR code coverage soon, and would take this PR as the basis for my work.

@tokenrove
Copy link
Contributor Author

Yes, I was going to suggest that I get rid of the commits that add this to nightly, and just get the other commits merged so mzcompose can be invoked with --coverage, and then you can take the rest.

@def-
Copy link
Contributor

def- commented Mar 22, 2023

Alright, that would be fine.

@tokenrove tokenrove force-pushed the julian/reenable-coverage branch from 9591786 to ba58d69 Compare March 22, 2023 18:01
@tokenrove tokenrove changed the title WIP: Re-enable coverage Re-enable coverage Mar 22, 2023
@tokenrove tokenrove marked this pull request as ready for review March 22, 2023 20:05
@tokenrove tokenrove requested a review from a team as a code owner March 22, 2023 20:05
@tokenrove
Copy link
Contributor Author

I haven't re-run the nightly so I haven't checked if this still does the right thing, but I'll do that now.

instrument-coverage has been stable since Rust 1.60.

I am removing the unresolved symbols warning hack for Nix because,
like the commentors who closed the reference issues, it's unclear to
me what would cause this, and I'd rather resolve it by fixing the
actual issue.  Looking at related issues like rust-lang/rust#92769,
it's probably an artifact of the implementation in nightly, which
hopefully is fixed now that it's stable.
Necessary to run rust-profdata and rust-cov for working with coverage
data.
Without this, we end up mutating the same config again, which then
causes docker compose to fail with:

  services.sqllogictest.environment array items[1,2] must be unique
@tokenrove tokenrove force-pushed the julian/reenable-coverage branch from ba58d69 to dce2ab2 Compare March 22, 2023 20:17
Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't affect anything except coverage, which is unused currently anyway. So fine by my to submit, then I'll start on https://github.com/MaterializeInc/materialize/issues/18340

The goal here is to avoid sqllogictest and clusterd from overwriting
each-other's profraw outputs.  %p is by PID, which one would expect to
be sufficient, and %Nm allows up to 9 concurrent profiles for the same
build-ID; perhaps the two together is overkill, but this seems like a
reasonable starting point.
We can now use the stable Rust toolchain and hopefully this no longer
suffers from the aforementioned OOMs.

Just fast sqllogictests for now, with coverage in the raw JSON format
until we figure out what we want to do with it.
So that one can conveniently locally run environmentd or sqllogictest
with coverage.
@tokenrove tokenrove force-pushed the julian/reenable-coverage branch from 4d741e3 to 402003b Compare March 23, 2023 11:37
@tokenrove tokenrove merged commit 6f60699 into MaterializeInc:main Mar 23, 2023
@tokenrove
Copy link
Contributor Author

Just to note: we don't get accurate results for clusterd currently because it gets killed by SIGTERM and libc's atexit handlers don't run, which would normally invoke __llvm_profile_write_file. Specifying %c in the LLVM_PROFILE_FILE variable should fix this, but seems to cause some other problem and profiling doesn't get written at all.

@def-
Copy link
Contributor

def- commented Mar 23, 2023

Based on my experience %c is only functional on macOS, not on Linux. I'll try to fix this by handling SIGTERM and writing the profile file.

@tokenrove
Copy link
Contributor Author

Based on my experience %c is only functional on macOS, not on Linux.

Oof, good to know.

handling SIGTERM and writing the profile file.

I considered this but signal handling in Rust is still a disaster. It should be sufficient to catch SIGTERM and then make sure the program exits so that atexit's handlers run.

I had hopes that rust-lang/rust#102900 would fix this since I get a ton of LLVM Profile Error: __llvm_profile_counter_bias is undefined messages when using %c, but alas it does not.

@def-
Copy link
Contributor

def- commented Mar 23, 2023

The %c problems are independent of Rust, see https://clang.llvm.org/docs/SourceBasedCodeCoverage.html

“%c” expands out to nothing, but enables a mode in which profile counter updates are continuously synced to a file. This means that if the instrumented program crashes, or is killed by a signal, perfect coverage information can still be recovered. Continuous mode does not support value profiling for PGO, and is only supported on Darwin at the moment. Support for Linux may be mostly complete but requires testing, and support for Windows may require more extensive changes: please get involved if you are interested in porting this feature.

I tried enabling the Linux support for %c, but the results were nonsensical. I didn't spend more effort on getting it working.

@tokenrove
Copy link
Contributor Author

Right, but this error message is pertinent, as it prevents any profiling from happening: https://github.com/llvm/llvm-project/blob/0eabf59528f3c3f64923900cae740d9f26c45ae8/compiler-rt/lib/profile/InstrProfilingFile.c#L574

Maybe we need to add the -runtime-counter-relocation option to LLVM.

In fact, this helps:

diff --git a/misc/python/materialize/cli/run.py b/misc/python/materialize/cli/run.py
index 964c2e1ac..5069dca69 100644
--- a/misc/python/materialize/cli/run.py
+++ b/misc/python/materialize/cli/run.py
@@ -268,7 +268,9 @@ def _build(args: argparse.Namespace, extra_programs: list[str] = []) -> int:
         features += ["tokio-console"]
         env["RUSTFLAGS"] = env.get("RUSTFLAGS", "") + " --cfg=tokio_unstable"
     if args.coverage:
-        env["RUSTFLAGS"] = env.get("RUSTFLAGS", "") + " -Cinstrument-coverage"
+        env["RUSTFLAGS"] = (env.get("RUSTFLAGS", "")
+                            + " -Cinstrument-coverage"
+                            + " -Cllvm-args=-runtime-counter-relocation")
     if args.features:
         features.extend(args.features.split(","))
     if features:
diff --git a/misc/python/materialize/mzbuild.py b/misc/python/materialize/mzbuild.py
index 08f6f0cda..c23c9b60d 100644
--- a/misc/python/materialize/mzbuild.py
+++ b/misc/python/materialize/mzbuild.py
@@ -232,6 +232,7 @@ class CargoBuild(CargoPreImage):
         if rd.coverage:
             self.rustflags += [
                 "-Cinstrument-coverage",
+                "-Cllvm-args=-runtime-counter-relocation",
             ]
         if len(self.bins) == 0 and len(self.examples) == 0:
             raise ValueError("mzbuild config is missing pre-build target")

I'll open a followup PR to add this though perhaps it's a rustc bug that this doesn't get supplied.

@def-
Copy link
Contributor

def- commented Mar 23, 2023

I mean I changed the LLVM/compiler-rt source code to enable bias variables on Linux, but they didn't work then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants