Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Make envoy_select_quiche no-op. #6393

Merged
merged 14 commits into from
Apr 5, 2019

Conversation

wu-bin
Copy link
Contributor

@wu-bin wu-bin commented Mar 27, 2019

Description:

Remove envoy_select_quiche from envoy_build_system.bzl.

Risk Level: none. build only.
Testing:

bazel test --test_output=all test/extensions/quic_listeners/quiche/platform:all @com_googlesource_quiche//:all
bazel test --test_output=all --define quiche=enabled test/extensions/quic_listeners/quiche/platform:all @com_googlesource_quiche//:all

Docs Changes: none
Release Notes: none

@wu-bin wu-bin mentioned this pull request Mar 27, 2019
wu-bin added 2 commits March 28, 2019 10:33
1) add "bazel clean" for bazel.release, at the end of bazel_release_binary_build().
2) add "allow-multiple-definition" for coverage and clang_tidy.

Signed-off-by: Bin Wu <[email protected]>
Signed-off-by: Bin Wu <[email protected]>
@wu-bin wu-bin changed the title Get rid of envoy_select_quiche. ci: Make envoy_select_quiche no-op. Mar 28, 2019
@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 28, 2019

/retest

@repokitteh-read-only
Copy link

🙀 failed invoking rebuild of ci/circleci: coverage: 500 Internal Server Error

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 28, 2019

/retest

@repokitteh-read-only
Copy link

🙀 failed invoking rebuild of ci/circleci: docker: 500 Internal Server Error

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 28, 2019

/retest

@repokitteh-read-only
Copy link

🙀 failed invoking rebuild of ci/circleci: coverage: 500 Internal Server Error

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 29, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 29, 2019

@lizan @htuch While working on this, the coverage ci(example) keeps failing at some EXPECT_DEATH statements(example), the symptom is that the test does die at the EXPECT_DEATH, but does not emit the expected error message before it dies, e.g.

test/extensions/quic_listeners/quiche/platform/quic_platform_test.cc:381: Failure
Death test: { switch (0) default: if (!(((__builtin_expect(!(false), 0))) && quic::IsLogLevelEnabled(quic::FATAL))) { } else quic::QuicLogEmitter(quic::FATAL).stream() << "CHECK failed: " "false" "." << " Supposed to fail in all modes."; }
Result: died but not with expected error.
Expected: contains regular expression "CHECK failed:.* Supposed to fail in all modes."
Actual msg:

I noticed run_envoy_bazel_coverage.sh sets "-c dbg --copt=-DNDEBUG" in the bazel test command, could it be related to the test failure?

@htuch
Copy link
Member

htuch commented Mar 29, 2019

@wu-bin I'm a bit confused by this PR. Why are there QUICHE test changes as well? Is this related?

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 29, 2019

@wu-bin I'm a bit confused by this PR. Why are there QUICHE test changes as well? Is this related?

Yes, QUICHE test is related. The test has 2 classes of failures when running in "coverage" ci:

  1. Some tests expect the log level is "error" at the beginning of the test, however, the test run in coverage ci set the default log level to "trace". This is easily fixed in the QUICHE test.

  2. There are a few tests failed at EXPECT_DEATH(). For example,
    EXPECT_DEATH(PANIC("abort now"), "abort now");
    will fail, because the PANIC statement caused the child process to die, but the child process doesn't print "abort now" before it dies. This is the issue I asked you and @lizan in the previous comment. I'm still struggling on this one. Do you have any idea?

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't see why CHECK would behave differently in coverage mode only, that's pretty weird. The main difference of coverage builds is we are using gcc and we compile all tests into a single test binary. I had a look at the CHECK implementation and it seems legit.

@wu-bin
Copy link
Contributor Author

wu-bin commented Mar 31, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

@htuch The latest coverage run actually went ok, it's "failed" because the code coverage is 0.1% lower than the threshold(see below). Is it ok if I lower the threshold to 97%?

sent 100,311,982 bytes received 19,076 bytes 200,662,116.00 bytes/sec
total size is 100,145,559 speedup is 1.00
Code coverage 97.4 is lower than limit of 97.5

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

This should also unblock @lizan 's #6229

@htuch
Copy link
Member

htuch commented Apr 2, 2019

@wu-bin we can't lower it; the only reason we ever do that is for structural reasons (e.g. switching coverage tools or toolchains). Why do you think this resulted in such as decrease?

I think we have a more serious situation though with the state of Envoy coverage. @envoyproxy/maintainers do we need to send out an all-hands to get folks to try and improve coverage again? I'm wondering if there is b/w to go fix some of the worst offending code.

@mattklein123
Copy link
Member

I think we have a more serious situation though with the state of Envoy coverage. @envoyproxy/maintainers do we need to send out an all-hands to get folks to try and improve coverage again? I'm wondering if there is b/w to go fix some of the worst offending code.

I will send an email and ask folks to look at the master coverage report to see if they recently added changes that are missing coverage.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

@wu-bin we can't lower it; the only reason we ever do that is for structural reasons (e.g. switching coverage tools or toolchains). Why do you think this resulted in such as decrease?

I guess the reason for the slight drop of coverage is that, some quiche platform impl code are included in the report, but their tests are not under //test, instead they are tested by the (exernal) QUICHE code
for example: quiche/platform/string_utils.cc has low coverage in the report , but it is covered in this QUICHE test.

I believe quiche/platform/*.cc was not part of the report before this PR.

Does it count as a structural reason?

I think we have a more serious situation though with the state of Envoy coverage. @envoyproxy/maintainers do we need to send out an all-hands to get folks to try and improve coverage again? I'm wondering if there is b/w to go fix some of the worst offending code.

@htuch
Copy link
Member

htuch commented Apr 2, 2019

@wu-bin I think you have 3 options:

  1. Write Envoy-side tests for this code; after all, it's Envoy code :)
  2. Run the external QUICHE tests as part of coverage in CI.
  3. Exclude the QUICHE platform layer from Envoy's code coverage in CI.

If (2) can be made to work, that would be ideal, otherwise I would argue for (1) as being slightly more preferable to (3), but (3) is probably OK given the nature of this project and integration. At least IMHO, other maintainers might differ.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #6393 (comment) was created by @wu-bin.

see: more, trace.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 2, 2019

@wu-bin I think you have 3 options:

  1. Write Envoy-side tests for this code; after all, it's Envoy code :)
  2. Run the external QUICHE tests as part of coverage in CI.
  3. Exclude the QUICHE platform layer from Envoy's code coverage in CI.

If (2) can be made to work, that would be ideal, otherwise I would argue for (1) as being slightly more preferable to (3), but (3) is probably OK given the nature of this project and integration. At least IMHO, other maintainers might differ.

@htuch 2 is done. See the change in test/coverage/gen_build.sh. The PR is ready for review now.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a few comments and we can ship it. Thanks.

@@ -54,6 +54,9 @@ function bazel_release_binary_build() {
cp -f "${ENVOY_DELIVERY_DIR}"/envoy "${ENVOY_SRCDIR}"/build_release
mkdir -p "${ENVOY_SRCDIR}"/build_release_stripped
strip "${ENVOY_DELIVERY_DIR}"/envoy -o "${ENVOY_SRCDIR}"/build_release_stripped/envoy
# TODO(wu-bin): Remove once https://github.com/envoyproxy/envoy/pull/6229 is merged.
bazel clean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to clean?

Copy link
Contributor Author

@wu-bin wu-bin Apr 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What appears to be happening is that the "bazel build" in this function and the "bazel build" after this function created same object files in different directories, once in bazel-bin/external/envoy/source/..., and once in bazel-bin/source/...

When the second "bazel build" runs, both instances of object files are given into the linker, causing "multiple definition" errors. See this link for the actual failure message.

The bazel clean hack here removes one copy of the objects, thus avoids the "multiple definition" error. The hack won't be needed after the ci WORKSPACE is removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, @lizan do you know when we might see #6229 merged? Would be great to avoid having to include this, but if not we can hopefully get a cleanup PR ASAP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm completely stuck on that one as the coverage is hard to debug (and not reproducible locally very well), will try something tonight but no ETA for now. The coverage failure seems unrelated to the change though. It would be nice if you can take a look (pushed latest master to the PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lizan The latest coverage run from your PR failed because some GRPC related test failed. Those tests seems to be flaky, since I've seen the same failures while debugging my PR. I've just started a retest.

@htuch #6229 was stuck on coverage failure, which is likely fixed in this PR, since my change in do_ci.sh's "bazel.coverage" section is very similar to that one. I think we should get this PR in first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the coverage test failure in #6229 is not a flake, I have retested it 3 times, the following test fails every time:

SslIpVersionsClientType/GrpcSslClientIntegrationTest.BasicSslRequestWithClientCert/IPv4_EnvoyGrpc

@htuch htuch added the waiting label Apr 2, 2019
@htuch
Copy link
Member

htuch commented Apr 2, 2019

@wu-bin hmm, I just had a look a the CI logs. We've gone from ~400 Envoy tests to 11961 tests in total. That seems a scary increase :) CI took 1h 41m, vs. a recent run in another PR of 1h 30m. If we could just filter down to the platform tests, maybe this wouldn't be so bad.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 3, 2019

@wu-bin hmm, I just had a look a the CI logs. We've gone from ~400 Envoy tests to 11961 tests in total. That seems a scary increase :) CI took 1h 41m, vs. a recent run in another PR of 1h 30m. If we could just filter down to the platform tests, maybe this wouldn't be so bad.

Impressive growth:)

All the current QUICHE tests are platform tests, I'll discuss with @danzh2010 and @mpwarres on how to manage the non-platform tests.

@wu-bin
Copy link
Contributor Author

wu-bin commented Apr 4, 2019

Oh sorry I might misunderstood you.

To clarify, the 11K tests are mostly Envoy proper tests, QUICHE platform currently adds about 11961(from this pr)-11889(from a test of master)=72 tests.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. I think I was looking at a non-coverage run which didn't aggregate before or something. LGTM once CI passes. We are going into a merge freeze in 45 mins until tomorrow afternoon for the 1.9.1 security release fix FYI.

@htuch htuch merged commit d1cdd25 into envoyproxy:master Apr 5, 2019
mpuncel added a commit to mpuncel/envoy that referenced this pull request Apr 8, 2019
* master: (137 commits)
  test: router upstream log to v2 config stubs (envoyproxy#6499)
  remove idle timeout validation (envoyproxy#6500)
  build: Change namespace of chromium_url. (envoyproxy#6506)
  coverage: exclude chromium_url (envoyproxy#6498)
  fix(tracing): allow 256 chars in path tag (envoyproxy#6492)
  Common: Introduce StopAllIteration filter status for decoding and encoding filters (envoyproxy#5954)
  build: update PGV url (envoyproxy#6495)
  subset lb: avoid partitioning host lists on worker threads (envoyproxy#6302)
  ci: Make envoy_select_quiche no-op. (envoyproxy#6393)
  watcher: notify when watched files are modified (envoyproxy#6215)
  stat: Add counterFromStatName(), gaugeFromStatName(), and histogramFromStatName() (envoyproxy#6475)
  bump to 1.11.0-dev (envoyproxy#6490)
  release: bump to 1.10.0 (envoyproxy#6489)
  hcm: path normalization. (#1)
  build: import manually minified Chrome URL lib. (envoyproxy#3)
  codec: reject embedded NUL in headers. (envoyproxy#2)
  Added veryfication if path contains query params and add them to path header (envoyproxy#6466)
  redis: basic integration test for redis_proxy (envoyproxy#6450)
  stats: report sample count as an integer to prevent loss of precision (envoyproxy#6274)
  Added VHDS protobuf message and updated RouteConfig to include it. (envoyproxy#6418)
  ...

Signed-off-by: Michael Puncel <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants