-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kola: Do failed units/SELinux checks both before *and* after tests #2067
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ahh oops I didn't see #2064 before this...hmm, I think we can merge the two? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh oops I didn't see #2064 before this...hmm, I think we can merge the two?
Sure, SGTM!
But I'm really hesitant to get this in again without at least a declarative way to disable the check via e.g. kola-denylist.yaml
. (But ideally, we'd have fine-grained control on allowed SELinux violations like suggested in #1920 (comment), but that could come after).
// CheckMachineBasics validates that no systemd units have | ||
// failed and no SELinux AVC denials were logged. It may be | ||
// extended in the future - the idea is these are "baseline" | ||
// errors that shouldn't be seen across any tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the idea is these are "baseline" errors that shouldn't be seen across any tests.
Hmm, I'm concerned this might not be a safe assumption. The tests themselves could be testing these errors. I don't really imagine us testing the SELinux policy really, but failed systemd units sounds plausible. Though I guess... the test then could fix the error condition and restart the unit so that it's not in a failed state?
Anyway, willing to try!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests themselves could be testing these errors.
Right, but anything intentionally causing a systemd unit failure could do e.g. systemctl reset-failed
before it exits. Scrubbing SELinux avc denials though is probably a lot more annoying to do.
Perhaps simplest is a per-test flag to turn this off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 to this. Will relook once the two similar PRs are brought together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to merging PRs and what's in this one :)
First: Revert "Revert "kola: Add --no-default-checks"" (this reverts commit: 57b3520 ) Now, when I went to add the SELinux denials check, I completely missed that we only were checking for denials *before* the test runs. We obviously want to check both before and after - the same way we do with e.g. looking for errors on the console. As part of this, I'd also added `--no-default-checks` but it actually didn't work. Fix that so that it does work, and add a test case with a failing systemd unit that verifies this, so we can test the test system. In order to help ratchet things, also add a special `default-checks` entry to the denylist. This way if e.g. we are hitting SELinux AVC denials just on one arch/test we can turn them off temporarily.
0385eee
to
d53a4f6
Compare
@@ -63,6 +63,7 @@ func init() { | |||
bv(&kola.NoNet, "no-net", false, "Don't run tests that require an Internet connection") | |||
ssv(&kola.Tags, "tag", []string{}, "Test tag to run. Can be specified multiple times.") | |||
bv(&kola.Options.SSHOnTestFailure, "ssh-on-test-failure", false, "SSH into a machine when tests fail") | |||
bv(&kola.Options.SuppressDefaultChecks, "no-default-checks", false, "Disable default checks for failed systemd units and SELinux AVC denials") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more on this; WDYT about splitting this into two separate switches? One for systemd units and one for SELinux denials.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To elaborate: I think eventually we should have precise control over what checks we want to allow to fail. Refining that can come later (because it requires more kola work), but at least right now we already have two broad categories we can split on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Agree that unit failures are usually much worse. But making it more granular feels tricky because we're likely to add more here. A good example is systemd generator failures and systemd ordering failures are also missed today.
And another good one is coredumps (which usually result in a unit failure, but not always).
Maybe what we really want is a syntax to ignore specific AVC denials or unit failures? Something like:
# This file documents currently known-to-fail kola tests. It is consumed by
# coreos-assembler to automatically skip some tests. For more information,
# see: https://github.com/coreos/coreos-assembler/pull/866.
tests:
- pattern: fcos.internet
tracker: https://github.com/coreos/coreos-assembler/pull/1478
- pattern: podman.workflow
tracker: https://github.com/coreos/coreos-assembler/pull/1478
avc:
# 3 tuple is string match on (vector, scontext, tcontext)
- read chronyd_t systemd_resolved_var_run_t
units:
# Failing on vsphere, but not a blocker
- NetworkManager.service
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But making it more granular feels tricky because we're likely to add more here.
True. But while having a single switch is useful for dev, it's too blunt an instrument for CI/prod because it makes it too easy to miss out on regressions because of e.g. one AVC we're working around. (This was part of the motivation for #2064.)
Maybe what we really want is a syntax to ignore specific AVC denials or unit failures? Something like:
Yes, 100%. (This is what I mean with #1920 (comment); though yeah it'd be cleaner to first make the top-level object a dict rather than a list).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And probably before that, move this denylist directly into kola.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're willing to blanket-ignore denials/failures in specific tests, we have the existing test flags mechanism for this. It doesn't currently support more granular filters, but in principle it could.
Pairs with coreos/coreos-assembler#2067 which will allow us to turn on SELinux AVC checks before/after all tests. But we can't do that until the chrony AVC is fixed.
@cgwalters: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@cgwalters: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
I think there's rough agreement that this should be more granular rather than a global flag. One easy approach would just be special tags (e.g. That said, we actually haven't felt this pain too much in recent times AFAIK. |
When I went to add the SELinux denials check, I completely missed
that we only were checking for denials before the test runs.
We obviously want to check both before and after - the same
way we do with e.g. looking for errors on the console.
As part of this, I'd also added
--no-default-checks
but itactually didn't work. Fix that so that it does work, and add
a test case with a failing systemd unit that verifies this,
so we can test the test system.