Leverage logstream to get better e2e diagnostic logs. #3388
logstream is a package that we use in Knative to inline the system component logs relevant to a particular test into the test logs. It relies on tests following the auto-naming convention we use with our helper method `helpers.ObjectNameForTest(t)`, which attaches a common prefix to resource names, allowing logstream to filter the relevant log lines.
By virtue of using the `genreconciler` and other knative.dev/pkg controller infrastructure, each reconciler using `logging.FromContext(ctx)` on the context passed to `ReconcileKind` already gets a logger that's infused with the "key" being reconciled, but other shared libraries may require further instrumentation for this to reach its full potential.
I'm instrumenting several tests where I have seen downstream flakes as a start, and hopefully this illuminates why those tests are flaking by shining a light on controller activities during the test's execution.
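To make the naming convention concrete, here is a minimal, self-contained Go sketch. `prefixForTest` and `relevant` are hypothetical stand-ins written for illustration; they are not the real `helpers.ObjectNameForTest` or the logstream filter, and the dash-separated form is only inferred from the resource name in the sample log line quoted later in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// prefixForTest is a hypothetical stand-in for the naming helper. It turns a
// Go test name into a lowercase, dash-separated prefix that resource names
// created by that test could share.
func prefixForTest(testName string) string {
	var b strings.Builder
	lastDash := true // true at the start so we never emit a leading dash
	for _, r := range strings.TrimPrefix(testName, "Test") {
		switch {
		case unicode.IsLower(r) || unicode.IsDigit(r):
			b.WriteRune(r)
			lastDash = false
		case unicode.IsUpper(r):
			// Each capital letter starts a new dash-separated word.
			if !lastDash {
				b.WriteByte('-')
			}
			b.WriteRune(unicode.ToLower(r))
			lastDash = false
		default:
			// Anything else ("/", "=", "_") collapses into a single dash.
			if !lastDash {
				b.WriteByte('-')
				lastDash = true
			}
		}
	}
	return strings.TrimSuffix(b.String(), "-")
}

// relevant reports whether a component log line mentions a resource whose
// name carries the test's prefix. This is an assumed, simplified filter.
func relevant(line, prefix string) bool {
	return strings.Contains(line, prefix)
}

func main() {
	prefix := prefixForTest("TestTaskRunPipelineRunCancel/retries=1")
	fmt.Println(prefix)

	line := "[arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] Adding to queue"
	fmt.Println(relevant(line, prefix))
}
```

Because the prefix is derived from the test name alone, a filter like this needs no per-test configuration. The flip side is that a test creating resources with hand-picked names would have its component logs silently dropped.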
I may tack on some additional flaky tests here, but I wanted to stage one at a time to check that e2e tests pass (and that I'm not making any bad assumptions).
/kind feature
/test check-pr-has-kind-label
First test passed, adding
/lgtm This looks really cool! I think it's going to make debugging flakes a lot simpler, by moving all the relevant information into the same place.
Yeah, the intent of it was to try and pre-filter, join and interleave all of the relevant logs into It can mean large logs, but it is generally pretty useful.
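The "pre-filter, join and interleave" idea can be sketched in a few lines of Go. Everything here (`entry`, `at`, `interleave`) is hypothetical and only illustrates the technique on in-memory data; it is not how logstream itself is implemented.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// entry is a hypothetical, already-parsed log record from one component.
type entry struct {
	When      time.Time
	Component string
	Key       string // namespace/name of the resource being reconciled
	Message   string
}

// at builds a timestamp a given number of seconds after an arbitrary base.
// It exists only to keep the example data short.
func at(sec int) time.Time {
	return time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC).Add(time.Duration(sec) * time.Second)
}

// interleave keeps the entries whose key contains the test's name prefix
// (pre-filter), merges all component streams (join), and orders the result
// by timestamp (interleave).
func interleave(prefix string, streams ...[]entry) []entry {
	var out []entry
	for _, s := range streams {
		for _, e := range s {
			if strings.Contains(e.Key, prefix) {
				out = append(out, e)
			}
		}
	}
	// A stable sort keeps per-component order for entries with equal timestamps.
	sort.SliceStable(out, func(i, j int) bool { return out[i].When.Before(out[j].When) })
	return out
}

func main() {
	controller := []entry{
		{When: at(2), Component: "controller", Key: "ns/my-test-abcd", Message: "reconciled"},
	}
	webhook := []entry{
		{When: at(0), Component: "webhook", Key: "ns/unrelated-wxyz", Message: "noise"},
		{When: at(1), Component: "webhook", Key: "ns/my-test-abcd", Message: "admitted"},
	}
	for _, e := range interleave("my-test", controller, webhook) {
		fmt.Printf("%s %s [%s] %s\n", e.When.Format("15:04:05"), e.Component, e.Key, e.Message)
	}
}
```

Filtering before sorting keeps the merged output proportional to what the test actually touched, which matters given the note above that the resulting logs can get large.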
I have a couple more queued up that I've seen flake downstream in the last ~24 hours: I'll push a commit for those once the above comes back clean 🤞
Next one passed. I've pushed two more commits with two more tests. I'll stop there, so we can hopefully checkpoint things and get some better diagnostics from failures 🤞
/lgtm
This looks good, thank you!
The integration job is still running; once it's done I'll be able to see the difference :)
If we decide to settle for this, it would be good to start documenting best practices for writing tests so that new tests fit this approach.
@afrittoli yup, for once we documented it in PKG 😅 https://github.com/knative/pkg/blob/master/test/logstream/README.md
Here's a sample log line from a serving e2e flake:
The format is
Full test output (which I picked that from): https://prow.knative.dev/view/gcs/knative-prow/logs/ci-knative-serving-contour-latest/1315090891001565185
I'm not gonna lie either, I partly did this because JSON logs made my eyes bleed outside of something like stackdriver 🤣
You can start to see it if you crack the raw logs of the running job too:
Yeah, this looks really good!
Something that is missing today in logs is a timestamp (this is a separate issue) but you can see it reflected here in the 00:00:00 timestamp:
TestTaskRunPipelineRunCancel/retries=1: kubelogs.go:197: D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj [controller/controller.go:397] [arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26 (depth: 2)
/approve
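For anyone who wants to post-process lines like the one quoted above, here is a small Go sketch that splits the part after the `kubelogs.go:197: ` prefix into fields. The field names (level, timestamp, pod, caller, key, message) are assumptions read off that single sample line, not a documented format, and `parseLine` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// logLine holds the fields inferred from the sample line. The names are
// assumptions made for this example.
type logLine struct {
	Level     string
	Timestamp string
	Pod       string
	Caller    string
	Key       string
	Message   string
}

// linePattern matches: LEVEL HH:MM:SS.mmm POD [CALLER] [KEY] MESSAGE
var linePattern = regexp.MustCompile(`^(\S) (\d{2}:\d{2}:\d{2}\.\d{3}) (\S+) \[([^\]]+)\] \[([^\]]+)\] (.*)$`)

// parseLine returns the parsed fields, or false if the line does not have
// the assumed shape.
func parseLine(s string) (logLine, bool) {
	m := linePattern.FindStringSubmatch(s)
	if m == nil {
		return logLine{}, false
	}
	return logLine{
		Level:     m[1],
		Timestamp: m[2],
		Pod:       m[3],
		Caller:    m[4],
		Key:       m[5],
		Message:   m[6],
	}, true
}

func main() {
	sample := "D 00:00:00.000 tekton-pipelines-controller-68f598fdd7-j7xkj " +
		"[controller/controller.go:397] " +
		"[arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26] " +
		"Adding to queue arendelle-6p6bp/task-run-pipeline-run-cancel-retries-1-djhhcdvh-task-wfj26 (depth: 2)"

	if l, ok := parseLine(sample); ok {
		fmt.Printf("level=%s time=%s pod=%s\ncaller=%s\nkey=%s\nmsg=%s\n",
			l.Level, l.Timestamp, l.Pod, l.Caller, l.Key, l.Message)
	}
}
```

With the key extracted, the all-zero timestamp mentioned above is easy to detect per line, which could help confirm when real timestamps start showing up.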
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: afrittoli The full list of commands accepted by this bot can be found here. The pull request process is described here
@afrittoli 😅 yeah I think that's pretty easy to add. You folks don't put timestamps in your logs by default 🤯
Yeah, I'm not sure why, and every time I forget to make the PR... Will do this time
Reviewer Notes
If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.
Release Notes
/assign @imjasonh @vdemeester @afrittoli