
Add logging to TimeoutHandler #1335

Merged · 1 commit merged into tektoncd:master on Oct 11, 2019

Conversation

fraenkel (Contributor)

Fixes #1307

Changes

The reconcile tests were causing panics or tripping race detection because the timeout handler was using the *testing.T methods after the test was already marked done. The solution is to use loggers that are not tied to the testing framework. The downside is that logs from prior tests may appear intermixed, but that is usually just the message associated with stopping the timeout.
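
As a hedged illustration of the approach described above (a minimal sketch; the helper name and constructor choice are assumptions, not the PR's actual code), the tests can build a logger whose lifecycle is independent of *testing.T:

```go
package sketch

import "go.uber.org/zap"

// newTestLogger is a hypothetical helper: it returns a zap logger that writes
// to stdout and never touches *testing.T, so goroutines that outlive a test
// (for example the one that stops the timeout handler) can keep logging safely.
func newTestLogger() *zap.SugaredLogger {
	return zap.NewExample().Sugar()
}
```

The trade-off is exactly the one noted above: because the output is not scoped to a single test, log lines from a previous test's shutdown can appear intermixed.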

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them. See the contribution guide for more details.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS, and backwards-incompatible changes must be approved by more than 50% of the OWNERS and must first be added in a backwards-compatible way.

@googlebot added the cla: yes label on Sep 20, 2019
@tekton-robot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) on Sep 20, 2019
@tekton-robot (Collaborator)

Hi @fraenkel. Thanks for your PR.

I'm waiting for a tektoncd member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) on Sep 20, 2019
@vdemeester (Member) left a comment

/ok-to-test

@tekton-robot added the ok-to-test label (indicates a non-member PR verified by an org member as safe to test) and removed the needs-ok-to-test label on Sep 20, 2019
@tekton-robot (Collaborator)

The following is the coverage report on pkg/. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/timeout_handler.go | 79.1% | 75.2% | -3.9 |

@vdemeester (Member) left a comment

@fraenkel SGTM but needs a rebase

@tekton-robot (Collaborator)

The following is the coverage report on pkg/. Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/reconciler/timeout_handler.go | 79.1% | 75.2% | -3.9 |

@vdemeester (Member) left a comment

/lgtm

@tekton-robot added the lgtm label (indicates that a PR is ready to be merged) on Oct 11, 2019
@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@tekton-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Oct 11, 2019
@tekton-robot merged commit ab2482a into tektoncd:master on Oct 11, 2019
@fraenkel deleted the logging branch on March 3, 2020
@bobcatfish (Collaborator)

Hey @fraenkel, I'm super late to the party here, but if you happen to remember:

> The reconcile tests were causing panics or tripping race detection because the timeout handler was using the *testing.T methods after the test was already marked done.

Are you sure the problem is specific to the testing logger? I am under the impression that this race condition would also happen in normal controller execution, since the same logger is shared across goroutines there too.

@fraenkel (Contributor, Author)

The original code used the zaptest logger, which writes to t.Logf and may call t.Fail.
t.Fail is a dangerous thing to call from random goroutines: it does not guarantee that the actual test will fail, since it may run after the test has already completed.
As for t.Logf, see this recent issue: golang/go#40343.
The documentation for testing.T states that while Log and Error can be used from other goroutines, the other methods may not be.
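
For context, a minimal sketch of the failure mode described here (a hypothetical test, not Tekton's code): a zaptest logger routes everything through t.Logf, so a goroutine that logs after the test function returns panics or trips the race detector.

```go
package sketch

import (
	"testing"
	"time"

	"go.uber.org/zap/zaptest"
)

// TestLeakedLoggerGoroutine is illustrative only: the goroutine below outlives
// the test body, so its call into the zaptest-backed logger lands on t.Logf
// after the test has completed, which panics ("Log in goroutine after ... has
// completed") or is flagged by -race.
func TestLeakedLoggerGoroutine(t *testing.T) {
	logger := zaptest.NewLogger(t).Sugar()
	go func() {
		time.Sleep(50 * time.Millisecond) // outlives the test
		logger.Info("stopping timeout")   // unsafe once the test is done
	}()
	// The test returns immediately; the goroutine above is still running.
}
```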

@bobcatfish (Collaborator)

Hey @fraenkel, thanks for the update!

I think that explains why the tests would catch a race condition; however, I'm wondering whether working around the race condition in the test logging functions means we are now hiding an existing one.

When the timeout handler is used in the controller, it is passed a logger object, so I believe every goroutine invoked by the timeout handler will be using the same logger:

This is where the logger is passed in:

timeoutHandler := timeout.NewHandler(ctx.Done(), logger)

From the context via:
https://github.com/knative/pkg/blob/68b3eeeaab46798f952dc7277c33d2b2bbae792b/logging/logger.go#L49-L57

So while updating the tests means there is no race condition in the tests, I'm concerned that in the actual controller we're still using a non-thread-safe logger object and writing to it across multiple goroutines; and further, by removing the possibility of having a race in the tests, I think we might be covering that up?
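
To make the concern concrete, here is a hypothetical shape of that pattern (a sketch, not Tekton's actual handler code): one SugaredLogger stored on the handler and written to from every timeout goroutine it spawns.

```go
package sketch

import (
	"time"

	"go.uber.org/zap"
)

// handler is a hypothetical stand-in for the timeout handler: it keeps a
// single logger, and every goroutine it starts writes to that same instance.
type handler struct {
	logger *zap.SugaredLogger
}

func (h *handler) waitWithTimeout(name string, d time.Duration) {
	go func() {
		<-time.After(d)
		h.logger.Infof("run %q timed out", name) // shared logger, separate goroutine
	}()
}
```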

@fraenkel (Contributor, Author)

@bobcatfish I wouldn't say we are hiding a race. Most loggers are thread-safe. t.Log, however, is a special-purpose logger tied to a different lifecycle. If you really want to use a testing.T underneath the logger, you need to make a stronger gesture to stop all users of the logger before the test completes. This is a pretty tall task but could be valuable.
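
One way to read that "stronger gesture" (a sketch under the assumption that the logger stays backed by *testing.T; not what this PR implemented) is to make the test block until every goroutine that uses the logger has finished:

```go
package sketch

import (
	"sync"
	"testing"

	"go.uber.org/zap/zaptest"
)

// TestWaitsForLoggerUsers shows the idea only: a WaitGroup guarantees that no
// user of the t-backed logger survives past the end of the test.
func TestWaitsForLoggerUsers(t *testing.T) {
	logger := zaptest.NewLogger(t).Sugar()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		logger.Info("stopping timeout") // safe: the test waits for this goroutine
	}()
	wg.Wait() // nothing that logs outlives the test
}
```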

@bobcatfish (Collaborator)

Hey @fraenkel, unfortunately, as described in #1307, the logger we are using isn't thread-safe :( uber-go/zap#99

As you can see in the example there, this actually results in interleaved logs when the same logger is used across goroutines; and AFAIK, when the controller actually runs, it will be using the same logger across goroutines.

@fraenkel (Contributor, Author)

The zap logger is safe for concurrent use; see https://godoc.org/go.uber.org/zap#Logger.
I use it myself and have had no issues.
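
As a quick sanity check of that claim (an illustrative snippet, not from the PR), a single *zap.Logger can be shared across goroutines and run cleanly under the race detector:

```go
package main

import (
	"sync"

	"go.uber.org/zap"
)

// Ten goroutines write through the same logger; running a program like this
// with `go run -race` reports no data races, consistent with zap's
// documentation that Logger is safe for concurrent use.
func main() {
	logger := zap.NewExample()
	defer logger.Sync()

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			logger.Info("concurrent write", zap.Int("goroutine", n))
		}(i)
	}
	wg.Wait()
}
```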

@bobcatfish (Collaborator)

> The zap logger is safe for concurrent use, https://godoc.org/go.uber.org/zap#Logger.
> I use it myself and have had no issues.

Looking into the issue I linked above, it looks like the zap logger used not to be thread-safe, but that was fixed ages ago (2017?) and we've been using versions that include this fix all along.

Thanks for the explanation @fraenkel

Development

Successfully merging this pull request may close these issues:

Find a threadsafe way to log in timeout_handler (#1307)

5 participants