Check and/or add to sent cloud event cache in one operation #5313

Merged

Conversation

@abayer (Contributor) commented Aug 11, 2022

Changes

fixes (hopefully) #5160

At least some of the TestReconcile_CloudEvents failures in pkg/reconciler/run/run_test.go we've been seeing are due to a race condition around the event cache used to prevent double-sending Run events. We check the cache, then send the event, and only then write to the cache, creating a race where an initial attempt to send an event may not write to the cache until after a second attempt has already gotten past the cache check. This fixes that by writing to the cache immediately after checking it.
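
To make the interleaving concrete, here is a minimal, self-contained sketch of that check-then-send-then-record pattern. The names (sentCache, markSent, and so on) are hypothetical stand-ins rather than the actual reconciler code:

```go
// A hypothetical illustration of the race described above: both attempts
// pass the cache check before either one records the event as sent.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// sentCache remembers which event keys have already been sent.
type sentCache struct {
    mu   sync.Mutex
    seen map[string]bool
}

func (c *sentCache) isSent(key string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.seen[key]
}

func (c *sentCache) markSent(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.seen[key] = true
}

func main() {
    cache := &sentCache{seen: map[string]bool{}}
    var sends int64
    var wg sync.WaitGroup

    // Two reconcile attempts race to send the same Run event.
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if cache.isSent("run-event") { // 1. check the cache
                return
            }
            time.Sleep(10 * time.Millisecond) // 2. "send" the event (slow, may retry)
            atomic.AddInt64(&sends, 1)
            cache.markSent("run-event") // 3. only now record it as sent
        }()
    }
    wg.Wait()

    // Typically prints 2: both attempts passed the check in step 1 before
    // either reached step 3, so the event went out twice.
    fmt.Println("sends:", atomic.LoadInt64(&sends))
}
```

Recording the event as sent at the same time as the check, rather than only after the send completes, closes that window.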

In my test bed, with this fix (and a 50ms sleep added to CheckEventsOrdered and CheckEventsUnordered to make sure that we leave some leeway for asynchronous event delivery to complete before checking events), TestReconcile_CloudEvents went from failing a fairly consistent one out of every ten or so runs to failing 20 times out of 1,800 runs over the weekend. So there's still some flakiness there, but much, much less than before.

/kind flake

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), release-note-none (Denotes a PR that doesn't merit a release note.), and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) labels on Aug 11, 2022
@abayer (Contributor, Author) commented Aug 11, 2022

/kind flake
/assign @lbernick

@tekton-robot added the kind/flake label (Categorizes issue or PR as related to a flakey test) on Aug 11, 2022
@abayer (Contributor, Author) commented Aug 11, 2022

/retest

@abayer (Contributor, Author) commented Aug 11, 2022

Ok, the flake count went down considerably, but I also saw a case of an event being sent twice even though it was theoretically cached. Digging into that now...

@abayer (Contributor, Author) commented Aug 11, 2022

/retest

There's definitely something else wrong with the event cache - I'm seeing two attempts to send an identical event, which it shouldn't be doing. Digging further!

@lbernick (Member) commented:

What's the actual race condition here? This seems like it might be a bandaid over most of the test flakes, but is it possible to fix the underlying problem?

@abayer (Contributor, Author) commented Aug 11, 2022

> What's the actual race condition here? This seems like it might be a bandaid over most of the test flakes, but is it possible to fix the underlying problem?

That's part of what I'm still trying to figure out. If I'm right about the retries (which I'm less sure of now!), the race would be that sometimes, under load, sending the cloud event fails the first time and gets resent 10ms later, which is fine. That's why we have the retries there - the end result is correct. It's just the test scenario that has problems, because it expects all the events to be delivered by the time the PipelineRun reconciliation is over...which may not be the case if a send had to retry.

But I have now seen at least one case (with a bunch of extra debug logging tossed in to give me better visibility into what's actually happening) of an event getting sent successfully and then, when the second reconcile happens, being sent again even though it's already in the cloud event controller cache and shouldn't be re-sent. So that's where I'm poking around right now.

@abayer (Contributor, Author) commented Aug 11, 2022

Also, to make it more fun, in my test setup, TestReconcile_InvalidPipelineRuns in pkg/reconciler/pipelinerun/pipelinerun_test.go is flaking out fairly often, which I've never seen in CI and am assuming is just a weird artifact of the resource-limited setup I'm running in...which, of course, could throw off all my theories around the cloud events tests. Fun!

@abayer (Contributor, Author) commented Aug 12, 2022

So I've added 1cdc4fa to this PR. As I detailed in #5160, I've narrowed down the cause of the run_test cloud events failures to multiple events for the same Run being sent before the first send finishes, so adding a mutex around cloud event dispatch in the Run reconciler should solve that. This particular issue is very specific to Runs - we only do caching to prevent duplicate event dispatch for Runs. I believe this mutex should also deal with the race condition that shows up sometimes in the various cloud event tests in pkg/reconciler/run/run_test.go.

That said, I think the sleep in the original commit here is still valuable for other event test flakiness, such as TestSendKubernetesEvents, TestSendCloudEventWithRetries/test-send-cloud-event-taskrun, and TestReconcile_CloudEvents in taskrun_test.go. Those event-related flakes are all cases where events, whether k8s events or cloud events, have not yet been received when we go to check them, unlike the Run cloud event flakes, which are caused by a duplicate event being received. So I do still think that a delay should cut down on the "missing event" flakes.
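
For illustration only, a rough sketch of the kind of delay described above, assuming a test helper that drains buffered events from a channel before comparing them; the helper name and signature here are hypothetical, not the actual CheckEventsOrdered/CheckEventsUnordered code:

```go
package eventtest

import "time"

// collectEvents gives asynchronous event delivery a brief window to
// complete (e.g. 50 * time.Millisecond) before draining whatever has
// arrived on the channel.
func collectEvents(events chan string, wait time.Duration) []string {
    time.Sleep(wait)

    var received []string
    for {
        select {
        case e := <-events:
            received = append(received, e)
        default:
            // Nothing else buffered; return what we saw.
            return received
        }
    }
}
```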

@afrittoli (Member) commented:

Thanks for this. I wonder if we should add the mutex further down in the code structure, though, because this will hold up the entire reconcile cycle, so it will basically force reconciles to not happen in parallel as long as cloud events for Runs are enabled. It feels like this would slow things down a lot, because the event will be taken from the queue and sit there waiting.

Perhaps it would be enough to add locks to the methods in https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/events/cache/cache.go, similar to what the original cache class does?

@abayer (Contributor, Author) commented Aug 12, 2022

> Thanks for this. I wonder if we should add the mutex further down in the code structure, though, because this will hold up the entire reconcile cycle, so it will basically force reconciles to not happen in parallel as long as cloud events for Runs are enabled. It feels like this would slow things down a lot, because the event will be taken from the queue and sit there waiting.

Yeah, I don't love this location, but the thing is that we do actually need to lock the whole of events.EmitCloudEvents, or at least all of cloudevent.SendCloudEventWithRetries. That said, I'm 100% sure there's a better way to do it than what I've done here; I just haven't landed on it yet.

> Perhaps it would be enough to add locks to the methods in https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/events/cache/cache.go, similar to what the original cache class does?

No, that wouldn't work, because the problem isn't the cache calls themselves. The problem is that

wasIn <- nil
logger.Debugf("Sending cloudevent of type %q", event.Type())
// In case of Run event, check cache if cloudevent is already sent
if isRun {
    cloudEventSent, err := cache.IsCloudEventSent(cacheClient, event)
    if err != nil {
        logger.Errorf("error while checking cache: %s", err)
    }
    if cloudEventSent {
        logger.Infof("cloudevent %v already sent", event)
        return
    }
}
if result := ceClient.Send(cloudevents.ContextWithRetriesExponentialBackoff(ctx, 10*time.Millisecond, 10), *event); !cloudevents.IsACK(result) {
    logger.Warnf("Failed to send cloudevent: %s", result.Error())
    recorder := controller.GetEventRecorder(ctx)
    if recorder == nil {
        logger.Warnf("No recorder in context, cannot emit error event")
    }
    recorder.Event(object, corev1.EventTypeWarning, "Cloud Event Failure", result.Error())
}
// In case of Run event, add to the cache to avoid duplicate events
if isRun {
    if err := cache.AddEventSentToCache(cacheClient, event); err != nil {
        logger.Errorf("error while adding sent event to cache: %s", err)
    }
}
isn't locked, so two different calls can be somewhere in that block at the same time. What's happening is that call 1 gets past cache.IsCloudEventSent(...) and into ceClient.Send(...), and before it reaches cache.AddEventSentToCache(...), call 2 gets to cache.IsCloudEventSent(...). So really, we want a lock on that go func()....
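
For reference, a minimal sketch of the kind of lock being talked about here: a package-level mutex held across the check, the send, and the cache write, so a second attempt cannot pass the cache check until the first attempt has recorded its event. The function and parameter names are hypothetical, not the actual Tekton code:

```go
package cloudevent

import "sync"

// sendMu serializes the whole check/send/record sequence. Holding it across
// the send is what closes the race, but it also means sends queue up behind
// each other.
var sendMu sync.Mutex

// sendIfNotCached is a hypothetical wrapper: alreadySent and markSent stand
// in for the cache calls, and send stands in for ceClient.Send.
func sendIfNotCached(key string, alreadySent func(string) bool, markSent func(string), send func() error) error {
    sendMu.Lock()
    defer sendMu.Unlock()

    if alreadySent(key) {
        return nil // duplicate; skip sending
    }
    if err := send(); err != nil {
        return err
    }
    markSent(key)
    return nil
}
```

The obvious cost, raised in the next comment, is that a lock held across a send that may retry serializes dispatch, which is what motivates making the cache check-and-add itself atomic instead.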

@afrittoli (Member) commented:

> What's happening is that call 1 gets past cache.IsCloudEventSent(...) and into ceClient.Send(...), and before it reaches cache.AddEventSentToCache(...), call 2 gets to cache.IsCloudEventSent(...). So really, we want a lock on that go func()....

So, I'm thinking that we should change this function to add the event to the cache before we invoke the actual send function, since that might do retries and take an arbitrary amount of time to complete.

I would propose to:

  • Change the AddEventSentToCache function to lock, read, write, unlock, and return whether the event was already there
  • Call AddEventSentToCache where IsCloudEventSent is now, and use the result to decide whether to send the event or not (see the sketch below)
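
A rough sketch of what that could look like, assuming the cache is hashicorp/golang-lru's lru.Cache (whose operations are already internally locked, and which offers ContainsOrAdd as a single check-and-add call); the key-derivation helper below is hypothetical rather than the real one:

```go
package cache

import (
    "fmt"

    cloudevents "github.com/cloudevents/sdk-go/v2"
    lru "github.com/hashicorp/golang-lru"
)

// eventCacheKey is a hypothetical stand-in for however the cache key is
// actually derived from the event.
func eventCacheKey(event *cloudevents.Event) string {
    return fmt.Sprintf("%s/%s/%s", event.Type(), event.Source(), event.Subject())
}

// AddEventSentToCache records the event as sent and reports whether it was
// already there, as one operation: ContainsOrAdd checks and adds under the
// cache's own lock, so no second caller can slip in between the check and
// the write.
func AddEventSentToCache(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {
    if cacheClient == nil {
        return false, fmt.Errorf("cache client is nil")
    }
    alreadySent, _ := cacheClient.ContainsOrAdd(eventCacheKey(event), nil)
    return alreadySent, nil
}
```

The send path could then call something like alreadySent, err := cache.AddEventSentToCache(cacheClient, event) and return early when alreadySent is true, instead of the separate IsCloudEventSent check followed by a later AddEventSentToCache write.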

@abayer (Contributor, Author) commented Aug 12, 2022

Yeah, that all sounds good to me. Lemme do that.

@abayer force-pushed the hopefully-deal-with-check-events-race branch from 1cdc4fa to 344cc1d on August 12, 2022 at 18:00
@tekton-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) and removed the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files.) on Aug 12, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@abayer (Contributor, Author) commented Aug 12, 2022

/retest

@abayer force-pushed the hopefully-deal-with-check-events-race branch from 344cc1d to 91f4f98 on August 15, 2022 at 14:39
@abayer changed the title from "WIP: Have event checking test helpers wait 50ms before reading events from channel" to "Check and/or add to sent cloud event cache in one operation" on Aug 15, 2022
@tekton-robot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Aug 15, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@abayer (Contributor, Author) commented Aug 15, 2022

/retest

@abayer (Contributor, Author) commented Aug 15, 2022

/assign @afrittoli @jerop

@abayer (Contributor, Author) commented Aug 17, 2022

/assign @vdemeester

@tekton-robot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 18, 2022
@abayer (Contributor, Author) commented Aug 19, 2022

ping @afrittoli @jerop @lbernick =)

A Member commented on the diff in pkg/reconciler/events/cache/cache.go:

    cacheClient.Add(eventKey, nil)
    return nil
}

// IsCloudEventSent checks if the event exists in the cache
func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {

looks like we need to update this function name and description

Suggested change:
-func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {
+// ContainsOrAddCloudEvent checks if the event exists in the cache, and if not, adds the event
+func ContainsOrAddCloudEvent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {

@abayer (Contributor, Author) replied:

Whoops! Doing so now.

fixes (hopefully) tektoncd#5160

Signed-off-by: Andrew Bayer <[email protected]>

Check and/or add to sent cloud event cache in one operation

Signed-off-by: Andrew Bayer <[email protected]>
@abayer force-pushed the hopefully-deal-with-check-events-race branch from 91f4f98 to 9207596 on August 19, 2022 at 15:28
@tekton-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) and removed the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) on Aug 19, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@jerop (Member) left a comment:

/lgtm

thanks @abayer!

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 19, 2022
@tekton-robot merged commit 7dfe4ed into tektoncd:main on Aug 19, 2022