Check and/or add to sent cloud event cache in one operation #5313

Merged

Conversation

@abayer (Contributor) commented Aug 11, 2022

Changes

fixes (hopefully) #5160

At least some of the TestReconcile_CloudEvents failures in pkg/reconciler/run/run_test.go we've been seeing are due to a race condition around the event cache used to prevent double-sending Run events. We check the cache, then send the event, and only then write to the cache, creating a race where an initial attempt to send an event may not write to the cache until after a second attempt has already gotten past the cache check. This fixes that by writing to the cache immediately after checking it.
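
To make the interleaving concrete, here is a minimal, self-contained sketch of that check-then-send-then-record pattern. The names (sentCache, markSent, and so on) are hypothetical stand-ins rather than the actual reconciler code:

```go
// A hypothetical illustration of the race described above: both attempts
// pass the cache check before either one records the event as sent.
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// sentCache remembers which event keys have already been sent.
type sentCache struct {
    mu   sync.Mutex
    seen map[string]bool
}

func (c *sentCache) isSent(key string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.seen[key]
}

func (c *sentCache) markSent(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.seen[key] = true
}

func main() {
    cache := &sentCache{seen: map[string]bool{}}
    var sends int64
    var wg sync.WaitGroup

    // Two reconcile attempts race to send the same Run event.
    for i := 0; i < 2; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            if cache.isSent("run-event") { // 1. check the cache
                return
            }
            time.Sleep(10 * time.Millisecond) // 2. "send" the event (slow, may retry)
            atomic.AddInt64(&sends, 1)
            cache.markSent("run-event") // 3. only now record it as sent
        }()
    }
    wg.Wait()

    // Typically prints 2: both attempts passed the check in step 1 before
    // either reached step 3, so the event went out twice.
    fmt.Println("sends:", atomic.LoadInt64(&sends))
}
```

Recording the event as sent at the same time as the check, rather than only after the send completes, closes that window.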

In my test bed, with this fix (and a 50ms sleep added to CheckEventsOrdered and CheckEventsUnordered to make sure that we leave some leeway for asynchronous event delivery to complete before checking events), TestReconcile_CloudEvents went from failing a fairly consistent one out of every ten or so runs to failing 20 times out of 1,800 runs over the weekend. So there's still some flakiness there, but much, much less than before.

/kind flake

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

  • Has Docs included if any changes are user facing
  • Has Tests included if any functionality added or changed
  • Follows the commit message standard
  • Meets the Tekton contributor standards (including
    functionality, content, code)
  • Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

@tekton-robot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.), release-note-none (Denotes a PR that doesn't merit a release note.), and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) labels on Aug 11, 2022
@abayer (Contributor, Author) commented Aug 11, 2022

/kind flake
/assign @lbernick

@tekton-robot added the kind/flake label (Categorizes issue or PR as related to a flakey test) on Aug 11, 2022
@abayer (Contributor, Author) commented Aug 11, 2022

/retest

@abayer (Contributor, Author) commented Aug 11, 2022

Ok, the flake count went down considerably, but I also saw a case of an event being sent twice even though it was theoretically cached. Digging into that now...

@abayer (Contributor, Author) commented Aug 11, 2022

/retest

There's definitely something else wrong with the event cache - I'm seeing two attempts to send an identical event, which it shouldn't be doing. Digging further!

@lbernick (Member) commented:

What's the actual race condition here? This seems like it might be a bandaid over most of the test flakes, but is it possible to fix the underlying problem?

@abayer (Contributor, Author) commented Aug 11, 2022

> What's the actual race condition here? This seems like it might be a bandaid over most of the test flakes, but is it possible to fix the underlying problem?

That's part of what I'm still trying to figure out. If I'm right about the retries (which I'm less sure of now!), the race would be that sometimes, under load, sending the cloud event fails the first time and gets resent 10ms later, which is fine. That's why we have the retries there - the end result is correct. It's just the test scenario that has problems, because it expects all the events to be delivered by the time the PipelineRun reconciliation is over...which may not be the case if a send had to retry.

But I have now seen at least one case (with a bunch of extra debug logging tossed in to give me better visibility into what's actually happening) of an event getting sent successfully and then, when the second reconcile happens, being sent again even though it's already in the cloud event controller cache and shouldn't be re-sent. So that's where I'm poking around right now.

@abayer (Contributor, Author) commented Aug 11, 2022

Also, to make it more fun, in my test setup, TestReconcile_InvalidPipelineRuns in pkg/reconciler/pipelinerun/pipelinerun_test.go is flaking out fairly often, which I've never seen in CI and am assuming is just a weird artifact of the resource-limited setup I'm running in...which, of course, could throw off all my theories around the cloud events tests. Fun!

@abayer (Contributor, Author) commented Aug 12, 2022

So I've added 1cdc4fa to this PR. As I detailed in #5160, I've narrowed down the cause of the run_test cloud events failures to multiple events for the same Run being sent before the first send finishes, so adding a mutex around cloud event dispatch in the Run reconciler should solve that. This particular issue is very specific to Runs - we only do caching to prevent duplicate event dispatch for Runs. I believe this mutex should also deal with the race condition that shows up sometimes in the various cloud event tests in pkg/reconciler/run/run_test.go.

That said, I think the sleep in the original commit here is still valuable for other event test flakiness, such as TestSendKubernetesEvents, TestSendCloudEventWithRetries/test-send-cloud-event-taskrun, and TestReconcile_CloudEvents in taskrun_test.go. Those event-related flakes are all cases where events, whether k8s events or cloud events, have not yet been received when we go to check them, unlike the Run cloud event flakes, which are caused by a duplicate event being received. So I do still think that a delay should cut down on the "missing event" flakes.
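
For illustration only, a rough sketch of the kind of delay described above, assuming a test helper that drains buffered events from a channel before comparing them; the helper name and signature here are hypothetical, not the actual CheckEventsOrdered/CheckEventsUnordered code:

```go
package eventtest

import "time"

// collectEvents gives asynchronous event delivery a brief window to
// complete (e.g. 50 * time.Millisecond) before draining whatever has
// arrived on the channel.
func collectEvents(events chan string, wait time.Duration) []string {
    time.Sleep(wait)

    var received []string
    for {
        select {
        case e := <-events:
            received = append(received, e)
        default:
            // Nothing else buffered; return what we saw.
            return received
        }
    }
}
```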

@afrittoli (Member) commented:

Thanks for this. I wonder if we should add the mutex further down in the code structure, though, because this will hold up the entire reconcile cycle, so it will basically force reconciles to not happen in parallel as long as cloud events for Runs are enabled. It feels like this would slow things down a lot, because the event will be taken from the queue and sit there waiting.

Perhaps it would be enough to add locks to the methods in https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/events/cache/cache.go, similar to what the original cache class does?

@abayer (Contributor, Author) commented Aug 12, 2022

> Thanks for this. I wonder if we should add the mutex further down in the code structure, though, because this will hold up the entire reconcile cycle, so it will basically force reconciles to not happen in parallel as long as cloud events for Runs are enabled. It feels like this would slow things down a lot, because the event will be taken from the queue and sit there waiting.

Yeah, I don't love this location, but the thing is that we do actually need to lock the whole of events.EmitCloudEvents, or at least all of cloudevent.SendCloudEventWithRetries. That said, I'm 100% sure there's a better way to do it than what I've done here; I just haven't landed on it yet.

> Perhaps it would be enough to add locks to the methods in https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/events/cache/cache.go, similar to what the original cache class does?

No, that wouldn't work, because the problem isn't the cache calls themselves. The problem is that

wasIn <- nil
logger.Debugf("Sending cloudevent of type %q", event.Type())
// In case of Run event, check cache if cloudevent is already sent
if isRun {
    cloudEventSent, err := cache.IsCloudEventSent(cacheClient, event)
    if err != nil {
        logger.Errorf("error while checking cache: %s", err)
    }
    if cloudEventSent {
        logger.Infof("cloudevent %v already sent", event)
        return
    }
}
if result := ceClient.Send(cloudevents.ContextWithRetriesExponentialBackoff(ctx, 10*time.Millisecond, 10), *event); !cloudevents.IsACK(result) {
    logger.Warnf("Failed to send cloudevent: %s", result.Error())
    recorder := controller.GetEventRecorder(ctx)
    if recorder == nil {
        logger.Warnf("No recorder in context, cannot emit error event")
    }
    recorder.Event(object, corev1.EventTypeWarning, "Cloud Event Failure", result.Error())
}
// In case of Run event, add to the cache to avoid duplicate events
if isRun {
    if err := cache.AddEventSentToCache(cacheClient, event); err != nil {
        logger.Errorf("error while adding sent event to cache: %s", err)
    }
}
isn't locked, so two different calls can be somewhere in that block at the same time. What's happening is that call 1 gets past cache.IsCloudEventSent(...) and into ceClient.Send(...), and before it reaches cache.AddEventSentToCache(...), call 2 gets to cache.IsCloudEventSent(...). So really, we want a lock on that go func()....
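
For reference, a minimal sketch of the kind of lock being talked about here: a package-level mutex held across the check, the send, and the cache write, so a second attempt cannot pass the cache check until the first attempt has recorded its event. The function and parameter names are hypothetical, not the actual Tekton code:

```go
package cloudevent

import "sync"

// sendMu serializes the whole check/send/record sequence. Holding it across
// the send is what closes the race, but it also means sends queue up behind
// each other.
var sendMu sync.Mutex

// sendIfNotCached is a hypothetical wrapper: alreadySent and markSent stand
// in for the cache calls, and send stands in for ceClient.Send.
func sendIfNotCached(key string, alreadySent func(string) bool, markSent func(string), send func() error) error {
    sendMu.Lock()
    defer sendMu.Unlock()

    if alreadySent(key) {
        return nil // duplicate; skip sending
    }
    if err := send(); err != nil {
        return err
    }
    markSent(key)
    return nil
}
```

The obvious cost, raised in the next comment, is that a lock held across a send that may retry serializes dispatch, which is what motivates making the cache check-and-add itself atomic instead.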

@afrittoli (Member) commented:

> What's happening is that call 1 gets past cache.IsCloudEventSent(...) and into ceClient.Send(...), and before it reaches cache.AddEventSentToCache(...), call 2 gets to cache.IsCloudEventSent(...). So really, we want a lock on that go func()....

So, I'm thinking that we should change this function to add the event to the cache before we invoke the actual send function, since that might do retries and take an arbitrary amount of time to complete.

I would propose to:

  • Change the AddEventSentToCache function to lock, read, write, unlock, and return whether the event was already there
  • Call AddEventSentToCache where IsCloudEventSent is now, and use the result to decide whether to send the event or not (see the sketch below)
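
A rough sketch of what that could look like, assuming the cache is hashicorp/golang-lru's lru.Cache (whose operations are already internally locked, and which offers ContainsOrAdd as a single check-and-add call); the key-derivation helper below is hypothetical rather than the real one:

```go
package cache

import (
    "fmt"

    cloudevents "github.com/cloudevents/sdk-go/v2"
    lru "github.com/hashicorp/golang-lru"
)

// eventCacheKey is a hypothetical stand-in for however the cache key is
// actually derived from the event.
func eventCacheKey(event *cloudevents.Event) string {
    return fmt.Sprintf("%s/%s/%s", event.Type(), event.Source(), event.Subject())
}

// AddEventSentToCache records the event as sent and reports whether it was
// already there, as one operation: ContainsOrAdd checks and adds under the
// cache's own lock, so no second caller can slip in between the check and
// the write.
func AddEventSentToCache(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {
    if cacheClient == nil {
        return false, fmt.Errorf("cache client is nil")
    }
    alreadySent, _ := cacheClient.ContainsOrAdd(eventCacheKey(event), nil)
    return alreadySent, nil
}
```

The send path could then call something like alreadySent, err := cache.AddEventSentToCache(cacheClient, event) and return early when alreadySent is true, instead of the separate IsCloudEventSent check followed by a later AddEventSentToCache write.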

@abayer (Contributor, Author) commented Aug 12, 2022

Yeah, that all sounds good to me. Lemme do that.

@abayer force-pushed the hopefully-deal-with-check-events-race branch from 1cdc4fa to 344cc1d on August 12, 2022 at 18:00
@tekton-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) and removed the size/XS label (Denotes a PR that changes 0-9 lines, ignoring generated files.) on Aug 12, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@abayer (Contributor, Author) commented Aug 12, 2022

/retest

@abayer force-pushed the hopefully-deal-with-check-events-race branch from 344cc1d to 91f4f98 on August 15, 2022 at 14:39
@abayer changed the title from "WIP: Have event checking test helpers wait 50ms before reading events from channel" to "Check and/or add to sent cloud event cache in one operation" on Aug 15, 2022
@tekton-robot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress.) on Aug 15, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@abayer (Contributor, Author) commented Aug 15, 2022

/retest

@abayer (Contributor, Author) commented Aug 15, 2022

/assign @afrittoli @jerop

@abayer (Contributor, Author) commented Aug 17, 2022

/assign @vdemeester

@tekton-robot (Collaborator) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 18, 2022
@abayer (Contributor, Author) commented Aug 19, 2022

ping @afrittoli @jerop @lbernick =)

A Member commented on the diff in pkg/reconciler/events/cache/cache.go:

    cacheClient.Add(eventKey, nil)
    return nil
}

// IsCloudEventSent checks if the event exists in the cache
func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {

looks like we need to update this function name and description

Suggested change:
-func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {
+// ContainsOrAddCloudEvent checks if the event exists in the cache, and if not, adds the event
+func ContainsOrAddCloudEvent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) {

@abayer (Contributor, Author) replied:

Whoops! Doing so now.

fixes (hopefully) tektoncd#5160

Signed-off-by: Andrew Bayer <[email protected]>

Check and/or add to sent cloud event cache in one operation

Signed-off-by: Andrew Bayer <[email protected]>
@abayer force-pushed the hopefully-deal-with-check-events-race branch from 91f4f98 to 9207596 on August 19, 2022 at 15:28
@tekton-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) and removed the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) on Aug 19, 2022
@tekton-robot (Collaborator) commented:

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/events/cache/cache.go 78.3% 82.4% 4.1
pkg/reconciler/events/cloudevent/cloud_event_controller.go 88.6% 88.2% -0.3

@jerop (Member) left a comment:

/lgtm

thanks @abayer!

@tekton-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 19, 2022
@tekton-robot merged commit 7dfe4ed into tektoncd:main on Aug 19, 2022