-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check and/or add to sent cloud event cache in one operation #5313
Check and/or add to sent cloud event cache in one operation #5313
Conversation
/kind flake |
/retest |
Ok, the flake count went down considerably, but I also saw a case of an event being set twice even though it was theoretically cached. Digging into that now... |
/retest There's definitely something else wrong with the event cache - I'm seeing two attempts to send an identical event, which it shouldn't be doing. Digging further! |
What's the actual race condition here? This seems like it might be a bandaid over most of the test flakes, but is it possible to fix the underlying problem? |
That's part of what I'm still trying to figure out. If I am right about the retries (which I'm less sure of now!), then the race would be that sometimes, under load, sending the cloud event fails the first time and gets resent 10ms later, which is fine. That's why we have the retries there - the end result is correct, it's just the test scenario that has problems due to expecting all the events to be delivered by the time the But I have now seen at least one case (with a bunch of extra debug logging tossed in to give me better visibility into what's actually happening) of an event getting sent successfully, then when the second reconcile happens, it shouldn't be sending the event again because it's already in the cloud event controller cache, but it is, in fact, being sent again. So that's where I'm poking around in right now. |
Also, to make it more fun, in my test setup, |
So I've added 1cdc4fa to this PR - as I detailed in #5160, I've narrowed down the cause of the That said, I think the sleep in the original commit here is still valuable for other event test flakiness, such as |
Thanks for this. I wonder if we should add the mutex further down in the code structure though, because this will hold the entire reconcile cycle, so it will basically force reconciles to not happen in parallel as long as long as cloudevents for run is enabled. It feels like this would slow down things a lot because the event will be taken from the queue and sit there waiting. Perhaps it would be enough to add locks to the methods in https://github.com/tektoncd/pipeline/blob/main/pkg/reconciler/events/cache/cache.go, similar to what the original cache class does? |
Yeah, I don't love this location, but the thing is that we do actually need to lock the whole of
No, that wouldn't work, because it's not that we're having problems because of the cache calls directly. The problem is that pipeline/pkg/reconciler/events/cloudevent/cloud_event_controller.go Lines 148 to 174 in 45b7c37
cache.IsCloudEventSent(...) and into ceClient.Send(...) , and before call 1 gets past that to get to cache.AddEventSentToCache(...) , call 2 gets to cache.IsCloudEventSent(...) . So really, we want a lock on that go func()... .
|
So, I'm thinking that we should change this function to add the event to the cache before we invoke the actual send function, since that might do retries and take an arbitrary amount of time to complete. I would propose to:
|
Yeah, that all sounds good to me. Lemme do that. |
1cdc4fa
to
344cc1d
Compare
The following is the coverage report on the affected files.
|
/retest |
344cc1d
to
91f4f98
Compare
The following is the coverage report on the affected files.
|
/retest |
/assign @afrittoli @jerop |
/assign @vdemeester |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: vdemeester The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ping @afrittoli @jerop @lbernick =) |
pkg/reconciler/events/cache/cache.go
Outdated
cacheClient.Add(eventKey, nil) | ||
return nil | ||
} | ||
|
||
// IsCloudEventSent checks if the event exists in the cache | ||
func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we need to update this function name and description
func IsCloudEventSent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) { | |
// ContainsOrAddCloudEvent checks if the event exists in the cache, and if not, adds the event | |
func ContainsOrAddCloudEvent(cacheClient *lru.Cache, event *cloudevents.Event) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops! Doing so now.
fixes (hopefully) tektoncd#5160 At least some of the `TestReconcile_CloudEvents` failures in `pkg/reconciler/run/run_test.go` we've been seeing are due to a race condition around the event cache used to prevent double-sending `Run` events. This is because we check the cache, then send the event, and finally write to the cache, creating a race where an initial attempt to send an event may not get to writing to the cache until after a second attempt has gotten past checking the cache. This fixes that by writing to the cache immediately after checking the cache. In my test bed, with this fix (and a 50ms sleep added to `CheckEventsOrdered` and `CheckEventsUnordered` to make sure that we leave some leeway for asynchronous event delivery to complete before checking events), `TestReconcile_CloudEvents` went from failing a relatively consistent one out of every ten or so runs to failing 20 times out of 1,800 runs over the weekend. So there's still some flakiness there, but much, much less than there was previously Signed-off-by: Andrew Bayer <[email protected]> Check and/or add to sent cloud event cache in one operation Signed-off-by: Andrew Bayer <[email protected]>
91f4f98
to
9207596
Compare
The following is the coverage report on the affected files.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
thanks @abayer!
Changes
fixes (hopefully) #5160
At least some of the
TestReconcile_CloudEvents
failures inpkg/reconciler/run/run_test.go
we've been seeing are due to a race condition around the event cache used to prevent double-sendingRun
events. This is because we check the cache, then send the event, and finally write to the cache, creating a race where an initial attempt to send an event may not get to writing to the cache until after a second attempt has gotten past checking the cache. This fixes that by writing to the cache immediately after checking the cache.In my test bed, with this fix (and a 50ms sleep added to
CheckEventsOrdered
andCheckEventsUnordered
to make sure that we leave some leeway for asynchronous event delivery to complete before checking events),TestReconcile_CloudEvents
went from failing a relatively consistent one out of every ten or so runs to failing 20 times out of 1,800 runs over the weekend. So there's still some flakiness there, but much, much less than there was previously/kind flake
Submitter Checklist
As the author of this PR, please check off the items in this checklist:
functionality, content, code)
/kind <type>
. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tepRelease Notes