Skip to content

Commit

Permalink
cilium: Extend endpoint deletion to wait for any ongoing builds to fi…
Browse files Browse the repository at this point in the history
…nish

[ upstream commit f59fe4e ]

Endpoint deletions can race with pending builds as detailed in #32689.
This is due to the fact that the Delete() call does not wait for them
to finish. Instead, it tears down the endpoint, downs the device, removes
tc(x), etc. This is not correct since the inflight from regenerate() can
later still attempt to attach something to the given device. This in itself
is not problematic as the CNI will just remove the device and the kernel
removes all objects along with it. However, this can result in sporadic
agent error logs that a given lxc* device has already disappeared when
the agent tries to attach, but Delete() resp the CNI Del finished. We've
seen this in particular in case of lxc_health. The health endpoint is not
managed by the CNI plugin, but the agent instead. What appears to happen
is that when health endpoint is cleaned, it's removed from the ep manager,
and subsequently the device is removed.. however any ongoing regenerate()
later tries to attach to a non-existing device. Given the Delete() updates
the endpoint state, we also need to wrap this with the build mutex lock.
This is to wait for any pending regenerate() to finish before setting the
new ep state, and any new build requests will bail out early as they see
that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge!

Fixes: #32689
Co-developed-by: Tom Hadlaw <[email protected]>
Signed-off-by: Tom Hadlaw <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
  • Loading branch information
borkmann authored and jrajahalme committed Oct 31, 2024
1 parent 81d6f9d commit 34aca96
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions pkg/endpoint/endpoint.go
Original file line number Diff line number Diff line change
Expand Up @@ -2434,6 +2434,12 @@ func (e *Endpoint) Delete(conf DeleteConfig) []error {

e.Stop()

// Wait for any pending endpoint regenerate() calls to finish. The
// latter bails out after taking the lock when it detects that the
// endpoint state is disconnecting.
e.buildMutex.Lock()
defer e.buildMutex.Unlock()

// Lock out any other writers to the endpoint. In case multiple delete
// requests have been enqueued, have all of them except the first
// return here. Ignore the request if the endpoint is already
Expand Down

0 comments on commit 34aca96

Please sign in to comment.