Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
cilium: Extend endpoint deletion to wait for any ongoing builds to fi…
…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>
- Loading branch information