CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689
Happening very frequently on #32725 too.
I checked through the scheduled tests on main and the first failure I found was https://github.com/cilium/cilium/actions/runs/9195482429/job/25291405008. This is right around when #32518 was merged, which is the last change to the code the "retrieving device" log message comes from. @ti-mo or @lmb, could you check whether this might be the source of this issue?
@rgo3 I heard you were taking a look so I added you as an assignee, just a heads up.
Hit on #32744
Hit this here I think: https://github.com/cilium/cilium/actions/runs/9307792492/job/25619833558
So from my initial investigation yesterday, I would say #32518 is what caused this error to show up in the error logs, because it changed the log level. From looking at the logs of a sysdump where it happened, I can see some sort of race between the loader and the (re)initialization of the health endpoint.
The first two log lines are from the health endpoint (re)initialization, cleaning up the old endpoint after the upgrade. Then we see logs from the loader, trying to load the datapath for the health endpoint but failing because the netlink device has been deleted. After that, we successfully recreate the health endpoint. What's not entirely clear to me is why the loader is trying to attach to the health endpoint in the first place. I'm guessing that the cleanup code for the health endpoint is not removing the endpoint from the endpoint manager (or something along those lines) on shutdown, but I think that should happen when the agent shuts down.
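For illustration, a minimal Go sketch (not Cilium's actual loader code) of how such an attach attempt surfaces the error from the issue title: the loader resolves the device by name right before attaching, and if the device was deleted concurrently the lookup fails with "Link not found". The function name and error wrapping here are assumptions; only the vishvananda/netlink calls are real.

```go
package main

import (
	"errors"
	"fmt"

	"github.com/vishvananda/netlink"
)

// attachDatapath resolves the endpoint's device by name immediately
// before attaching. If the device was torn down concurrently, the
// lookup fails and the error surfaces in the agent logs.
func attachDatapath(ifName string) error {
	l, err := netlink.LinkByName(ifName)
	if err != nil {
		var notFound netlink.LinkNotFoundError
		if errors.As(err, &notFound) {
			return fmt.Errorf("retrieving device %s: %w", ifName, err)
		}
		return err
	}
	_ = l // ... proceed to load and attach the BPF programs ...
	return nil
}

func main() {
	if err := attachDatapath("lxc_health"); err != nil {
		fmt.Println(err) // e.g. "retrieving device lxc_health: Link not found"
	}
}
```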
I'm also wondering why the test fails disproportionately with the following configuration of Cilium: cilium/.github/workflows/tests-e2e-upgrade.yaml, lines 154 to 164 at 8de7a90.
Given that with tcx we switched all attachments to BPF links, which should not be the case for kernels older than ~6.4: we can make this a release blocker for the final 1.16.0, and I'll help Robin with the investigation and fix in the meantime.
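For context, a hedged sketch of the probe-and-fallback idea behind tcx attachments, using github.com/cilium/ebpf; attachLegacyTC is a hypothetical stand-in for the netlink-based legacy tc path, and real code would distinguish "kernel doesn't support tcx" from other errors instead of falling back unconditionally.

```go
package attach

import (
	"fmt"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachIngress prefers a tcx bpf_link where the kernel supports it
// and otherwise falls back to a legacy tc filter (which is not a BPF
// link). The nil link return on the fallback path reflects that.
func attachIngress(ifindex int, prog *ebpf.Program) (link.Link, error) {
	l, err := link.AttachTCX(link.TCXOptions{
		Interface: ifindex,
		Program:   prog,
		Attach:    ebpf.AttachTCXIngress,
	})
	if err == nil {
		return l, nil
	}
	// Hypothetical fallback for kernels without tcx support.
	if fbErr := attachLegacyTC(ifindex, prog); fbErr != nil {
		return nil, fmt.Errorf("tcx attach failed (%v); legacy tc attach failed: %w", err, fbErr)
	}
	return nil, nil
}

func attachLegacyTC(ifindex int, prog *ebpf.Program) error {
	return fmt.Errorf("legacy tc attach not implemented in this sketch")
}
```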
The config is reliably failing [0]. Stabilize the workflow so that we can make it required.

[0] #32689

Signed-off-by: Julian Wiedmann <[email protected]>
Another hit on #33240
As far as I can tell, there are a couple of things going on here:

1. The compilation lock is being held for a long time. In the affected test tuple, the compilation write lock is held for quite a long time (from sysdump logs, around 2:30) during the Reinitialize calls. It's unclear to me why exactly this takes so long (todo: need to narrow down where Reinitialize is spending most of its time), but from an initial look the following two procedures are the slowest parts of the re-initialization (both taking ~30-40 seconds). Todo: narrow down specifically what's going on in this Reinitialize, and how this differs from the passing test runs.

2. Endpoints waiting for the datapath lock will continue regenerating after delete. An endpoint that is being regenerated will eventually need to acquire the compilation lock, which is an rw-lock that can also be held during a Reinitialize. As mentioned, during the failure, the Reinitialize call that holds the write lock always seems to happen right before the lxc_health endpoint is created/generated, so the health endpoint has to wait ~2:30 for this lock to be released before proceeding. If (*Endpoint).Delete is called during this time (e.g. from the endpoint cleanup triggered by the 3-minute ping timeout in the health-endpoint controller) while the endpoint is waiting for the compilation lock, the endpoint will continue into regenerateBPF once the rlock is acquired. This is likely why we're seeing the "lxc_health link not found" errors. It can happen because, although the endpoint regenerate holds the "build" mutex, the Delete procedure does not: it only holds the regular endpoint lock, which is in turn not held by the regenerateBPF section while handling an endpoint regenerate. So there is nothing stopping an Endpoint.Delete() from happening while the endpoint is in the regenerateBPF section (e.g. while waiting for the compilation rlock).

3. Fixing no. 2 makes CI fail on the v1.16 "downgrade" rather than the main "upgrade" part of the test. On my test branch, I made some changes that require the Delete to hold the build mutex, as well as cancelling a regenerate if the endpoint was stopped while waiting for the compilation rlock: 77ebe07. What I observed on CI was:
This tells me that this problem is at least partly responsible for the failure. I'm not 100% sure that adding the buildMutex lock to Delete is a safe thing to do (todo...), but the logic here probably needs some improvement. However, I don't know why the locks are being held for so long, in particular in this test tuple. This does seem specific to test case '7', where I noticed that there is an extra Reinitialize() call in the logs where the test fails. This may be related to the device-reloader issues I mentioned in previous comments.
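To make the locking interaction concrete, here is a self-contained toy reproduction (all names hypothetical, heavily simplified relative to Cilium's endpoint code): regenerateBPF() queues behind the compilation rw-lock while Delete() only takes the regular endpoint lock, so the device can be removed while the build is still pending.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Toy model of the race: compilationLock plays the Reinitialize write
// lock, buildMutex serializes builds, and mu is the regular endpoint
// lock. Delete() deliberately skips buildMutex, mirroring the behavior
// described above.
type endpoint struct {
	mu            sync.Mutex
	buildMutex    sync.Mutex
	deviceRemoved bool
}

var compilationLock sync.RWMutex

func (e *endpoint) regenerateBPF() {
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	compilationLock.RLock() // waits ~2:30 for Reinitialize in the real runs
	defer compilationLock.RUnlock()

	e.mu.Lock()
	gone := e.deviceRemoved
	e.mu.Unlock()
	if gone {
		fmt.Println("retrieving device lxc_health: Link not found")
		return
	}
	fmt.Println("attached")
}

func (e *endpoint) Delete() {
	e.mu.Lock() // buildMutex is NOT taken: the pending build keeps going
	e.deviceRemoved = true
	e.mu.Unlock()
}

func main() {
	e := &endpoint{}
	compilationLock.Lock() // simulate a long Reinitialize holding the write lock
	go e.regenerateBPF()   // build queues behind the write lock
	time.Sleep(50 * time.Millisecond)
	e.Delete() // health-endpoint cleanup fires while the build is pending
	compilationLock.Unlock()
	time.Sleep(50 * time.Millisecond) // let the build observe the removal
}
```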
From my testing, this appears to happen in both directions: when I did something in v1.17 to alleviate the problem, it would then fail in v1.16. Maybe one of the changes on your branch fixed the problem in v1.17?
[...]
Regarding regenerateBPF, we do seem to check whether an endpoint is alive here; potentially this might be too late, i.e. after trying to attach.
I presume that from the CNI side this might never happen, since the lxc devices have somewhat unique names and stick around, so no collision happens when the CNI plugin gets Del/Add requests. The endpoint deletion sets the endpoint into the disconnecting state, such that lockAlive should error.
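A rough approximation of that lockAlive pattern (the real Cilium implementation differs in detail): the check only helps if it runs before the attach, which is the "too late" concern above.

```go
package main

import (
	"errors"
	"sync"
)

var errDisconnecting = errors.New("endpoint is disconnecting")

type Endpoint struct {
	mu    sync.Mutex
	state string // e.g. "ready", "disconnecting"
}

// lockAlive acquires the endpoint lock and fails if the endpoint has
// started disconnecting, so a build that checks aliveness before
// attaching bails out instead of touching a device that may be gone.
func (e *Endpoint) lockAlive() error {
	e.mu.Lock()
	if e.state == "disconnecting" || e.state == "disconnected" {
		e.mu.Unlock()
		return errDisconnecting
	}
	return nil // caller holds e.mu and must unlock it
}

func main() {
	e := &Endpoint{state: "disconnecting"}
	if err := e.lockAlive(); err != nil {
		println(err.Error()) // the build path bails out here
	}
}
```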
It improves the timeout in that we re-ping right after relaunching the health endpoint. I suspect this mitigates the situation somewhat.
…nish

Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight regenerate() can later still attempt to attach something to the given device.

This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp. the CNI Del finished. We've seen this in particular in the case of lxc_health. The health endpoint is not managed by the CNI plugin, but by the agent instead. What appears to happen is that when the health endpoint is cleaned up, it's removed from the ep manager, and subsequently the device is removed; however, any ongoing regenerate() later tries to attach to a non-existing device.

Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, so that any new build requests bail out early as they see that the endpoint is disconnecting.

Kudos to Tom Hadlaw and Robin Goegge for helping with debugging!

Fixes: #32689
Signed-off-by: Daniel Borkmann <[email protected]>
Agree, I think the way to go would be to hold the ep build mutex upon Delete() (here). This ensures any pending build completes, and a subsequent build after Delete() finishes sees the ep in Disconnecting and bails out right away.
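A minimal sketch of that ordering, with hypothetical names and none of the real teardown logic: Delete() takes the build mutex first, so a pending build completes before the state flips, and later builds observe Disconnecting and bail.

```go
package main

import "sync"

type epState int

const (
	stateReady epState = iota
	stateDisconnecting
)

type endpoint struct {
	mu         sync.Mutex // regular endpoint lock
	buildMutex sync.Mutex // serializes builds
	state      epState
}

func (e *endpoint) Delete() {
	// Wait for any pending regenerate() to finish before changing state.
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	e.mu.Lock()
	e.state = stateDisconnecting
	e.mu.Unlock()

	// Only now tear down tc(x) attachments, down the device, etc.;
	// any build queued behind buildMutex sees Disconnecting and bails.
}

func main() {
	e := &endpoint{}
	e.Delete()
}
```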
Currently on shutdown we don't remove the health endpoint from the endpoint manager, but do remove the corresponding netlink interface. Note that validateEndpoint() special-cases the health endpoint and does remove the old state directory if it existed before. But independent of that, it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases.

Related: #32689
Signed-off-by: Robin Gögge <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
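A hedged sketch of the cleanup ordering the commit describes; cleanupHealthEndpoint's shape and the manager interface are approximations, not Cilium's actual API. The point is the order: deregister the endpoint first so no new build can target it, then remove the netlink interface.

```go
package health

import "github.com/vishvananda/netlink"

// endpointManager is a hypothetical stand-in for the agent's endpoint
// manager; only the ordering below matters for this sketch.
type endpointManager interface {
	RemoveHealthEndpoint()
}

func cleanupHealthEndpoint(mgr endpointManager) error {
	// Stop managing the endpoint before touching the device, so no
	// regenerate can be scheduled against a soon-to-be-deleted link.
	mgr.RemoveHealthEndpoint()

	l, err := netlink.LinkByName("lxc_health")
	if err != nil {
		return nil // interface already gone, nothing to do
	}
	return netlink.LinkDel(l)
}
```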
CI failure
Hit on #32688
Link: https://github.com/cilium/cilium/actions/runs/9209728129/job/25335026937
Sysdump: cilium-sysdump-7-final.zip