CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

giorio94 · 2024-05-23T16:00:09Z

CI failure

Hit on #32688
Link: https://github.com/cilium/cilium/actions/runs/9209728129/job/25335026937
Sysdump: cilium-sysdump-7-final.zip

❌ Found 2 logs in kind-kind/kube-system/cilium-d2588 (cilium-agent) matching list of errors that must be investigated:
time="2024-05-23T14:34:10Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-23T14:34:10Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)

The text was updated successfully, but these errors were encountered:

gandro · 2024-05-27T12:38:40Z

Happening very frequently on #32725 too

learnitall · 2024-05-29T15:35:53Z

It looks like this has been consistently failing since May 22nd, with no failures occurring prior. Since May 22nd, here are the number of failures by matrix:

Job Matrix Entry	Kernel	Num Failures	Link
7	bpf-next	25	link
6	6.1	1	link
14	5.14	1	link

joamaki · 2024-05-30T08:45:34Z

I checked through the scheduled tests on main and the first one I found was https://github.com/cilium/cilium/actions/runs/9195482429/job/25291405008. This is right when #32518 was merged that is the last change to where "retrieving device" log message is coming from. @ti-mo or @lmb could you check if this might be the source of this issue?

learnitall · 2024-05-30T14:48:41Z

@rgo3 I heard you were taking a look so I added you as an assignee, just a heads up

youngnick · 2024-05-31T00:44:44Z

Hit on #32744
Link: https://github.com/cilium/cilium/actions/runs/9298554153/job/25590680316
Sysdump: cilium-sysdump-7-final.zip

 ❌ Found 2 logs in kind-kind/kube-system/cilium-ddg8l (cilium-agent) matching list of errors that must be investigated:
time="2024-05-30T07:29:45Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-30T07:29:45Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)

tommyp1ckles · 2024-05-31T02:28:27Z

Hit this here I think: https://github.com/cilium/cilium/actions/runs/9307792492/job/25619833558

rgo3 · 2024-05-31T07:44:31Z

So from my initial investigation yesterday I would say #32518 is what caused this error to show up in the error logs because the log level changed from Warn to Error, but is not the underlying cause of why we can't find lxc_health. Jussi already suspected that in the discussion offline.

From looking at the logs of a sysdump where it happened, I can see some sort of race between loader and initializing the health endpoint:

6103:2024-05-30T19:06:25.246590173Z time="2024-05-30T19:06:25Z" level=debug msg="Killing old health endpoint process" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6104:2024-05-30T19:06:25.247099202Z time="2024-05-30T19:06:25Z" level=debug msg="Killed endpoint process" pid=1037 pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6368:2024-05-30T19:06:26.497522773Z time="2024-05-30T19:06:26Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6380:2024-05-30T19:06:26.497697378Z time="2024-05-30T19:06:26Z" level=info msg="generating BPF for endpoint failed, keeping stale directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" file-path=2033_next_fail identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6384:2024-05-30T19:06:26.497977680Z time="2024-05-30T19:06:26Z" level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=899.216455ms bpfWaitForELF="26.008µs" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ mapSync=9.941627ms policyCalculation="900.128µs" prepareBuild=7.318288ms proxyConfiguration="11.18µs" proxyPolicyCalculation="167.333µs" proxyWaitForAck=0s reason="updated security labels" subsys=endpoint total=1m51.825101738s waitingForCTClean=59.918287ms waitingForLock=1m50.690587536s
6385:2024-05-30T19:06:26.497982489Z time="2024-05-30T19:06:26Z" level=debug msg="Error regenerating endpoint: retrieving device lxc_health: Link not found" ciliumEndpointName=/ code=Failure containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 endpointState=ready identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ policyRevision=0 subsys=endpoint type=200
6400:2024-05-30T19:06:26.510255562Z time="2024-05-30T19:06:26Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6444:2024-05-30T19:06:26.585168967Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=cilium_health
6445:2024-05-30T19:06:26.585171281Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=lxc_health
6446:2024-05-30T19:06:26.601161281Z time="2024-05-30T19:06:26Z" level=debug msg="Created veth pair" subsys=endpoint-connector vethPair="[cilium lxc_health]"
6463:2024-05-30T19:06:26.633890277Z time="2024-05-30T19:06:26Z" level=debug msg="Spawning health endpoint with command \"cilium-health-responder\" [\"--listen\" \"4240\" \"--pidfile\" \"/var/run/cilium/state/health-endpoint.pid\"]" subsys=cilium-health-launcher
6507:2024-05-30T19:06:27.687842061Z time="2024-05-30T19:06:27Z" level=debug msg="cilium-health agent running" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6508:2024-05-30T19:06:27.687921610Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 10.244.3.45/32 dev cilium]\"" subsys=cilium-health-launcher
6509:2024-05-30T19:06:27.720737217Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 0.0.0.0/0 via 10.244.3.45 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6515:2024-05-30T19:06:27.812147656Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add fd00:10:244:3::29a9/128 dev cilium]\"" subsys=cilium-health-launcher
6519:2024-05-30T19:06:27.848067069Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add ::/0 via fd00:10:244:3::29a9 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6726:2024-05-30T19:06:29.196394945Z time="2024-05-30T19:06:29Z" level=info msg="Program cil_from_container attached to device lxc_health using tcx" subsys=datapath-loader

First two log lines are from the health endpoint (re)initialization, cleaning up the old endpoint after the upgrade. Then we see logs from the loader, trying to load the datapath for the health endpoint but failing because the netlink device has been deleted. After that, we successfully recreate the health endpoint (see Created veth pair, Spawning health endpoint...) after which the loader successfully attaches the datapath ... attached to device lxc_health using tcx.

What's not entirely clear to me is why the loader is trying to attach to the health endpoint in the first place. I'm guessing that the cleanup code for the health endpoint is not removing the endpoint from the endpoint manager (or something along these lines) on shutdown but I think that should happen when the agent shuts down.

joestringer · 2024-06-04T23:45:45Z

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

cilium/.github/workflows/tests-e2e-upgrade.yaml

Lines 154 to 164 in 8de7a90

    
           - name: '7' 
        
             # renovate: datasource=docker depName=quay.io/lvh-images/kind 
        
             kernel: 'bpf-next-20240529.013128' 
        
             kube-proxy: 'none' 
        
             kpr: 'true' 
        
             devices: '{eth0,eth1}' 
        
             secondary-network: 'true' 
        
             tunnel: 'disabled' 
        
             lb-mode: 'snat' 
        
             egress-gateway: 'true' 
        
             lb-acceleration: 'testing-only'

borkmann · 2024-06-06T13:25:36Z

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

Given with tcx we switched all attachments to BPF links which should not be the case for <6.4ish kernels. We can make this a release blocker for the final 1.16.0 and I'll help Robin with the investigation + fix in the meantime.

The config is reliably failing [0]. Stabilize the workflow so that we can make it required. [0] #32689 Signed-off-by: Julian Wiedmann <[email protected]>

thorn3r · 2024-06-19T13:34:29Z

Another hit on #33240
Workflow: https://github.com/cilium/cilium/actions/runs/9572595235/job/26392256829
Sysdump: cilium-sysdump-6-final.zip

tommyp1ckles · 2024-07-23T06:15:45Z

As far as I can tell, there's a couple things going on here:

1. The Compilation Lock is being held for a long time...

It appears as though in the affected test tuple, the compilation write lock is being held for quite a long time (from sysdump logs, around 2:30), during the Reinitialize calls.

It's unclear to me why exactly this takes so long (todo: need to narrow down where the Reinitialize is spending most of its time), but from an initial look it looks like the following two procedures are the slowest parts of the re-initialization (both taking ~30-40 seconds):

todo: need to narrow down specifically what's going on in this reinitialize, and how this differs from the passing test runs.

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

An endpoint that is being regenerated will eventually need to acquire the compilation lock, which is a rw-lock that can also be held during a "Reinitialize".

As mentioned, during the failure, the Reinitialize call that holds the write lock always seems to happen right before the lxc_health endpoint is created/generated, so the health endpoint has to wait ~2:30 for this lock to be released before proceeding.

During this time, if a call to (*Endpoint).Delete is called (i.e. from the endpoint cleanup, as a result of the ping timeout (3minutes) in the health-endpoint controller) is made while the endpoint is waiting for the compilation lock then it will continue regeneratingBPF once the rlock is acquired. This is likely why we're seeing the "lxc_health link not found" errors.

This can happen because although endpoint regenerate holds the "build" mutex, the Delete procedure does not. It only holds the regular Endpoint lock - which is actually not held by the regenerateBPF section while handling an Endpoint regenerate. So there's nothing stopping a Endpoint.Delete() from happening while the endpoint is in the regenerateBPF section (i.e. such as waiting for the compilation rlock).

3. Fixing no. 2 makes CI fail on v1.16 "downgrade" rather than main "upgrade" part of test:

On my test branch, I made some changes that would require the Delete to hold the build mutex, as well as cancelling a regenerate if the endpoint was stopped while the endpoint was waiting for the compilation rlock: 77ebe07

What I observed on CI was:

With the change, the e2e-upgrade tests still fail, but they fail on the downgrade portion of the test (i.e. on the v1.16-rc image).
Without the change, the e2e-upgrade tests fail, but they do so on the upgrade step (i.e. 1.17 or "main" branch)

This tells me that this problem is at least partly responsible for this failure, I'm not 100% sure that adding the buildMutex lock to Delete is a safe thing to do (todo....), but the logic here probably needs some improvement.

However, I don't know why the locks are being held so long, in particular in this test tuple. This does seem specific to test case '7', where I noticed that there is an extra Reinitialize() 0call in the logs where the test fails. This may possibly be related to the device-reloader issues I mentioned in previous comments?

tommyp1ckles · 2024-07-23T06:17:05Z

@borkmann

(One more note, the failure is happening in 1.16.y branch, not in main.. 1.16 -> 1.17 -> 1.16, I'll cleanup the branch, get it merged and retest once it landed in 1.16 to see if it got fixed.)

From my testing, this appears to happen in both, when I did something in v1.17 to alleviate the problem it will then fail in v1.16. Maybe one of the changes on your branch fixed the problem in v1.17?

borkmann · 2024-07-23T08:19:06Z

[...]

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

An endpoint that is being regenerated will eventually need to acquire the compilation lock, which is a rw-lock that can also be held during a "Reinitialize".

As mentioned, during the failure, the Reinitialize call that holds the write lock always seems to happen right before the lxc_health endpoint is created/generated, so the health endpoint has to wait ~2:30 for this lock to be released before proceeding.

During this time, if a call to (*Endpoint).Delete is called (i.e. from the endpoint cleanup, as a result of the ping timeout (3minutes) in the health-endpoint controller) is made while the endpoint is waiting for the compilation lock then it will continue regeneratingBPF once the rlock is acquired. This is likely why we're seeing the "lxc_health link not found" errors.

This can happen because although endpoint regenerate holds the "build" mutex, the Delete procedure does not. It only holds the regular Endpoint lock - which is actually not held by the regenerateBPF section while handling an Endpoint regenerate. So there's nothing stopping a Endpoint.Delete() from happening while the endpoint is in the regenerateBPF section (i.e. such as waiting for the compilation rlock).

Regarding regenerateBPF we do seem to check whether and endpoint is alive here, potentially this might be too late / after trying to attach:

	stats.waitingForLock.Start()
	err = e.lockAlive()
	stats.waitingForLock.End(err == nil)
	if err != nil {
		return 0, err
	}
	defer e.unlock()

I presume from CNI side this might never happen since the lxc devices have somewhat unique name and stick around such that collision doesn't happen when CNI plugin gets Del/Add requests. The endpoint deletion sets the endpoint in disconnecting state such that lockAlive should error.

borkmann · 2024-07-23T08:52:00Z

Maybe one of the changes on your branch fixed the problem in v1.17?

It improves the timeout in that we reping right after relaunching the health endpoint. I suspect this mitigates the situation somewhat.

…nish Fixes: #32689 Signed-off-by: Daniel Borkmann <[email protected]>

…nish Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge for helping with debugging! Fixes: #32689 Signed-off-by: Daniel Borkmann <[email protected]>

tommyp1ckles · 2024-07-23T15:42:40Z

Regarding regenerateBPF we do seem to check whether and endpoint is alive here, potentially this might be too late / after trying to attach:

yeah I believe the error is bubbled up from here, prior to the lock check.

borkmann · 2024-07-23T17:25:55Z

Agree, I think the way to go would be to hold the ep build mutex upon Delete() (here). This ensures any pending build would complete, and subsequent build after Delete() finished sees ep in Disconnecting and bails out right away.

…nish Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: #32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: cilium#32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Joe Stringer <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in cilium#32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: cilium#32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Joe Stringer <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: #32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Joe Stringer <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Joe Stringer <[email protected]>

Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: cilium#32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish Endpoint deletions can race with pending builds as detailed in cilium#32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: cilium#32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: cilium#32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in cilium#32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: cilium#32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: cilium#32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in cilium#32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: cilium#32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: cilium#32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in cilium#32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: cilium#32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: #32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

[ upstream commit da05633 ] Currently on shutdown we don't remove the health endpoint from the endpoint manager but do remove the corresponding netlink interface. Note that validateEndpoint() special cases the health endpoint and does remove the old state directory if it existed before. But independent of that it would be better to just streamline cleanup via cleanupHealthEndpoint() for all cases. Related: #32689 Signed-off-by: Robin Gögge <[email protected]> Co-developed-by: Daniel Borkmann <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

…nish [ upstream commit f59fe4e ] Endpoint deletions can race with pending builds as detailed in #32689. This is due to the fact that the Delete() call does not wait for them to finish. Instead, it tears down the endpoint, downs the device, removes tc(x), etc. This is not correct since the inflight from regenerate() can later still attempt to attach something to the given device. This in itself is not problematic as the CNI will just remove the device and the kernel removes all objects along with it. However, this can result in sporadic agent error logs that a given lxc* device has already disappeared when the agent tries to attach, but Delete() resp the CNI Del finished. We've seen this in particular in case of lxc_health. The health endpoint is not managed by the CNI plugin, but the agent instead. What appears to happen is that when health endpoint is cleaned, it's removed from the ep manager, and subsequently the device is removed.. however any ongoing regenerate() later tries to attach to a non-existing device. Given the Delete() updates the endpoint state, we also need to wrap this with the build mutex lock. This is to wait for any pending regenerate() to finish before setting the new ep state, and any new build requests will bail out early as they see that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge! Fixes: #32689 Co-developed-by: Tom Hadlaw <[email protected]> Signed-off-by: Tom Hadlaw <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]>

giorio94 added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels May 23, 2024

harsimran-pabla mentioned this issue May 24, 2024

bgpv2: removing v2Enable feature flag #32692

Merged

learnitall assigned rgo3 May 30, 2024

jschwinger233 mentioned this issue May 31, 2024

ipsec: minor cleanups #31390

Merged

ysksuzuki mentioned this issue Jun 5, 2024

ci: test egress-gateway-with-l7-policy #32898

Closed

YutaroHayakawa mentioned this issue Jun 6, 2024

helm: Decouple sysctlfix from cgroup.autoMount #32866

Merged

borkmann added the release-blocker/1.16 This issue will prevent the release of the next version of Cilium. label Jun 6, 2024

borkmann self-assigned this Jun 6, 2024

borkmann mentioned this issue Jun 6, 2024

cilium: netkit support #32429

Merged

gentoo-root mentioned this issue Jun 7, 2024

.github: No need to install Docker in complexity-test manually anymore #32067

Merged

jschwinger233 mentioned this issue Jun 10, 2024

ci: Add IPsec leak detection for ci-ipsec-e2e #32930

Merged

YutaroHayakawa mentioned this issue Jun 11, 2024

bgpv2: Allow empty advertisement #32997

Merged

rastislavs mentioned this issue Jun 11, 2024

bgpv2: Remove node selector check from v2 PodCIDRReconciler #33043

Merged

julianwiedmann added a commit that referenced this issue Jun 12, 2024

gh: e2e-upgrade: disable config 7

6084376

The config is reliably failing [0]. Stabilize the workflow so that we can make it required. [0] #32689 Signed-off-by: Julian Wiedmann <[email protected]>

julianwiedmann mentioned this issue Jun 12, 2024

gh: e2e-upgrade: disable config 7 #33096

Merged

github-merge-queue bot pushed a commit that referenced this issue Jun 13, 2024

gh: e2e-upgrade: disable config 7

3f8585a

The config is reliably failing [0]. Stabilize the workflow so that we can make it required. [0] #32689 Signed-off-by: Julian Wiedmann <[email protected]>

joestringer added this to Release blockers Jun 20, 2024

github-project-automation bot moved this to Proposed in Release blockers Jun 20, 2024

joestringer moved this from Proposed to Assigned in Release blockers Jun 20, 2024

borkmann added a commit that referenced this issue Jul 23, 2024

cilium: Extend endpoint deletion to wait for any ongoing builds to fi…

87066ef

…nish Fixes: #32689 Signed-off-by: Daniel Borkmann <[email protected]>

borkmann mentioned this issue Jul 23, 2024

cilium: Small health cleanup improvements #33700

Merged

joestringer closed this as completed in #33700 Jul 23, 2024

github-project-automation bot moved this from Active to Done in Release blockers Jul 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

giorio94 commented May 23, 2024

gandro commented May 27, 2024

learnitall commented May 29, 2024

joamaki commented May 30, 2024 •

edited

Loading

learnitall commented May 30, 2024

youngnick commented May 31, 2024

tommyp1ckles commented May 31, 2024

rgo3 commented May 31, 2024

joestringer commented Jun 4, 2024

borkmann commented Jun 6, 2024 •

edited

Loading

thorn3r commented Jun 19, 2024

tommyp1ckles commented Jul 23, 2024

tommyp1ckles commented Jul 23, 2024 •

edited

Loading

borkmann commented Jul 23, 2024

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

borkmann commented Jul 23, 2024

tommyp1ckles commented Jul 23, 2024

borkmann commented Jul 23, 2024

CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

Comments

giorio94 commented May 23, 2024

CI failure

gandro commented May 27, 2024

learnitall commented May 29, 2024

joamaki commented May 30, 2024 • edited Loading

learnitall commented May 30, 2024

youngnick commented May 31, 2024

tommyp1ckles commented May 31, 2024

rgo3 commented May 31, 2024

joestringer commented Jun 4, 2024

borkmann commented Jun 6, 2024 • edited Loading

thorn3r commented Jun 19, 2024

tommyp1ckles commented Jul 23, 2024

1. The Compilation Lock is being held for a long time...

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

3. Fixing no. 2 makes CI fail on v1.16 "downgrade" rather than main "upgrade" part of test:

tommyp1ckles commented Jul 23, 2024 • edited Loading

borkmann commented Jul 23, 2024

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

borkmann commented Jul 23, 2024

tommyp1ckles commented Jul 23, 2024

borkmann commented Jul 23, 2024

joamaki commented May 30, 2024 •

edited

Loading

borkmann commented Jun 6, 2024 •

edited

Loading

tommyp1ckles commented Jul 23, 2024 •

edited

Loading