CI: Cilium E2E Upgrade - no-errors-in-logs - retrieving device lxc_health: Link not found #32689

Closed
giorio94 opened this issue May 23, 2024 · 37 comments · Fixed by #33700
Assignees
Labels
area/CI: Continuous Integration testing issue or flake
ci/flake: This is a known failure that occurs in the tree. Please investigate me!
release-blocker/1.16: This issue will prevent the release of the next version of Cilium.
severity/low: Impact is limited, only affecting unusual configs, or all configs with low impact in prod.
Milestone

Comments

@giorio94
Member

CI failure

Hit on #32688
Link: https://github.com/cilium/cilium/actions/runs/9209728129/job/25335026937
Sysdump: cilium-sysdump-7-final.zip

❌ Found 2 logs in kind-kind/kube-system/cilium-d2588 (cilium-agent) matching list of errors that must be investigated:
time="2024-05-23T14:34:10Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-23T14:34:10Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2584 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.0.12 ipv6="fd00:10:244::1cd5" k8sPodName=/ subsys=endpoint (1 occurrences)
@giorio94 added the area/CI and ci/flake labels May 23, 2024
@gandro
Member

gandro commented May 27, 2024

Happening very frequently on #32725 too

@learnitall
Contributor

It looks like this has been consistently failing since May 22nd, with no failures before that. Since May 22nd, here are the failure counts by matrix entry:

| Job Matrix Entry | Kernel   | Num Failures | Link |
|------------------|----------|--------------|------|
| 7                | bpf-next | 25           | link |
| 6                | 6.1      | 1            | link |
| 14               | 5.14     | 1            | link |


@joamaki
Contributor

joamaki commented May 30, 2024

I checked through the scheduled tests on main and the first one I found was https://github.com/cilium/cilium/actions/runs/9195482429/job/25291405008. This is right when #32518 was merged, which is the last change to the place the "retrieving device" log message comes from. @ti-mo or @lmb could you check if this might be the source of this issue?
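
For reference, a minimal sketch of how such an error string can be produced when looking up the interface via netlink; the helper name and error wrapping below are my assumptions for illustration, not the actual loader code:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink"
)

// retrieveDevice is a hypothetical stand-in for the loader-side lookup,
// not Cilium's actual code; it only shows how a netlink lookup of a
// deleted interface yields an error of the shape seen in the CI logs.
func retrieveDevice(name string) (netlink.Link, error) {
	link, err := netlink.LinkByName(name)
	if err != nil {
		return nil, fmt.Errorf("retrieving device %s: %w", name, err)
	}
	return link, nil
}

func main() {
	if _, err := retrieveDevice("lxc_health"); err != nil {
		// With the veth already deleted, this prints something like:
		// retrieving device lxc_health: Link not found
		fmt.Println(err)
	}
}
```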

@learnitall
Contributor

@rgo3 I heard you were taking a look so I added you as an assignee, just a heads up

@youngnick
Contributor

Hit on #32744
Link: https://github.com/cilium/cilium/actions/runs/9298554153/job/25590680316
Sysdump: cilium-sysdump-7-final.zip

 ❌ Found 2 logs in kind-kind/kube-system/cilium-ddg8l (cilium-agent) matching list of errors that must be investigated:
time="2024-05-30T07:29:45Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)
time="2024-05-30T07:29:45Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1552 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.210 ipv6="fd00:10:244:3::920d" k8sPodName=/ subsys=endpoint (1 occurrences)

@tommyp1ckles
Contributor

@rgo3
Contributor

rgo3 commented May 31, 2024

So from my initial investigation yesterday, I would say #32518 is what caused this error to show up in the error logs, because the log level changed from Warn to Error, but it is not the underlying cause of why we can't find lxc_health. Jussi already suspected that in the offline discussion.

From looking at the logs of a sysdump where it happened, I can see some sort of race between the loader and the health endpoint initialization:

6103:2024-05-30T19:06:25.246590173Z time="2024-05-30T19:06:25Z" level=debug msg="Killing old health endpoint process" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6104:2024-05-30T19:06:25.247099202Z time="2024-05-30T19:06:25Z" level=debug msg="Killed endpoint process" pid=1037 pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6368:2024-05-30T19:06:26.497522773Z time="2024-05-30T19:06:26Z" level=error msg="Error while reloading endpoint BPF program" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6380:2024-05-30T19:06:26.497697378Z time="2024-05-30T19:06:26Z" level=info msg="generating BPF for endpoint failed, keeping stale directory" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" file-path=2033_next_fail identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6384:2024-05-30T19:06:26.497977680Z time="2024-05-30T19:06:26Z" level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=899.216455ms bpfWaitForELF="26.008µs" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ mapSync=9.941627ms policyCalculation="900.128µs" prepareBuild=7.318288ms proxyConfiguration="11.18µs" proxyPolicyCalculation="167.333µs" proxyWaitForAck=0s reason="updated security labels" subsys=endpoint total=1m51.825101738s waitingForCTClean=59.918287ms waitingForLock=1m50.690587536s
6385:2024-05-30T19:06:26.497982489Z time="2024-05-30T19:06:26Z" level=debug msg="Error regenerating endpoint: retrieving device lxc_health: Link not found" ciliumEndpointName=/ code=Failure containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 endpointState=ready identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ policyRevision=0 subsys=endpoint type=200
6400:2024-05-30T19:06:26.510255562Z time="2024-05-30T19:06:26Z" level=error msg="endpoint regeneration failed" ciliumEndpointName=/ containerID= containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=2033 error="retrieving device lxc_health: Link not found" identity=4 ipv4=10.244.3.192 ipv6="fd00:10:244:3::fa95" k8sPodName=/ subsys=endpoint
6444:2024-05-30T19:06:26.585168967Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=cilium_health
6445:2024-05-30T19:06:26.585171281Z time="2024-05-30T19:06:26Z" level=debug msg="Didn't find existing device" error="Link not found" subsys=cilium-health-launcher veth=lxc_health
6446:2024-05-30T19:06:26.601161281Z time="2024-05-30T19:06:26Z" level=debug msg="Created veth pair" subsys=endpoint-connector vethPair="[cilium lxc_health]"
6463:2024-05-30T19:06:26.633890277Z time="2024-05-30T19:06:26Z" level=debug msg="Spawning health endpoint with command \"cilium-health-responder\" [\"--listen\" \"4240\" \"--pidfile\" \"/var/run/cilium/state/health-endpoint.pid\"]" subsys=cilium-health-launcher
6507:2024-05-30T19:06:27.687842061Z time="2024-05-30T19:06:27Z" level=debug msg="cilium-health agent running" pidfile=/var/run/cilium/state/health-endpoint.pid subsys=cilium-health-launcher
6508:2024-05-30T19:06:27.687921610Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 10.244.3.45/32 dev cilium]\"" subsys=cilium-health-launcher
6509:2024-05-30T19:06:27.720737217Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [route add 0.0.0.0/0 via 10.244.3.45 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6515:2024-05-30T19:06:27.812147656Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add fd00:10:244:3::29a9/128 dev cilium]\"" subsys=cilium-health-launcher
6519:2024-05-30T19:06:27.848067069Z time="2024-05-30T19:06:27Z" level=debug msg="Running \"ip [-6 route add ::/0 via fd00:10:244:3::29a9 mtu 1450 dev cilium]\"" subsys=cilium-health-launcher
6726:2024-05-30T19:06:29.196394945Z time="2024-05-30T19:06:29Z" level=info msg="Program cil_from_container attached to device lxc_health using tcx" subsys=datapath-loader

The first two log lines are from the health endpoint (re)initialization, cleaning up the old endpoint after the upgrade. Then we see logs from the loader trying to load the datapath for the health endpoint, but failing because the netlink device has been deleted. After that, we successfully recreate the health endpoint (see "Created veth pair", "Spawning health endpoint ..."), after which the loader successfully attaches the datapath ("... attached to device lxc_health using tcx").

What's not entirely clear to me is why the loader is trying to attach to the health endpoint in the first place. I'm guessing that the cleanup code for the health endpoint is not removing the endpoint from the endpoint manager (or something along these lines) on shutdown but I think that should happen when the agent shuts down.
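
To make that hypothesis concrete, here is a small, self-contained sketch; every name in it is made up for illustration and none of it is Cilium's actual endpoint-manager code, it only models the suspected ordering:

```go
package main

import "fmt"

// Purely illustrative sketch of the suspected ordering, with made-up names:
// the health endpoint relaunch deletes the lxc_health veth, but the old
// endpoint stays registered, so a regeneration that is still queued for it
// runs against a missing device before the veth is recreated.
type node struct {
	devices   map[string]bool // netlink devices that currently exist
	endpoints map[string]bool // endpoints known to the endpoint manager
}

func (n *node) regenerate(dev string) error {
	if !n.endpoints[dev] {
		return nil // endpoint no longer managed, nothing to do
	}
	if !n.devices[dev] {
		return fmt.Errorf("retrieving device %s: Link not found", dev)
	}
	return nil // load and attach the datapath
}

func main() {
	n := &node{
		devices:   map[string]bool{"lxc_health": true},
		endpoints: map[string]bool{"lxc_health": true},
	}

	// Health endpoint relaunch: responder killed, veth deleted ...
	delete(n.devices, "lxc_health")
	// ... but the old endpoint is not removed from the manager.

	// A regeneration queued before the relaunch now proceeds and fails:
	fmt.Println(n.regenerate("lxc_health"))

	// The veth is then recreated and the next attach succeeds, matching
	// the later "attached to device lxc_health using tcx" log line.
	n.devices["lxc_health"] = true
	fmt.Println(n.regenerate("lxc_health"))
}
```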

@joestringer
Member

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

- name: '7'
  # renovate: datasource=docker depName=quay.io/lvh-images/kind
  kernel: 'bpf-next-20240529.013128'
  kube-proxy: 'none'
  kpr: 'true'
  devices: '{eth0,eth1}'
  secondary-network: 'true'
  tunnel: 'disabled'
  lb-mode: 'snat'
  egress-gateway: 'true'
  lb-acceleration: 'testing-only'

@borkmann
Member

borkmann commented Jun 6, 2024

I'm also wondering why the test fails disproportionately with the following configuration of Cilium:

With tcx we switched all attachments to BPF links, which should not be the case for <6.4-ish kernels. We can make this a release blocker for the final 1.16.0 and I'll help Robin with the investigation + fix in the meantime.

@borkmann added the release-blocker/1.16 label Jun 6, 2024
@borkmann self-assigned this Jun 6, 2024
julianwiedmann added a commit that referenced this issue Jun 12, 2024
The config is reliably failing [0]. Stabilize the workflow so that we can
make it required.

[0] #32689
Signed-off-by: Julian Wiedmann <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jun 13, 2024
@thorn3r
Contributor

thorn3r commented Jun 19, 2024

@joestringer moved this from Proposed to Assigned in Release blockers Jun 20, 2024
@tommyp1ckles
Contributor

As far as I can tell, there are a couple of things going on here:

1. The Compilation Lock is being held for a long time...

It appears as though, in the affected test tuple, the compilation write lock is being held for quite a long time during the Reinitialize calls (around 2m30s, judging from the sysdump logs).

It's unclear to me why exactly this takes so long (todo: need to narrow down where the Reinitialize is spending most of its time), but from an initial look the following two procedures appear to be the slowest parts of the re-initialization (both taking ~30-40 seconds):

todo: need to narrow down specifically what's going on in this reinitialize, and how this differs from the passing test runs.

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

An endpoint that is being regenerated will eventually need to acquire the compilation lock, which is a rw-lock that can also be held during a "Reinitialize".

As mentioned, during the failure the Reinitialize call that holds the write lock always seems to happen right before the lxc_health endpoint is created/generated, so the health endpoint has to wait ~2m30s for this lock to be released before proceeding.

During this time, if (*Endpoint).Delete is called (e.g. from the endpoint cleanup triggered by the 3-minute ping timeout in the health-endpoint controller) while the endpoint is waiting for the compilation lock, the endpoint will still continue into regenerateBPF once the rlock is acquired. This is likely why we're seeing the "lxc_health link not found" errors.

This can happen because, although an endpoint regenerate holds the "build" mutex, the Delete procedure does not. It only holds the regular Endpoint lock, which is not held by the regenerateBPF section while handling an Endpoint regenerate. So there's nothing stopping an Endpoint.Delete() from happening while the endpoint is in the regenerateBPF section (e.g. waiting for the compilation rlock). A minimal sketch of this locking gap follows at the end of this comment.

3. Fixing no. 2 makes CI fail on the v1.16 "downgrade" part of the test rather than the main "upgrade" part:

On my test branch, I made some changes that require Delete to hold the build mutex, as well as cancelling a regenerate if the endpoint was stopped while waiting for the compilation rlock: 77ebe07

What I observed on CI was:

  • With the change, the e2e-upgrade tests still fail, but they fail on the downgrade portion of the test (i.e. on the v1.16-rc image).
  • Without the change, the e2e-upgrade tests fail, but they do so on the upgrade step (i.e. 1.17 or "main" branch)

This tells me that this problem is at least partly responsible for the failure. I'm not 100% sure that adding the buildMutex lock to Delete is a safe thing to do (todo....), but the logic here probably needs some improvement.

However, I don't know why the locks are being held for so long, in particular in this test tuple. This does seem specific to test case '7', where I noticed that there is an extra Reinitialize() call in the logs where the test fails. This may be related to the device-reloader issues I mentioned in previous comments?
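
For illustration, a minimal, self-contained sketch of the locking gap from point 2 above; the lock names mirror the discussion, but this is a hypothetical model, not Cilium's real endpoint code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Hypothetical endpoint with the two locks discussed above: buildMutex is
// held by regenerate(), but Delete() only takes the regular endpoint lock,
// so it can tear the device down while regenerate() is still waiting for
// the compilation lock.
type endpoint struct {
	mu         sync.Mutex // "regular" endpoint lock
	buildMutex sync.Mutex // held during builds/regeneration
	deviceUp   bool
}

var compilationLock sync.RWMutex // held for writing during Reinitialize()

func (e *endpoint) regenerate() error {
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	// The endpoint waits here while Reinitialize() holds the write lock
	// (~2m30s in the failing runs).
	compilationLock.RLock()
	defer compilationLock.RUnlock()

	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.deviceUp {
		return fmt.Errorf("retrieving device lxc_health: Link not found")
	}
	return nil // load and attach the datapath
}

// Delete does NOT take buildMutex, so it can run while regenerate() is
// still blocked on the compilation lock above.
func (e *endpoint) Delete() {
	e.mu.Lock()
	e.deviceUp = false // veth removed
	e.mu.Unlock()
}

func main() {
	ep := &endpoint{deviceUp: true}

	compilationLock.Lock() // a long Reinitialize() is in progress
	done := make(chan error, 1)
	go func() { done <- ep.regenerate() }()

	time.Sleep(100 * time.Millisecond)
	ep.Delete() // health endpoint cleanup fires while regenerate() waits

	compilationLock.Unlock() // Reinitialize() finishes
	fmt.Println(<-done)      // regeneration now fails: the device is gone
}
```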

@tommyp1ckles
Contributor

tommyp1ckles commented Jul 23, 2024

@borkmann

(One more note, the failure is happening in 1.16.y branch, not in main.. 1.16 -> 1.17 -> 1.16, I'll cleanup the branch, get it merged and retest once it landed in 1.16 to see if it got fixed.)

From my testing, this appears to happen in both: when I did something in v1.17 to alleviate the problem, it then failed in v1.16. Maybe one of the changes on your branch fixed the problem in v1.17?

@borkmann
Member

[...]

2. Endpoints waiting for the datapath lock will continue regenerating after delete:

An endpoint that is being regenerated will eventually need to acquire the compilation lock, which is a rw-lock that can also be held during a "Reinitialize".

As mentioned, during the failure, the Reinitialize call that holds the write lock always seems to happen right before the lxc_health endpoint is created/generated, so the health endpoint has to wait ~2:30 for this lock to be released before proceeding.

During this time, if a call to (*Endpoint).Delete is called (i.e. from the endpoint cleanup, as a result of the ping timeout (3minutes) in the health-endpoint controller) is made while the endpoint is waiting for the compilation lock then it will continue regeneratingBPF once the rlock is acquired. This is likely why we're seeing the "lxc_health link not found" errors.

This can happen because although endpoint regenerate holds the "build" mutex, the Delete procedure does not. It only holds the regular Endpoint lock - which is actually not held by the regenerateBPF section while handling an Endpoint regenerate. So there's nothing stopping a Endpoint.Delete() from happening while the endpoint is in the regenerateBPF section (i.e. such as waiting for the compilation rlock).

Regarding regenerateBPF, we do seem to check whether an endpoint is alive here; potentially this might be too late, i.e. after trying to attach:

	stats.waitingForLock.Start()
	err = e.lockAlive()
	stats.waitingForLock.End(err == nil)
	if err != nil {
		return 0, err
	}
	defer e.unlock()

I presume from the CNI side this might never happen, since the lxc devices have somewhat unique names and stick around, such that a collision doesn't happen when the CNI plugin gets Del/Add requests. The endpoint deletion sets the endpoint into the disconnecting state, such that lockAlive should error.

@borkmann
Member

Maybe one of the changes on your branch fixed the problem in v1.17?

It improves the timeout in that we reping right after relaunching the health endpoint. I suspect this mitigates the situation somewhat.

borkmann added a commit that referenced this issue Jul 23, 2024
borkmann added a commit that referenced this issue Jul 23, 2024
…nish

Endpoint deletions can race with pending builds as detailed in #32689.
This is due to the fact that the Delete() call does not wait for them
to finish. Instead, it tears down the endpoint, downs the device, removes
tc(x), etc. This is not correct since the inflight from regenerate() can
later still attempt to attach something to the given device. This in itself
is not problematic as the CNI will just remove the device and the kernel
removes all objects along with it. However, this can result in sporadic
agent error logs that a given lxc* device has already disappeared when
the agent tries to attach, but Delete() resp the CNI Del finished. We've
seen this in particular in case of lxc_health. The health endpoint is not
managed by the CNI plugin, but the agent instead. What appears to happen
is that when health endpoint is cleaned, it's removed from the ep manager,
and subsequently the device is removed.. however any ongoing regenerate()
later tries to attach to a non-existing device. Given the Delete() updates
the endpoint state, we also need to wrap this with the build mutex lock.
This is to wait for any pending regenerate() to finish before setting the
new ep state, and any new build requests will bail out early as they see
that the endpoint is disconnecting.

Kudos to Tom Hadlaw and Robin Goegge for helping with debugging!

Fixes: #32689
Signed-off-by: Daniel Borkmann <[email protected]>
@tommyp1ckles
Contributor

Regarding regenerateBPF, we do seem to check whether an endpoint is alive here; potentially this might be too late, i.e. after trying to attach:

yeah I believe the error is bubbled up from here, prior to the lock check.

@borkmann
Member

Agree, I think the way to go would be to hold the ep build mutex upon Delete() (here). This ensures any pending build completes, and a subsequent build after Delete() finishes sees the ep in Disconnecting and bails out right away.
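
A hedged sketch of that direction, again with hypothetical names rather than the actual patch: Delete() takes the build mutex so it waits for any in-flight build, and a build that starts afterwards sees the disconnecting state and bails out before touching the device:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Hypothetical sketch of the proposed direction (not the actual patch):
// Delete() takes buildMutex, so it waits for any in-flight build to finish,
// and builds that start afterwards see the disconnecting state and bail out
// instead of touching a device that no longer exists.
type endpoint struct {
	mu            sync.Mutex // "regular" endpoint lock
	buildMutex    sync.Mutex // held during builds/regeneration
	disconnecting bool
}

var errDisconnecting = errors.New("endpoint is disconnecting")

// lockAlive stands in for the real check quoted above: it fails once the
// endpoint has been marked as disconnecting by Delete().
func (e *endpoint) lockAlive() error {
	e.mu.Lock()
	if e.disconnecting {
		e.mu.Unlock()
		return errDisconnecting
	}
	return nil
}

func (e *endpoint) unlock() { e.mu.Unlock() }

func (e *endpoint) regenerate() error {
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	if err := e.lockAlive(); err != nil {
		return err // bail out early, no attach is attempted
	}
	defer e.unlock()
	// ... load and attach the datapath here ...
	return nil
}

// Delete waits for pending builds by taking buildMutex before flipping the
// endpoint into the disconnecting state and removing the device.
func (e *endpoint) Delete() {
	e.buildMutex.Lock()
	defer e.buildMutex.Unlock()

	e.mu.Lock()
	e.disconnecting = true
	e.mu.Unlock()
	// ... down the device, remove tc(x), etc. ...
}

func main() {
	ep := &endpoint{}
	ep.Delete()
	fmt.Println(ep.regenerate()) // endpoint is disconnecting
}
```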

borkmann added a commit that referenced this issue Jul 23, 2024
…nish

Endpoint deletions can race with pending builds as detailed in #32689.
This is due to the fact that the Delete() call does not wait for them
to finish. Instead, it tears down the endpoint, downs the device, removes
tc(x), etc. This is not correct since the inflight from regenerate() can
later still attempt to attach something to the given device. This in itself
is not problematic as the CNI will just remove the device and the kernel
removes all objects along with it. However, this can result in sporadic
agent error logs that a given lxc* device has already disappeared when
the agent tries to attach, but Delete() resp the CNI Del finished. We've
seen this in particular in case of lxc_health. The health endpoint is not
managed by the CNI plugin, but the agent instead. What appears to happen
is that when health endpoint is cleaned, it's removed from the ep manager,
and subsequently the device is removed.. however any ongoing regenerate()
later tries to attach to a non-existing device. Given the Delete() updates
the endpoint state, we also need to wrap this with the build mutex lock.
This is to wait for any pending regenerate() to finish before setting the
new ep state, and any new build requests will bail out early as they see
that the endpoint is disconnecting. Kudos to Tom Hadlaw and Robin Goegge!

Fixes: #32689
Co-developed-by: Tom Hadlaw <[email protected]>
Signed-off-by: Tom Hadlaw <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jul 23, 2024
Currently on shutdown we don't remove the health endpoint from the
endpoint manager but do remove the corresponding netlink interface.
Note that validateEndpoint() special cases the health endpoint and
does remove the old state directory if it existed before. But
independent of that it would be better to just streamline cleanup
via cleanupHealthEndpoint() for all cases.

Related: #32689
Signed-off-by: Robin Gögge <[email protected]>
Co-developed-by: Daniel Borkmann <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Jul 23, 2024
github-project-automation bot moved this from Active to Done in Release blockers Jul 23, 2024
joestringer pushed a commit to joestringer/cilium that referenced this issue Jul 24, 2024
joestringer pushed a commit to joestringer/cilium that referenced this issue Jul 24, 2024
joestringer pushed a commit that referenced this issue Jul 24, 2024
joestringer pushed a commit that referenced this issue Jul 24, 2024
Kaczyniec pushed a commit to Kaczyniec/cilium that referenced this issue Aug 5, 2024
Kaczyniec pushed a commit to Kaczyniec/cilium that referenced this issue Aug 5, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit to jrajahalme/cilium that referenced this issue Oct 30, 2024
jrajahalme pushed a commit that referenced this issue Oct 31, 2024
jrajahalme pushed a commit that referenced this issue Oct 31, 2024
jrajahalme pushed a commit that referenced this issue Nov 6, 2024
jrajahalme pushed a commit that referenced this issue Nov 6, 2024