OTBR Addon Crashing #132124

Closed
josh-blake opened this issue Dec 3, 2024 · 49 comments · Fixed by home-assistant/supervisor#5547

Comments

@josh-blake

josh-blake commented Dec 3, 2024

The problem

OTBR Addon fails to load with platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted

What version of Home Assistant Core has the issue?

2024.11.3

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

OpenThread Border Router Addon

Link to integration documentation on our website

No response

Diagnostics information

OTBR Addon Crashes during bootup. Looks like a permissions error. This is new behaviour that I have only noticed in the past week or so.
core_openthread_border_router_2024-12-03T00-45-25.316Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Snapshot from the logs of relevance.

[NOTE]-AGENT---: Running 0.3.0-b041fa52-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///dev/ttyUSB1?uart-baudrate=460800&uart-flow-control
[NOTE]-AGENT---: Radio URL: trel://enp4s0
[NOTE]-ILS-----: Infra link selected: enp4s0
49d.17:04:27.890 [C] P-SpinelDrive-: Software reset co-processor successfully
49d.17:04:27.911 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted
[11:44:45] WARNING: otbr-agent exited with code 5 (by signal 0).
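
For anyone triaging a similar log, a minimal sketch of a check for this exact failure signature might look like the following. The function name `log_has_tun_error` is hypothetical, and the sample line is the one from the snapshot in this report.

```shell
#!/bin/sh
# Sketch: detect the tun permission failure reported in this issue.
# log_has_tun_error is a hypothetical helper name, not part of any tool.

# Returns 0 when the given log file contains the failure line.
log_has_tun_error() {
    grep -q 'platformConfigureTunDevice().*Operation not permitted' "$1"
}

# Demo on the critical line quoted in this issue.
sample="$(mktemp)"
printf '%s\n' '49d.17:04:27.911 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted' > "$sample"
if log_has_tun_error "$sample"; then
    echo "tun permission failure found"
fi
rm -f "$sample"
```

Pointing the function at a downloaded add-on log file should work the same way.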

Additional information

No response

josh-blake changed the title from "OTBR-Agent Crashing" to "OTBR Addon Crashing" on Dec 3, 2024

@kelddamsbo

Have the exact same problem, also on Supervised

@bieskholodov

bieskholodov commented Dec 3, 2024

Confirmed: the problem only occurs on Supervised. Docker and OS installations are working.

@josh-blake
Author

josh-blake commented Dec 3, 2024

Agreed with this - although I'm not sure that it's on the Supervised side. I spun up a stock standard openthread/otbr Docker image with the hassio network_driver and it also worked flawlessly so I'm not convinced it's a permissions thing on the Supervised side of things but rather with ingress on the most recent homeassistant/otbr image.

I lied - I restored version 2.9.0 of homeassistant/otbr from a backup and it's producing the same error so I'm now convinced it's related to new permissions / ingress requirements at the system level.

@kelddamsbo

It had been running fine on my Supervised install, but it suddenly stopped working because of this permission error. I don't know whether an update to HA or something else caused it.

@darkxst
Contributor

darkxst commented Dec 4, 2024

For permission error on supervised, try adding your user to the netdev group.

@Wheemer

Wheemer commented Dec 5, 2024

For permission error on supervised, try adding your user to the netdev group.

I just tried that from underlying debian and seems to be no change.

@kelddamsbo

kelddamsbo commented Dec 5, 2024

I have added myself to netdev in Home Assistant terminal:
sudo adduser myname netdev
When I reboot, it's gone again. Can I make it permanent?
Without reboot, I still get the same error:
[C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted

Which user starts the add-on?

@bieskholodov

For permission error on supervised, try adding your user to the netdev group.

It doesn't work

@darkxst
Contributor

darkxst commented Dec 6, 2024

I'm not sure what could be causing this then. I do have a Supervised HA box, but it does not suffer from this permission error.

@nyok92

nyok92 commented Dec 6, 2024

@darkxst
Hi, what OS is that box running? I have Armbian/bookworm on armv8.

@darkxst
Contributor

darkxst commented Dec 6, 2024

Debian 12/Bookworm (on x86_64)

@josh-blake
Author

The netdev group does not exist in the Addon container. You can verify this by running sudo docker run -it --entrypoint bash homeassistant/amd64-addon-otbr:2.12.2 and cat-ing the /etc/group file. The netdev group has not existed on previous containers either so this idea of netdev group is a red herring.

I am not sure whether anything has changed in the AppArmor profiles or between dist updates (Debian moved to 12.8 on Nov. 9), which roughly coincides with when the addon stopped working.

For reference, I am running debian 12.8 (6.1.0-22-amd64); supervised.

I have tried reinstalling Supervised, re-running the Docker install script, and re-pulling both the HA Core and OTBR containers, all to no avail. I can confirm that the stock OpenThread/OTBR image spins up just fine and speaks freely between HA and my SkyConnect. I would use this setup except for the Thread Network persistence issue between container restarts.

@kelddamsbo

49d.17:36:05.950 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted
When I run dmesg on the underlying system (Raspberry Pi OS), I get this after I try to start OTBR:
[ 1993.262078] cp210x ttyUSB0: failed set request 0x12 status: -110
The same appears in the HA terminal.

@kelddamsbo

Ahh, that was another error, solved by:
sudo -E rpi-eeprom-config --edit and adding PSU_MAX_CURRENT=5000
and
sudo nano /boot/firmware/config.txt and adding usb_max_current_enable=1 under [all]

@nodamnway

nodamnway commented Dec 8, 2024

This happened to me when I upgraded containerd.io to 1.7.24-1 on Debian 12.

As a temporary solution, I downgraded to containerd.io=1.7.23-1

apt list -a containerd.io # list installed, pick the one with 1.7.23
sudo apt install containerd.io=1.7.23-1
sudo apt-mark hold containerd.io # do not upgrade it, can be skipped
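
The version check implied by this workaround can be sketched as a small script. It assumes a Debian system with `dpkg-query` available and treats only `1.7.24-1` as affected, which is what this thread reports; the helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the installed containerd.io package is the version
# this thread associates with the tun failure. Helper names are hypothetical.

# Returns 0 (true) only for the version reported as broken in this thread.
is_affected_containerd() {
    case "$1" in
        1.7.24-1) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the installed containerd.io version, or nothing when dpkg-query or
# the package is not available.
installed_containerd_version() {
    dpkg-query -W -f='${Version}' containerd.io 2>/dev/null || true
}

version="$(installed_containerd_version)"
if [ -z "$version" ]; then
    echo "containerd.io not found via dpkg-query"
elif is_affected_containerd "$version"; then
    echo "containerd.io $version: affected according to this thread"
else
    echo "containerd.io $version: not the affected version"
fi
```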

@josh-blake
Author

josh-blake commented Dec 8, 2024

Can confirm this as a solution. Just downgraded containerd.io to 1.7.23-1 and it now loads just fine. There are a number of subtle changes on this version however I'm going to bet it's related to the cgroup changes (#10814).

@kelddamsbo

Can also confirm it, this is simply great

@bieskholodov

sudo apt-mark hold containerd.io

Thanks, this solution works

@Sourcer63

I restored previous HA versions and it wasn't working. Downgraded OTBR to 2.11.0 and things were back. Updating HA to 2024.12 leaves OTBR running. So something changed in 2.12.

@stephyra

Hello, I have only just installed HA and the latest version of OTBR. I would like to install version 2.11.0 instead, because I am running into the same problem. I really don't know much about this, and after searching everywhere I still don't know how to force version 2.11.0, given that I have never been on that version. How did you do it? Is there a downloadable file somewhere (I can only find the latest versions...)? Thanks for your help.

@Sourcer63

Hello, I'm replying in English so that the discussion stays understandable for everyone.

First of all, apply the latest HA update, 2024.12.2, then also update OTBR and you should be OK.

Note that OTBR 2.11 wasn't working reliably with either HA 2024.12 or 2024.12.2, so something was broken between HA and OTBR.

I am confident that updating to the latest HA version will solve your issues. If you can't restore a previously working setup, you would need to download old versions of HA and OTBR, install them and start from scratch.

@stephyra

Sorry, I forgot I was reading a translation. Thank you for your help. My versions are already HA 2024.12.2 and OTBR 2.12.2, so I guess my problem is somewhere else... Thank you again.

@Sourcer63

Hhm, what exactly is the issue? Does OTBR start and then stop again? You may want to post the OTBR log for the experts to take a look.

@stephyra

Yes, it starts and then it stops. Maybe my hardware isn't compatible... My OTBR log is:

[12:37:25] INFO: The otbr-web is disabled.
s6-rc: info: service mdns: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service mdns successfully started
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service banner: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
[12:37:27] INFO: Starting mDNS Responder...
Default: mDNSResponder (Engineering Build) (Dec 3 2024 17:53:13) starting

Add-on: OpenThread Border Router
OpenThread Border Router add-on

Add-on version: 2.12.2
You are running the latest version of this add-on.
System: Armbian 24.11.1 bullseye (aarch64 / qemuarm-64)
Home Assistant Core: 2024.12.2
Home Assistant Supervisor: 2024.11.4

Please, share the above information when looking for help
or support in, e.g., GitHub, forums or the Discord chat.

s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[12:37:37] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service otbr-agent: starting
[12:37:43] INFO: Setup OTBR firewall...
[12:37:45] INFO: Starting otbr-agent...
[NOTE]-AGENT---: Running 0.3.0-b041fa52-dirty
[NOTE]-AGENT---: Thread version: 1.3.0
[NOTE]-AGENT---: Thread interface: wpan0
[NOTE]-AGENT---: Radio URL: spinel+hdlc+uart:///dev/ttyUSB0?uart-baudrate=460800&uart-init-deassert
[NOTE]-AGENT---: Radio URL: trel://eth0
[NOTE]-ILS-----: Infra link selected: eth0
49d.21:00:42.034 [C] P-SpinelDrive-: Software reset co-processor successfully
49d.21:00:42.056 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted
[12:37:45] WARNING: otbr-agent exited with code 5 (by signal 0).
Chain OTBR_FORWARD_INGRESS (0 references)
target prot opt source destination
DROP all -- anywhere anywhere PKTTYPE = unicast
DROP all -- anywhere anywhere match-set otbr-ingress-deny-src src
ACCEPT all -- anywhere anywhere match-set otbr-ingress-allow-dst dst
DROP all -- anywhere anywhere PKTTYPE = unicast
ACCEPT all -- anywhere anywhere
otbr-ingress-deny-src
otbr-ingress-deny-src-swap
otbr-ingress-allow-dst
otbr-ingress-allow-dst-swap
Chain OTBR_FORWARD_EGRESS (0 references)
target prot opt source destination
ACCEPT all -- anywhere anywhere
[12:37:45] INFO: OTBR firewall teardown completed.
s6-svlisten1: fatal: /run/s6-rc/servicedirs/otbr-agent failed permanently or its supervisor died
s6-rc: warning: unable to start service otbr-agent: command exited 1
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service universal-silabs-flasher: stopping
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
/run/s6/basedir/scripts/rc.init: fatal: stopping the container.
s6-rc: info: service mdns: stopping
s6-rc: info: service universal-silabs-flasher successfully stopped
Default: mDNSResponder (Engineering Build) (Dec 3 2024 17:53:13) stopping
s6-rc: info: service banner: stopping
s6-rc: info: service banner successfully stopped
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[12:37:45] INFO: mDNS ended with exit code 4 (signal 0)...
s6-rc: info: service mdns successfully stopped

@Sourcer63

Sourcer63 commented Dec 13, 2024

Oh, you know what? It has stopped here too! 🤔

It was working yesterday, after the HA update, as I was able to unlock one of my intelligent window handles.

The only thing I did was to update "Terminal & SSH". So the trouble starts with the HA restart. Manually relaunching OTBR doesn't help. Same errors as yours.

@stephyra

It's the first time I've tried to install all of this, so I thought I was doing something wrong... I'm a bit reassured to know that it's not necessarily my fault 😅

@Wheemer

Wheemer commented Dec 13, 2024

This happened to me when I upgraded containerd.io to 1.7.24-1 on Debian 12.

As a temporary solution, I downgraded to containerd.io=1.7.23-1

apt list -a containerd.io # list installed, pick the one with 1.7.23
sudo apt install containerd.io=1.7.23-1
sudo apt-mark hold containerd.io # do not upgrade it, can be skipped

I just followed these instructions and it seems to be working?

@stephyra

I saw these instructions earlier but was not sure about them... but it seems to be OK! It's the first time I've seen it start, so now I will test it. Thank you both!!

@Sourcer63

I just followed these instructions and it seems to be working?

It does indeed.

I didn't block it from updating though. Let's see how things develop.

@nodamnway

nodamnway commented Dec 13, 2024

Hhm, what exactly is the issue?

49d.21:00:42.056 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted

This is the key, containerd.io somehow enforced permission checking in 1.7.24-1.

I had a similar issue in my other docker-compose service that uses OpenVPN inside, and it can be fixed there with one of the following:

  • make service privileged: true
  • or just add /dev/net/tun to the devices list for the service
devices:
  - /dev/net/tun

This worked for my docker-compose, but unfortunately I couldn't find an easy way to do the same for the OTBR container, so I downgraded to 1.7.23-1 for now, until that is fixed.
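
A quick way to tell whether a container is allowed to open the tun device is to try opening it read-write from inside that container. The following is a minimal sketch; `can_open_tun` is a hypothetical helper, and success here only shows that the open call is permitted, not that OTBR will run.

```shell
#!/bin/sh
# Sketch: test whether the tun device can be opened read-write from the
# current environment. can_open_tun is a hypothetical helper name.

# Returns 0 if the device opens, 2 if the node is missing or not a character
# device, and another non-zero status if the open is refused.
can_open_tun() {
    path="${1:-/dev/net/tun}"
    # Require an existing character device so the redirection below can never
    # create a regular file by accident.
    [ -c "$path" ] || return 2
    # Open read-write in a subshell; the descriptor closes when it exits.
    ( exec 3<>"$path" ) 2>/dev/null
}

if can_open_tun; then
    echo "/dev/net/tun can be opened"
else
    echo "/dev/net/tun is missing or access is denied"
fi
```

A return code of 2 means the device node is missing rather than blocked.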

@josh-blake
Author

The blunt-tool workaround is to disable "Protection Mode" for the container. This effectively allows unrestrained access from the container to the host system. This confirms it's a container permissions issue, likely relating to access to /dev/net/tun device and the device tree itself.

Make sure you have the OTBR addon installed and configured (ie device is selected etc.)

sudo nano /usr/share/hassio/addons.json (substitute your own HA data directory if it differs; for most Supervised installs this is the path). Use whichever text editor you prefer.

Scroll to the OTBR container entry and go to the "full_access": false, key-value pair and change false to true

Save the file and exit.

DO NOT DO ANYTHING MORE IN HA. Run a sudo reboot to restart your system. This will now enable a "Protection Mode" tab in your addon. Disable "Protection Mode" and start the addon. It should now load.

@Sourcer63

Sourcer63 commented Dec 14, 2024

This will now enable a "Protection Mode" tab in your addon.

Will this survive future HA updates?

@josh-blake
Author

josh-blake commented Dec 14, 2024

No, it will not.

There is a slightly better way of doing this, which involves editing the config.yaml file for the addon itself. You must completely uninstall the addon first. The supervised addon installer uses a config file to define the addons.json file.

sudo nano /usr/share/hassio/addons/core/openthread_border_router/config.yaml

To this file, add: full_access: true - save, exit.

Restart your system. Reinstall the addon. This will now add the Privileged Mode access to the tab, and should survive an uninstall and reinstall of the container. However, it will not survive an update!

The issue is broader than just this addon. As I suspected, the runc version change has affected containerd.io, which effectively removes access to /dev/net/tun for containers that only have CAP_NET_ADMIN. Please see these related posts: here and here

Unfortunately adding:

devices:
  - /dev/net/tun

to the addon config.yaml does NOT resolve the issue.

So at this point, the container has to be run in privileged mode to allow it to access the tun device tree. sighs

@nodamnway

nodamnway commented Dec 14, 2024

Unfortunately adding:

devices:
  - /dev/net/tun

to the addon config.yaml does NOT resolve the issue.

So at this point, the container has to be run in privileged mode to allow it to access the tun device tree. sighs

Could you please double check? Because as for me, just adding /dev/net/tun to devices is enough.
As you probably already know, this is the recommended solution.

What I did:

  • saved my old config for OTBR addon
  • Uninstalled OTBR addon with Also permanently delete this addon's data
  • Modified /usr/share/hassio/addons/core/openthread_border_router/config.yaml with something like
--- a/openthread_border_router/config.yaml
+++ b/openthread_border_router/config.yaml
@@ -20,6 +20,8 @@ host_uts: true
 privileged:
   - IPC_LOCK
   - NET_ADMIN
+devices:
+  - /dev/net/tun  
 image: homeassistant/{arch}-addon-otbr
 init: false
 options:
  • Went to the Add-on Store, checked for updates via 3-dot menu
  • installed OTBR addon again, restored config and ran it

After that, it just worked (on Debian 12, HA Supervised, SLZB-06m as border router).

@Sourcer63

You need a certain tech level to apply such changes, e.g. knowing where the OTBR config info is stored. That is not exactly what the average HA user is able to do.

Keeping containerd.io at 1.7.23 and blocking it from upgrading seems both easier and safer.

@darkxst
Contributor

darkxst commented Dec 15, 2024

  • Modified /usr/share/hassio/addons/core/openthread_border_router/config.yaml

Don't modify the installed core addon. Instead, copy the /usr/share/hassio/addons/core/openthread_border_router/ folder to /usr/share/hassio/addons/local/openthread_border_router

Then edit config.yaml:

  • remove the image: line
  • add any other edits you want to the configuration
  • save

Now go to the Add-on Store, and you should have a "Local" OpenThread Border Router addon that you can install (you may need to force-refresh the cache if it doesn't show up).

https://developers.home-assistant.io/docs/add-ons/testing/ (only the path differs on Supervised installs)
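
The copy-and-edit steps above can be sketched as a shell function. This is only an illustration under assumptions: `make_local_addon` is a hypothetical helper, the paths are the ones given in this thread for a Supervised install, and `sed -i` is used in its GNU form as shipped with Debian.

```shell
#!/bin/sh
# Sketch: copy an add-on folder and drop the image: line from the copy's
# config.yaml so the published container is not pulled.
# make_local_addon is a hypothetical helper name.
make_local_addon() {
    src="$1"
    dst="$2"
    [ -f "$src/config.yaml" ] || { echo "no config.yaml in $src" >&2; return 1; }
    mkdir -p "$dst" &&
    cp -R "$src/." "$dst/" &&
    # GNU sed: delete the top-level image: key in place (copy only).
    sed -i '/^image:/d' "$dst/config.yaml"
}

# Example (paths as given in this thread; run as root):
#   make_local_addon /usr/share/hassio/addons/core/openthread_border_router \
#                    /usr/share/hassio/addons/local/openthread_border_router
```

The original core folder is left untouched, so any further edits go into the local copy's config.yaml.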

@josh-blake
Author

Unfortunately adding:

devices:
  - /dev/net/tun

to the addon config.yaml does NOT resolve the issue.
So at this point, the container has to be run in privileged mode to allow it to access the tun device tree. sighs

Could you please double check? Because as for me, just adding /dev/net/tun to devices is enough. As you probably already know, this is the recommended solution.

What I did:

  • saved my old config for OTBR addon
  • Uninstalled OTBR addon with Also permanently delete this addon's data
  • Modified /usr/share/hassio/addons/core/openthread_border_router/config.yaml with something like
--- a/openthread_border_router/config.yaml
+++ b/openthread_border_router/config.yaml
@@ -20,6 +20,8 @@ host_uts: true
 privileged:
   - IPC_LOCK
   - NET_ADMIN
+devices:
+  - /dev/net/tun  
 image: homeassistant/{arch}-addon-otbr
 init: false
 options:
  • Went to the Add-on Store, checked for updates via 3-dot menu
  • installed OTBR addon again, restored config and ran it

After that, it just worked (on Debian 12, HA Supervised, SLZB-06m as border router).

Yes - I thought I was losing my mind here for a second - as I tried it and the container loaded (I did this last evening too although restarted before trialling). Adding /dev/net/tun in the config.yaml file only seems to work transiently; I think it's cause supervisor is loading an old image (ie with privileged access) or permission set rather than something inherent to /dev/net/tun. This doesn't seem to survive a system restart. Can you confirm this?

I can reliably have the addon load when privilege mode is enabled for the container instead.

@darkxst
Contributor

darkxst commented Dec 15, 2024

I think it's cause supervisor is loading an old image (ie with privileged access) or permission set rather than something inherent to /dev/net/tun. This doesn't seem to survive a system restart. Can you confirm this?

editing the core addon, it could still be downloading the docker image instead of rebuilding the addon. Follow my steps above to install as local addon (with the image: line removed to avoid pulling container). Does it work then?

@josh-blake
Author

I think it's cause supervisor is loading an old image (ie with privileged access) or permission set rather than something inherent to /dev/net/tun. This doesn't seem to survive a system restart. Can you confirm this?

editing the core addon, it could still be downloading the docker image instead of rebuilding the addon. Follow my steps above to install as local addon (with the image: line removed to avoid pulling container). Does it work then?

This does not make a difference as both a local and addon version are based on the same docker image (an Alpine image with the same dockerbuild file); it's the permissions set that the container is loaded with that dictates how it will behave - and this is called from both the config.yaml or build.yaml. Spinning off a local copy will simply limit interference from any version pushes or system upgrades (which may be a solution for some) - for others, simply unticking the Auto update option is probably sufficient.

@nodamnway

This doesn't seem to survive a system restart. Can you confirm this?

I cannot confirm this, it has survived several reboots and HA core upgrade on my machine. I guess it's only overwritten by an OTBR addon update.

editing the core addon, it could still be downloading the docker image instead of rebuilding the addon. Follow my steps above to install as local addon (with the image: line removed to avoid pulling container). Does it work then?

In my case it doesn't matter, the downloaded docker image is fine, there is no need to rebuild it, so I didn't bother with local copy of the addon (as any changes are easily recoverable from github sources).
The only changes that I needed were in the config.yaml, to change the arguments the container is started with.

@CoenWarmer

CoenWarmer commented Jan 7, 2025

I have tried modifying /usr/share/hassio/addons/core/openthread-border-router/config.yaml by adding the devices: /dev/net/tun block, and I’ve tried creating a local copy of the add-on and rebuilding the image by copying the folder into /usr/share/hassio/addons/local and modifying the config.yaml file there (removed the image: value, added the tun block as well).

After installing and configuring, both return the same error:

49d.21:00:42.056 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted

I have restarted Home Assistant. Should I restart the machine as well in order for this to work?

@Wheemer

Wheemer commented Jan 9, 2025

When will the problem be actually repaired so we do not need to hack things to have a working system?

@agners
Member

agners commented Jan 13, 2025

Sorry for noticing this late. Since this is a Home Assistant Add-on issue, this should actually be an issue reported at https://github.com/home-assistant/addons/issues.

This actually got resolved with the containerd.io Debian package 1.7.25-1. Just in case containerd reverts back to not adding support I've also merged the add-on change to explicitly add permissions for the tun device (see home-assistant/addons#3864). Thanks for digging out the details of why this happened on Supervisor and the fix!

agners closed this as completed on Jan 13, 2025

@Wheemer

Wheemer commented Jan 14, 2025

I just updated to the latest version and there is no change. It still crashes with the same error.

"50d.22:11:56.846 [C] Platform------: platformConfigureTunDevice() at netif.cpp:2022: Operation not permitted
[03:37:41] WARNING: otbr-agent exited with code 5 (by signal 0)."

EDIT: Apologies, I just updated the system to pull the latest containerd and it's working. Thanks!!

@agners
Member

agners commented Jan 14, 2025

Uh, either one, the OTBR update or the latest containerd, should solve the problem. I am a bit surprised that you still saw the issue after updating the OTBR. In my tests, containerd.io 1.7.24-1 and the OTBR 2.12.4 did work here 🤔

@Wheemer

Wheemer commented Jan 14, 2025

Not sure how I can help, but either way the issue does seem resolved. I don't think people can expect things to continue working unless they install the updates.

@agners
Member

agners commented Jan 14, 2025

I've looked a bit closer again with my setup, and I realized that when Supervisor gets started without the tun module loaded, it ignores the devices configuration:

2025-01-14 12:54:24.849 DEBUG (MainThread) [supervisor.docker.addon] Ignore static device path /dev/net/tun

So essentially, loading the tun module on startup before starting Supervisor would probably make things work as well.

This seems to be limited to Supervised, since Home Assistant OS does not configure it as a module (CONFIG_TUN=y). It all gets quite theoretical, especially since containerd.io 1.7.25-1 anyways adds permission by default again. However, I might still fix this on Supervisor side so that containerd.io 1.7.24-1 would work as well, or any containerd.io version which decides to drop tun by default again.
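
Loading the tun module before Supervisor starts could be sketched roughly as follows. This is an untested illustration, not the fix that was merged: the helper names are hypothetical, it assumes a systemd-based host that reads /etc/modules-load.d at boot, and it assumes tun is built as a loadable module (a built-in driver may not show up under /sys/module).

```shell
#!/bin/sh
# Sketch: check for the tun kernel module and, optionally, load it now and
# on every boot. Helper names are hypothetical.

# Returns 0 when the tun module is listed under <sysfs root>/module.
tun_module_loaded() {
    [ -d "${1:-/sys}/module/tun" ]
}

# Loads tun immediately and registers it for loading at boot.
# Needs root; it is defined here but not called automatically.
enable_tun_module() {
    modprobe tun &&
    printf 'tun\n' > /etc/modules-load.d/tun.conf
}

if tun_module_loaded; then
    echo "tun module is loaded"
else
    echo "tun module is not loaded; run enable_tun_module as root, then restart Supervisor"
fi
```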

@Wheemer

Wheemer commented Jan 14, 2025

Thanks so much for your commitment to the community, and your foresight.
