You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I noticed that the rtnl listener callback I setup in a ucode script would appear to randomly "die", without any error message and while leaving the rest of the script operating normally.
After a bit of digging I think I have tracked it down to the point where it seems to be a resource exhaustion of some sort: the bug can be reproduced using the attached ucode script, which sets up a simple listener on RTNLGRP_NEIGH that prints the received messages.
Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" (hundreds) number of messages being delivered. The script will typically appear to hang after printing anywhere between 0 and the first few of the delete messages ("cmd": 29), with no error what so ever.
(values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (when more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when thresh1 is set to e.g. 128, which typically results GC kicking more frequently and in only a few dozen entries being pruned at once on a typical GC run, so the problem only seems to occur when a certain threshold number of messages occur "at once".
I provide a memdump of the script taken after the hang.
Provided that the "netcat" package is installed, that the LAN IP is 192.168.1.1 and almost no client devices are present, the following script will trigger the bug:
#!/bin/sh
sysctl -w net.ipv4.neigh.default.gc_thresh1=512
sysctl -w net.ipv4.neigh.default.gc_thresh2=2048
sysctl -w net.ipv4.neigh.default.gc_thresh3=4096
foriin$(seq 2 254);doecho""| netcat -c -u 192.168.1.$i 65534 # create a large number of NUD FAILED neighboursdone
sleep 5
sysctl -w net.ipv4.neigh.default.gc_thresh1=128
I noticed that the rtnl listener callback I setup in a ucode script would appear to randomly "die", without any error message and while leaving the rest of the script operating normally.
After a bit of digging I think I have tracked it down to the point where it seems to be a resource exhaustion of some sort: the bug can be reproduced using the attached ucode script, which sets up a simple listener on
RTNLGRP_NEIGH
that prints the received messages.Everything goes well until the neigh garbage collector kicks in and deletes a large number of neigh entries, resulting in a "large" (hundreds) number of messages being delivered. The script will typically appear to hang after printing anywhere between 0 and the first few of the delete messages (
"cmd": 29
), with no error what so ever.On a system where the neigh GC is set like so:
(values fairly typical for a busy router), the garbage collector may delete hundreds of entries in one go when it kicks in (when more than 512 entries have been created), triggering the hang. I have not been able to reliably reproduce this bug when
thresh1
is set to e.g. 128, which typically results GC kicking more frequently and in only a few dozen entries being pruned at once on a typical GC run, so the problem only seems to occur when a certain threshold number of messages occur "at once".I provide a memdump of the script taken after the hang.
rtnlbug.uc.txt
ucode.1703872887.23407.memdump.txt
The text was updated successfully, but these errors were encountered: