Frequent Reboots under 2020.1 with ar71xx-* #1982
Comments
Can you acquire a crash trace from the router's console? Otherwise it's pure guesswork what is going on here. |
I can confirm more frequent reboots since v2020.1. https://stats.darmstadt.freifunk.net/dashboard/snapshot/cb8TAPiqn16keAicL3eVp32z7hyFXlIB?orgId=1 |
We (Braunschweig) are seeing this as well. It also happens on mostly idle nodes with only a few batman neighbors (~3). I've captured only one Oops so far, but will try to collect more:
The trace locations were decoded using addr2line. |
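Decoding a trace with addr2line first requires pulling the raw addresses out of the log. The following is a minimal sketch of that step, assuming call-trace lines of the form `[<80123456>] symbol+0x10/0x20`; the function name `parse_trace_addr` is hypothetical and not part of any tool mentioned in this thread.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: extract the address from one call-trace line.
 * Assumes the layout "[<hexaddr>] symbol+off/len". Returns 1 and stores
 * the address in *addr on success; returns 0 (leaving *addr untouched)
 * if the line carries no such token.
 */
int parse_trace_addr(const char *line, unsigned long *addr)
{
    const char *start = strstr(line, "[<");
    unsigned long value;
    char *end;

    if (!start)
        return 0;
    start += 2;

    /* Reject whitespace or signs that strtoul() would silently accept. */
    if (!isxdigit((unsigned char)*start))
        return 0;

    value = strtoul(start, &end, 16);

    /* The hex digits must be closed by ">]". */
    if (end[0] != '>' || end[1] != ']')
        return 0;

    *addr = value;
    return 1;
}
```

Each extracted address can then be passed to `addr2line -e vmlinux <address>`, provided the vmlinux file matches the running kernel and contains debug info.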
Here are some more crashes.txt and the corresponding vmlinux.gz. |
Do you have an idea what triggers this trace? |
None so far. :/ It is very random (every few days to a few every day).
The traces all seem to touch fs/open.c, so it's not completely random... |
A crash log from FFM (TL-WR1043N/ND v4, currently on master gluon-v2020.1-73-gc5f43add):
|
Arrrg ... my node uptime monitoring failed, so yes, we are affected as well:
|
This is the same spot we're hitting. |
Looking at the reboot statistics of Freifunk Karlsruhe, I also see a lot of reboots for devices in the ath79-generic target (devolo wifi pro 1200e and devolo wifi pro 1200i). I'll try to obtain a crash log from a device in the ath79-generic target as soon as possible. |
I see this problem on a TP-Link TL-WR940N v4 which was upgraded from 2019.1.1 to 2020.1. |
Here are a couple I have access to: --EDIT |
@kpanic23 Do you still have the corresponding vmlinux, so you could use addr2line to decode the addresses? |
I have no idea why I'm missing the line numbers. I've attached my vmlinux file to my prior post (#1982 (comment)), maybe someone else can have a look? |
Your vmlinux doesn't contain debug info. Did you use We have the following change in our gluon repo:

    diff --git a/targets/generic b/targets/generic
    index 65982ef4bfbd..9b34563e5d54 100644
    --- a/targets/generic
    +++ b/targets/generic
    @@ -44,6 +44,8 @@ config 'CONFIG_PACKAGE_ATH_DEBUG=y'
     config '# CONFIG_KERNEL_IP_MROUTE is not set'
     config '# CONFIG_KERNEL_IPV6_MROUTE is not set'
    +config 'CONFIG_COLLECT_KERNEL_DEBUG=y'
    +
     try_config 'CONFIG_TARGET_MULTI_PROFILE=y'
     try_config 'CONFIG_TARGET_PER_DEVICE_ROOTFS=y'

This produces an |
Using only the symbols, I found: 841v9-crashlog.txt
841v10-crashlog.txt
unifi-crashlog.txt
|
We have disabled uhttpd on a node as @blocktrron suggested, but it doesn't really seem to help: https://freifunk.fail/d/000000002/nodes?orgId=1&var-name=60487-FrickelFritze&from=now-7d&to=now |
This reproducibly crashes an AC Mesh after around 1-2 million iterations (1-2 minutes).
So this might be helpful to determine whether something fixed the issue or not. |
@blocktrron what is the best way to compile this? |
Download the OpenWrt 19.07 SDK for ar71xx and compile it using the GCC from |
The aforementioned snippet still seems to crash ar71xx when using next. However, we have passed >15 million iterations on an ath79 board without any issues so far (kernel 4.19). Will try to get a crash log tomorrow. |
@blocktrron Is /lib/firmware/wireless a path that should exist? On a Unifi AC Lite I have only these files/folders.
|
Got it to crash on a Ubiquiti Unifi AC Lite with @blocktrron's code after 1,919,677 iterations.
|
Produced another crash on a TP-Link TL-WR842v3. |
You should set CROSS_COMPILE to a sensible value when using the decode script, dumping MIPS machine code as x86-64 assembly is not very helpful (only affects the assembly dump part of the log though). |
I decoded the traces again and used openwrt-sdk-19.07.2-ar71xx-generic_gcc-7.5.0_musl.Linux-x86_64 in CROSS_COMPILE. crash1-unifi-ac-lite-decoded2.txt |
So I've had a short look at this issue, and it seems the trap occurs exactly on the return from |
Well, this is just great. From the MIPS 24K manual:
I think it is likely that the error actually occurs in |
I recommend using this variant of @blocktrron's test program (only one line of output every 2^16 loop cycles, allowing it to run much faster):

    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>

    char buf[255];
    char path[] = "/lib/firmware/wireless";

    int main(void)
    {
        /* Loop forever; the kernel crash is what ends the run. */
        for (unsigned i = 0; 1; i++) {
            readlink(path, buf, sizeof(buf));
            /* Print a progress marker every 2^16 iterations. */
            if (i % 65536 == 0)
                printf("%08x\n", i);
        }
        return 0;
    }

I have also found a potential fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/slub.c?id=0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5 Backporting this patch from Linux 4.19 makes the issue disappear on my TL-WR841N v9 (running for ~300M loop cycles so far). I'll open a PR, so the change can get a bit more testing. |
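For comparing a patched build against an unpatched one, a bounded variant of the loop may be easier to script than an endless one. This is only a sketch: the function name `stress_readlink` and the iteration parameter are hypothetical, and it assumes the same `readlink()` call and progress output as the test program above.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical bounded variant of the test loop: calls readlink() on
 * `path` exactly `iterations` times and returns how many of those
 * calls reported an error (for example because the path does not
 * exist or is not a symlink).
 */
unsigned long stress_readlink(const char *path, unsigned long iterations)
{
    char buf[255];
    unsigned long failed = 0;

    for (unsigned long i = 0; i < iterations; i++) {
        if (readlink(path, buf, sizeof(buf)) < 0)
            failed++;

        /* Progress marker every 2^16 iterations, flushed so it
         * survives on the console if the device goes down. */
        if (i % 65536 == 0) {
            printf("%08lx\n", i);
            fflush(stdout);
        }
    }
    return failed;
}
```

Called from a small `main()` with, say, 20 million iterations, the signal of interest is simply whether the call returns at all; the error count is secondary, since the path does not have to exist for the lookup to be exercised.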
This patch fixes a regression introduced in kernel v4.14. While the commit message only mentions a performance penalty, the issue is suspected to be the cause of spurious data bus errors on MIPS CPUs (ar71xx target). Fixes: #1982
Maybe some of these reboots are also due to #2032. |
In Frankfurt, there are reports of nodes often rebooting since 2020.1.
Some nodes are collected in https://md.margau.net/ffffm-nodereboot, and it seems that the problem is related to ar71xx-*. Our stats, e.g. https://freifunk.fail/d/000000002/nodes?orgId=1&from=now-90d&to=now&var-name=ORPLID-SEE, suggest that it is not a memory leak.
Bug report
Nodes are frequently rebooting without any visible cause.
What is the expected behaviour?
Nodes should not reboot. This was not a problem before 2020.1.
Gluon Version:
Tag 2020.1 and Tag 2020.1.1
Site Configuration:
https://github.com/freifunk-ffm/site-ffffm
Custom patches: