ath79-generic (e.g. WR1043ND v4) - WLAN Mesh broken when upgrading to v2022.1.x because of timing issue in boot process #2779

Open
rotanid opened this issue Feb 4, 2023 · 23 comments

@rotanid
Member

rotanid commented Feb 4, 2023

When upgrading from Gluon v2021.1.x to v2022.1.x wlan mesh doesn't work anymore on a TP-Link TL-WR1043ND v4.
The 802.11s mesh interface is shown in "iwinfo" but not on the status page or "batctl if"
The upgrade process was tested from latest v2021.1.x branch (fresh install) to Gluon v2022.1, v2022.1.1 and v2022.1.2.

The problem does not appear when flashing with "forget settings" and reconfiguring the v2022.1.x firmware from scratch.
The problem does not appear with WR1043ND v2 or v3.

A problem with the migration from ar71xx-generic to ath79-generic may have happened, although @AiyionPrime stated in #2431 that everything was working fine - so maybe the issue was introduced later than v2022.1(.0)?

@AiyionPrime
Member

AiyionPrime commented Feb 4, 2023

This is a v4, that's been active and updated for the last four years: https://hannover.freifunk.net/karte/#/en/map/8416f99bd2d0

With vH31 it is currently running gluon v2022.1.1 (ab1fb05).

Meshing does work and can be seen without problems on its status page:
http://[2001:678:978:213:8616:f9ff:fe9b:d2d0]/cgi-bin/status

Though I certainly might have overlooked something, I think the migration was fine.
Our sample size is fairly limited though; most of the routers (75) are still on vH25 (ar71xx)
and only three or four are ath79.

@rotanid
Member Author

rotanid commented Feb 4, 2023

tom reported on IRC that in darmstadt there was a similar issue with a TL-WR1043N v5:
https://forum.darmstadt.freifunk.net/t/meshausfall-nach-2-6-0-in-dieburger-innenstadt/944

@rotanid rotanid added the 3. topic: hardware Topic: Hardware Support label Feb 5, 2023
@rotanid
Member Author

rotanid commented Feb 6, 2023

i just tested the tag v2022.1 and the issue also happens with the initial release of this branch.

@AiyionPrime can you be sure the linked device was never reconfigured manually? one probably cannot see this from the available data.

i'm currently trying to get my hands on a TL-WR1043N v5 to check if i can reproduce it.

it would also help to check it on a second TL-WR1043ND v4 - does anyone have one lying around and can test?

@AiyionPrime
Member

AiyionPrime commented Feb 6, 2023

No, I can not. We can send the owner an email and ask though, if that helps.

@Djfe
Contributor

Djfe commented Feb 6, 2023

Just some ideas. If they don't apply, then pls bear with me :)
@rotanid Have you taken a look at what logread returns? (Maybe there are errors in there?)
Once anyone gets their hands on an affected device: Is this reproducible on a fresh install? (Install 2021, set a few things in config mode, boot once, then update to 2022)

@rotanid
Member Author

rotanid commented Feb 7, 2023

> Have you taken a look at what logread returns? (Maybe there are errors in there?)

serious question? well ok: no relevant info there as far as i can see.

> Once anyone gets their hands on an affected device: Is this reproducible on a fresh install? (Install 2021, set a few things in config mode, boot once, then update to 2022)

that i have already done and also written in my bug report. i added "(fresh install)" now, as it seems it wasn't clear enough.

@Djfe
Contributor

Djfe commented Feb 7, 2023

ok my first suggestions were very basic. 😑😅

You probably figured this out, but maybe you can replicate what they did in Darmstadt and compare configs before and after saving config mode:
1. install 2021 fresh
2. config mode 2021, save
3. update 2021 to 2022
4. get relevant info (uci show, /etc/config/wireless, ...)
5. config mode 2022, save
6. get relevant info (uci show, /etc/config/wireless, ...) again
7. compare the two
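
The "compare the two" step could be done with a small script like the following. This is only a sketch: the function name `diff_uci_dumps` and the sample dump contents are made up for illustration, and it assumes the two `uci show` outputs have already been copied off the device as text.

```python
import difflib


def diff_uci_dumps(before: str, after: str) -> list[str]:
    """Return only the added ('+') and removed ('-') lines between two dumps."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="after-upgrade", tofile="after-config-mode",
        lineterm="", n=0,
    ))
    # The first two entries are the '---' / '+++' file headers; skip them,
    # then drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real device.
    after_upgrade = "network.lan=interface\nnetwork.lan.proto='static'"
    after_config_mode = after_upgrade + "\nnetwork.example=interface"
    for line in diff_uci_dumps(after_upgrade, after_config_mode):
        print(line)
```

Sorting both dumps before diffing may help if the section order changes between runs.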

@Djfe
Contributor

Djfe commented Feb 7, 2023

The only commit that happened after adding the device would be
openwrt/openwrt@e826b64

Looking back at the old definition:
https://github.com/openwrt/openwrt/blob/openwrt-19.07/target/linux/ar71xx/files/arch/mips/ath79/mach-tl-wr1043nd-v4.c#L62-L67
absolute MAC offsets on flash are v4 0x1ff50008 vs v5 0x1ff00008
The MACs are evidently placed in the same partition, but the partition has a different offset.
In the commit the definition is moved from the dtsi down to the correct dts files.
This should have improved upgrading from the old ar71xx target, but maybe the commit broke something else.
They also moved the calibration data for wifi along.

Maybe I found the issue?

#define TL_WR1043_V4_EEPROM_ADDR		0x1fff0000
#define TL_WR1043_V4_WMAC_CALDATA_OFFSET	0x1000

Old address of calibration data 0x1fff1000

New definition based on art partition, but art != eeprom(?)

art: partition@ff0000
mtd-cal-data = <&art 0x1000>;

New address of calibration data 0x1ff01000 (I added 0x1 manually, I assume it is calculated this way)

Possible fix?
mtd-cal-data = <&art 0xf1000>;
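
The address arithmetic can be checked with a few lines of Python. This is a sketch under one loud assumption: that the SPI flash is memory-mapped at 0x1f000000 on this SoC, so that a partition offset of 0xff0000 corresponds to absolute address 0x1fff0000. `FLASH_BASE` and `absolute` are names introduced here for illustration only. If the assumption holds, the old and new calibration addresses come out identical, and the proposed offset 0xf1000 would point past the start of the old EEPROM region by far more than 0x1000.

```python
FLASH_BASE = 0x1F000000  # ASSUMPTION: memory-mapped base of the SPI flash


def absolute(partition_offset: int, offset_in_partition: int,
             base: int = FLASH_BASE) -> int:
    """Absolute address of an offset inside a flash partition."""
    return base + partition_offset + offset_in_partition


# Old ar71xx definition: TL_WR1043_V4_EEPROM_ADDR + TL_WR1043_V4_WMAC_CALDATA_OFFSET
old_caldata = 0x1FFF0000 + 0x1000

# New ath79 definition: art: partition@ff0000, mtd-cal-data = <&art 0x1000>
new_caldata = absolute(0xFF0000, 0x1000)

# Proposed change: mtd-cal-data = <&art 0xf1000>
proposed = absolute(0xFF0000, 0xF1000)

print(hex(old_caldata), hex(new_caldata), hex(proposed))
# 0x1fff1000 0x1fff1000 0x200e1000
```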

@rotanid
Member Author

rotanid commented Feb 7, 2023

the only difference in config i could find is as follows:

> config interface 'mesh_radio0'
> 	option proto 'gluon_mesh'
> 

looking at the code, there may be an issue with "get_wlan_mac" during the upgrade from 2021.1.x and therefore the upgrade script returns before setting up the above section.
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L129

this correlates with what @Djfe found in OpenWrt.

can anyone else follow this reasoning? @blocktrron @NeoRaider @adschm ?

@Djfe
Contributor

Djfe commented Feb 8, 2023

I feel like we should print an error when there is no wmac to be found (Lines 130-131).
This error could happen again for other devices and is best caught by observing some form of log.
Since logread is probably no option for init scripts, could we create a script that writes a log to flash that is only overwritten the next time init scripts are run? (yes, there are devices with small flash storage, so we either have to keep the log small and pipe it through gzip on write after init is complete, or we disable this for tiny style targets)

Such a log could be useful for adding new devices, too.
It also allows catching mistakes in new code/dts files regarding the initialization.
It would already be useful just for all the silent returns in the lua file @rotanid linked above.
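
A rough sketch of the small, gzip-compressed, overwritten-on-each-run log described above. It is written in Python purely for illustration (Gluon's upgrade scripts are Lua), and `UpgradeLog`, `read_log`, the line limit and the message format are all hypothetical, not an existing Gluon facility.

```python
import gzip
from collections import deque


class UpgradeLog:
    """Collects a bounded number of messages and writes them gzip-compressed."""

    def __init__(self, path: str, max_lines: int = 200):
        self.path = path
        # deque(maxlen=...) silently drops the oldest entries, keeping the log small.
        self.lines = deque(maxlen=max_lines)

    def note(self, script: str, message: str) -> None:
        self.lines.append(f"{script}: {message}")

    def flush(self) -> None:
        # Mode "wt" truncates, so each run overwrites the previous run's log.
        with gzip.open(self.path, "wt", encoding="utf-8") as fh:
            for line in self.lines:
                fh.write(line + "\n")


def read_log(path: str) -> list[str]:
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()
```

Each silent `return` in an upgrade script would then become a `note(...)` followed by the return, with a single `flush()` once all scripts have run.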

@rotanid rotanid changed the title WR1043ND v4 - WLAN Mesh broken when upgrading to v2022.1.x WR1043ND v4 WR1043N v5 - WLAN Mesh broken when upgrading to v2022.1.x Feb 9, 2023
@rotanid
Member Author

rotanid commented Feb 9, 2023

i bought a TL-WR1043N v5 and tested it.
this device has exactly the same problem :-(
it was therefore erroneously tested as "working" in #2483

@rotanid
Member Author

rotanid commented Feb 9, 2023

> Possible fix? mtd-cal-data = <&art 0xf1000>;

@Djfe i tested this, it doesn't work and instead soft-bricks the device on upgrade

@rotanid
Member Author

rotanid commented Feb 13, 2023

after looking a bit into it with the help of rmilecki from openwrt it seems like the issue may not be in the OpenWrt dts ...

@rotanid
Member Author

rotanid commented Feb 15, 2023

after a discussion in today's Gluon meetup we also want to debug the band migration:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L211
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/005-wireless-migration#L10
One idea was to add "prints" or write the debug output to some persistent file to find out which line breaks the upgrade scripts.

@lemoer lemoer added this to the 2022.1.3 milestone Feb 15, 2023
@rotanid
Member Author

rotanid commented Feb 19, 2023

so after many hours i'm closer to the problem - without a solution.

during first boot after upgrade when the upgrade scripts run, in 200-wireless the call to get_htmode fails and therefore no config update (the lines after) is written:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L198
get_htmode fails in this line when trying to find the phy
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L80
find_phy from wireless.lua fails here:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L56
this is the call to find_phy_by_path.
during this first boot-upgrade-run the path is set to platform/qca956x_wmac but there is no path containing qca956x_wmac in /sys/devices/platform , therefore, neither of the following lines in find_phy_by_path can find any phy.
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L22
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L27

the actual path of the phy would be /sys/devices/platform/ahb/18100000.wmac (at least on TL-WR1043ND v4) and later this path is correctly set.
therefore, an additional run of the 200-wireless script fixes the problem.
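
To illustrate why the stale path finds nothing, here is a simplified stand-in for a path-based phy lookup. It is not Gluon's `find_phy_by_path` (which is Lua); the glob patterns and the `sysfs_devices` parameter are assumptions made so the sketch can run against a temporary directory instead of a real /sys/devices.

```python
import glob
import os
from typing import Optional


def find_phy_by_path(sysfs_devices: str, path: str) -> Optional[str]:
    """Return the phy name found under the given device path, or None."""
    # Try the path as given, then the same path below 'platform/'.
    for candidate in (path, os.path.join("platform", path)):
        pattern = os.path.join(sysfs_devices, candidate, "ieee80211", "phy*")
        matches = sorted(glob.glob(pattern))
        if matches:
            return os.path.basename(matches[0])
    return None
```

With a tree that contains only `platform/ahb/18100000.wmac/ieee80211/phy0`, a lookup for `platform/qca956x_wmac` returns `None`, while `platform/ahb/18100000.wmac` returns the phy.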

seems to me like a timing issue during first boot after the sysupgrade.
anyone with further ideas for debugging/fixing , e.g. @blocktrron @NeoRaider ?

@rotanid
Member Author

rotanid commented Feb 20, 2023

after talking about it with @NeoRaider on IRC we found out that it may be a timing issue.
in rare cases the hotplug.d scripts seem to be run too late in procd context and therefore an important migration for ar71xx->ath79 is missing when gluon's upgrade scripts run.
this is the openwrt migration script:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/ath79/base-files/etc/hotplug.d/ieee80211/00-wifi-migration;h=f7393a0d0371bab38a70a7fdb93d558689c5c074;hb=refs/heads/openwrt-22.03
this should be run by procd.
the upgrade scripts are run by uci-defaults and this call starts it:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/etc/init.d/boot;h=749d9e971141c63542e220bbd5c175f40041b174;hb=refs/heads/openwrt-22.03#l50

i verified the theory by adding a 5 second delay in one of gluon's first upgrade scripts here:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/005-wireless-migration#L2

with that sleep-hack, the upgrade works fine!

so the already existing hack in OpenWrt seems to be too little:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/etc/init.d/boot;h=749d9e971141c63542e220bbd5c175f40041b174;hb=refs/heads/openwrt-22.03#l46

it would be nice to find a solution that doesn't depend on timing but is deterministic... maybe @NeoRaider comes up with an idea, otherwise we might need to add some seconds of sleep in Gluon
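
One middle ground between a fixed sleep and a fully deterministic fix would be to poll for the condition with an upper bound. The following is only a sketch of that idea in Python, not what the workaround does; `wait_for`, its parameters and the default values are hypothetical. It still depends on a timeout, so it narrows the race rather than removing it.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(probe: Callable[[], Optional[T]],
             timeout: float = 5.0,
             interval: float = 0.1,
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.monotonic) -> Optional[T]:
    """Call probe() until it returns something other than None or time runs out."""
    deadline = clock() + timeout
    while True:
        result = probe()
        if result is not None:
            return result
        if clock() >= deadline:
            return None  # caller should log this instead of returning silently
        sleep(interval)
```

Here `probe` would be whatever check tells the upgrade script that the migrated wireless path is in place, for example a phy lookup. In the common case the script continues as soon as the check succeeds instead of always waiting the full delay.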

@rotanid rotanid changed the title WR1043ND v4 WR1043N v5 - WLAN Mesh broken when upgrading to v2022.1.x ath79-generic (e.g. WR1043ND v4) - WLAN Mesh broken when upgrading to v2022.1.x because of timing issue in boot process Feb 20, 2023
rotanid added a commit that referenced this issue Feb 20, 2023
…initialisations

workaround for a timing issue during first boot on ath79-generic
after sysupgrade from ar71xx-generic image

GitHub Issue: #2779
@rotanid
Member Author

rotanid commented Feb 20, 2023

i created a pull request for the workaround: #2792

this issue stays as long as we have no deterministic fix

rotanid added a commit that referenced this issue Feb 25, 2023
wait for device initialisations
workaround for a timing issue during first boot on ath79-generic
after sysupgrade from ar71xx-generic image

GitHub Issue: #2779
github-actions bot pushed a commit that referenced this issue Feb 25, 2023
wait for device initialisations
workaround for a timing issue during first boot on ath79-generic
after sysupgrade from ar71xx-generic image

GitHub Issue: #2779

(cherry picked from commit d97673f)
@rotanid rotanid removed this from the 2022.1.3 milestone Feb 25, 2023
@rotanid
Member Author

rotanid commented Feb 25, 2023

removing the issue from the milestones as workarounds have been implemented.

@adschm
Contributor

adschm commented Mar 5, 2023

If I remember correctly, the wifi startup somehow happens asynchronously and you simply cannot depend on it during procd startup. That's why you have to rely on these hotplug.d scripts if you want to configure anything after they have come up. But I might be wrong, it's been a while since I dealt with this stuff.

JayBraker pushed a commit to JayBraker/gluon that referenced this issue Apr 12, 2023
wait for device initialisations
workaround for a timing issue during first boot on ath79-generic
after sysupgrade from ar71xx-generic image

GitHub Issue: freifunk-gluon#2779
@Djfe
Contributor

Djfe commented Sep 13, 2023

should we revert this commit now?
gluon master doesn't support upgrading from v2021.1.x any longer atm. (the bridges were burned unless anyone wants to step up and keep them maintained for the v2023.2 release)

@neocturne
Member

@Djfe I don't think this issue is specific to the update from 2021.1.x to 2022.1.x; it could easily occur on any upgrade that requires updating the wireless UCI config.

@rotanid
Member Author

rotanid commented Nov 7, 2023

@rotanid
Member Author

rotanid commented Nov 9, 2023

forget the above "fix", because jow wrote on IRC:

10:27:33 < jow > rotanid: well there is ieee80211 hotplug events which work for that. I think the reason for this particular hack is the fact that there's uci-defaults scripts which want to mangle the default wifi config
10:27:42 < jow > rotanid: and those uci-defaults script run very early
10:28:14 < jow > rotanid: a proper solution would be moving whatever logic is needed from uci-defaults into the wifi reconf code path

maybe someone has an idea how to implement this in order to replace the sleep hack
