Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent)
and requires power cycle
#2667
If the binary is corrupted/incorrect, it should still give a
It looks to me that only satellite 1 does not recognize any remote It is also possible that the master did not acquire a DRTIO link with satellite 1. Please provide UART logs for all core devices so we can check.
This works for me when in the same setting, and an expected |
The
After power cycle:
Attempting to flash satellite 2:
Afterwards:
I will check the UART logs next. |
Actually, first I tested rebooting all 3 Kasli SoCs, one after the other, and just rebooting any of the devices causes errors on other devices, see issue #2668 . I think there might be a problem with the programming of I will check the UART logs now. |
@occheung Thank you so much for looking into it! If I understand you correctly, then the SFP connection between master and satellites implements the error detection method cyclic redundancy check, but it does react with appropriate log messages to CRC errors and it does not implement any kind of error correction scheme. When a transmission error happens, it simply happens and one has to try transmitting again. Did I understand you correctly? P.S. How do I write arbitrary files to the SD card's storage using |
Auxiliary transmission has a CRC check per packet (< 1 KB in size). The check does not appear to have been triggered. We will investigate this further. You may use
@occheung Great! I am re-compiling with
Does that mean that no errors occurred or that errors are corrected? Could you please clarify if there is an error correction scheme in place? |
No errors were thrown (at least not in the beginning), but the variant names in the logs did not change:
I tried to re-compile our firmware with My goal here was to have a clearly different firmware version with It seems that I will have to wait for a fix to https://git.m-labs.hk/M-Labs/artiq-zynq/issues/356 before I can perform the test. @occheung Are you absolutely sure that |
@occheung Btw, I noticed something when
Possible explanation:
Does this make sense? |
Yes, there is no error correction. We only have error detection. DRTIO itself performs a CRC check whenever a broken-down packet is transmitted to the remote.
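To illustrate what "detection without correction" means in practice, here is a minimal sketch. It does not reproduce the actual DRTIO auxiliary packet layout or CRC polynomial, neither of which is described in this thread; Python's `zlib.crc32` and the 4-byte trailer are stand-ins chosen for illustration, and the `frame` and `check` helpers are hypothetical.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (illustrative layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(packet: bytes) -> bool:
    """Return True if the trailing CRC matches the payload."""
    payload = packet[:-4]
    received = int.from_bytes(packet[-4:], "big")
    return zlib.crc32(payload) == received


good = frame(b"flash chunk 0")

# Simulate a transmission error by flipping one bit of the payload.
corrupted = bytearray(good)
corrupted[3] ^= 0x01

print(check(good))              # the intact packet passes the check
print(check(bytes(corrupted)))  # the damaged packet is rejected, not repaired
```

The receiver can only tell that a packet is damaged; recovering the data requires the sender to transmit it again.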
Yes. The behavior of remote coremgmt flash is:
Since there appear to be no errors or panics in your log and description, the flash should indeed have updated your firmware. As an additional note, there is an auto reboot step. The
Yes, it will be reflected in the log. So it looks like your new issue is due to the new binary not being built? (Hence the old one was flashed, so the variant is the same.)
Your explanations have already spotted a few sources of latency. There are 2 more sources I would like to point out:
Your log already gives a rough estimate of how long it takes to get from the bootloader to an established DRTIO link, which is already 12 seconds. For RISC-V Kasli users: firmware is written to the flash instead. Writing the firmware to the flash likely takes far longer than 10 seconds.
Indeed. In fact, there won't even be a DRTIO link. We decided to take down the link during the reflashing stage because the satellite will not be able to respond to additional DRTIO messages, and the device is supposed to be rebooted anyway. If you want the core device log during reflash, you can get it via UART. It looks roughly like this when successful.
Then you will see the usual expected UART log from (re)booting the device. |
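Because the DRTIO link is down while the satellite reflashes and reboots, a host-side script cannot query the satellite immediately after flashing. One way to handle that is to poll until it answers again. The sketch below is only an illustration: it reuses the `artiq_coremgmt -D <host> -s <n> log` invocation from this report, assumes the tool exits with a non-zero status on failure, and the timeout and interval values are arbitrary guesses, not measured figures. The helper names are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def log_command(host: str, satellite: int) -> list[str]:
    """Build the log query used throughout this report."""
    return ["artiq_coremgmt", "-D", host, "-s", str(satellite), "log"]


def run_ok(cmd: Sequence[str]) -> bool:
    """Return True if the command exits with status 0 (assumed to mean success)."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def wait_for_satellite(
    host: str,
    satellite: int,
    timeout_s: float = 120.0,   # arbitrary upper bound, an assumption
    interval_s: float = 5.0,    # arbitrary polling interval, an assumption
    probe: Callable[[Sequence[str]], bool] = run_ok,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the satellite's log until it responds or the timeout expires."""
    cmd = log_command(host, satellite)
    deadline = clock() + timeout_s
    while True:
        if probe(cmd):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as `wait_for_satellite("192.168.1.30", 1)` would then return `True` once the satellite answers the log query again, or `False` if it stays silent for the whole timeout.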
Some packets that passed the packet CRC were corrupted as soon as they were deserialized into Packet enums. This seems reproducible only on Zynq satellites with WRPLL. No such issue has been observed on RISC-V WRPLL satellites.
Bug Report
crashed with
Afterwards, all commands were crashing on that satellite, including artiq_coremgmt -D 192.168.1.30 -s 1 log. Everything worked again after a power cycle, but the log command revealed that the variant had not been upgraded, i.e. the flash had failed.
Expected Behavior
Artiq beta-manual: Writing the flash clearly states
so I would expect
to succeed, provided that
.
contains a boot.bin file.
System info
Steps to reproduce
boot.bin files for all 3 Kasli SoCs.
boot.bin files and ran to flash the satellite at -s 1, which is connected to the master's SFP0.
Further details from afterwards
I tried to flash the master. These commands all succeeded and the log revealed the new variant name:
Afterwards, satellite 2 (untouched) still reacted correctly to artiq_coremgmt -D 192.168.1.30 -s 2 log, but satellite 1 (which I had tried to flash) crashed on artiq_coremgmt -D 192.168.1.30 -s 1 log with:
I tried to reboot satellite 1 (which I had tried to flash) with artiq_coremgmt -D 192.168.1.30 -s 1 reboot, but it crashed with:
I tried satellite 1's log again, but artiq_coremgmt -D 192.168.1.30 -s 1 log still crashed with:
After a power cycle, satellite 1 worked again, but the log command revealed the old variant name, not the new, desired one.
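The manual checks above (querying each satellite in turn and noting which ones crash) could be scripted along the following lines. This is a rough sketch, not part of ARTIQ: the `coremgmt`, `run`, and `survey` helpers are hypothetical, and the code assumes that `artiq_coremgmt` exits with a non-zero status and prints its error to stderr when a command fails.

```python
import subprocess


def coremgmt(host: str, satellite: int, action: str) -> list[str]:
    """Build an artiq_coremgmt command line as used in this report."""
    return ["artiq_coremgmt", "-D", host, "-s", str(satellite), action]


def run(cmd):
    """Run one command; return (succeeded, last line of stderr)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    lines = proc.stderr.strip().splitlines()
    return proc.returncode == 0, (lines[-1] if lines else "")


def survey(host, satellites, action="log", runner=run):
    """Run the same action against each satellite and collect the outcomes."""
    return {s: runner(coremgmt(host, s, action)) for s in satellites}
```

For the setup in this report, `survey("192.168.1.30", [1, 2])` would return one `(succeeded, message)` pair per satellite, which makes it easy to see at a glance which satellite stopped responding.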