Flashing a satellite Kasli SoC crashes with `Reply.Error (expected Reply.RebootImminent)` and requires power cycle #2667

dimitsev · 2025-02-03T14:12:48Z

Bug Report

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

crashed with

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 196, in main
    mgmt.flash(bins)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 221, in flash
    self._read_expect(Reply.RebootImminent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

Afterwards, all commands were crashing on that satellite, including artiq_coremgmt -D 192.168.1.30 -s 1 log. Everything worked again after a power-cycle but the log command revealed that the variant had not been upgraded, i.e. the flash had failed.

Expected Behavior

Artiq beta-manual: Writing the flash clearly states

artiq_coremgmt [-D IP_address] -s <destination_number> flash <afws_directory>

so I would expect

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

to succeed, provided that . contains a boot.bin file.

System info

We have 3x Kasli SoC v1.1.1: two satellites connected to one master in default topology.
All three Kasli SoCs are running artiq-zynq commit a0281e498927d5c8467dd18f28aa497cb75150fd from 23 December 2024.
Our host computer is running artiq commit d4bd4e587958a3ea249a7d4759d2ddf144d51a85 from 31 January 2025.

Steps to reproduce

We compiled artiq-zynq commit 63157588bb8ed214bc9f24ade2d09b4ddefada03 from 15 January 2025 into new boot.bin files for all 3 Kasli SoCs.
Then, we entered the directory of one of the new boot.bin files and ran

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

to flash the satellite at -s 1, which is connected to the master's SFP0.

Further details from afterwards

I tried to flash the master. These commands all succeeded and the log revealed the new variant name:

artiq_coremgmt -D 192.168.1.30 flash .
artiq_coremgmt -D 192.168.1.30 reboot
artiq_coremgmt -D 192.168.1.30 log

Afterwards, satellite 2 (untouched) still reacted correctly to artiq_coremgmt -D 192.168.1.30 -s 2 log but satellite 1 (which I had tried to flash) crashed artiq_coremgmt -D 192.168.1.30 -s 1 log with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)

I tried to reboot satellite 1 (which I had tried to flash) with artiq_coremgmt -D 192.168.1.30 -s 1 reboot but it crashed with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 199, in main
    mgmt.reboot()
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 199, in reboot
    self._read_expect(Reply.RebootImminent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

I tried satellite 1's log again, but artiq_coremgmt -D 192.168.1.30 -s 1 log still crashed with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)

After a power cycle, satellite 1 worked again but the log command revealed the old variant name, not the new, desired one.

The text was updated successfully, but these errors were encountered:

occheung · 2025-02-04T02:17:33Z

OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

If the binary is corrupted/incorrect, it should still give a RebootImminent message (with error LED after failing CRC). And,

satellite 1 (which I had tried to flash) crashed artiq_coremgmt -D 192.168.1.30 -s 1 log

It looks me that only satellite 1 does not recognize any remote coremgmt commands, as coremgmt commands report correctly for satellite 2. Could you check if flashing works for satellite 2?

It is also possible that the master did not acquire an DRTIO link with satellite 1, please provide us UART logs for all coredevices to check.

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

This works for me when in the same setting, and an expected ValueError when both RISC-V and Zynq binaries are detected:
ValueError: both risc-v and zynq binaries were found, please clean up your build directory.

dimitsev · 2025-02-04T05:36:21Z

@occheung

If the binary is corrupted/incorrect, it should still give a RebootImminent message (with error LED after failing CRC).

The ERR LED on the satellite Kasli SoC's front panel indeed turns on (glowing red) as soon as I call artiq_coremgmt -D 192.168.1.30 -s 1 flash .. The ERR LED only turns off again after power cycle.

It looks me that only satellite 1 does not recognize any remote coremgmt commands, as coremgmt commands report correctly for satellite 2. Could you check if flashing works for satellite 2?

I have obviously tried artiq_coremgmt -D 192.168.1.30 -s 2 flash . as well (in the directory with the appropriate binary). I have tried the flash command on both satellites multiple times, always after one or several power cycles and after verifying that the log command was working and that the start-up log was free from errors.
For both satellites, the flash command either crashes with OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent) or it completes without error. Which of these 2 alternatives happens is non-deterministic.
But the "flashed" satellite's ERR LED always turns on immediately after the flash command and every single command after that (such as artiq_coremgmt -D 192.168.1.30 -s 1 log or artiq_coremgmt -D 192.168.1.30 -s 2 log) always crashes (e.g. with OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)).

It is also possible that the master did not acquire an DRTIO link with satellite 1, please provide us UART logs for all coredevices to check.

If this was a non-deterministic problem with the DRTIO link, I would expect the log commands to also fail sometimes non-deterministically. However, they have never once failed on either satellite after 1-2 power cycles, i.e. after they start working again. And I called them multiple times in succession, just to check.

dimitsev · 2025-02-04T07:34:32Z

After power cycle:

~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005248s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020339s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.191967s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.200821s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.209944s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.405688s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.461999s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     0.473963s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.511293s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     0.538850s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     0.549098s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     0.561875s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     0.573243s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[     0.603360s]  INFO(runtime::comms): no idle kernel found
[     0.979998s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 2 packets
[     1.013117s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[     1.020848s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     1.027478s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[     1.034216s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[     4.256994s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     4.463992s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 1 packets
[     4.497206s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[     4.504959s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[     4.511649s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[     5.073092s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    46.995288s]  INFO(runtime::mgmt): received connection
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;kasli_soc_artiq_master_20250109_drtio_satellite_wrpll_
[     0.020948s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.361131s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.466864s]  INFO(satman): Switching SYS clocks...
[     0.521727s]  INFO(satman): SYS CLK switched successfully
[     0.714138s]  INFO(libboard_artiq::si549): Helper Si549 started
[     0.740633s]  INFO(satman): SED spreading disabled by default
[     0.771977s]  INFO(satman): uplink is up, switching to recovered clock
[     0.912916s]  INFO(libboard_artiq::si549::wrpll): WRPLL interrupt enabled
[     0.939693s]  INFO(libboard_artiq::si549::wrpll): testing the skew of SYS CLK...
[     0.952077s]  INFO(libboard_artiq::si549::wrpll): the skew of SYS CLK met the timing constraint
[     4.511209s]  INFO(satman): TSC loaded from uplink
[     4.524247s]  INFO(satman): rank: 1
[     4.527723s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;kasli_soc_artiq_master_20250109_drtio_satellite_wrpll_
[     0.020948s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.360959s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.466690s]  INFO(satman): Switching SYS clocks...
[     0.521553s]  INFO(satman): SYS CLK switched successfully
[     0.714071s]  INFO(libboard_artiq::si549): Helper Si549 started
[     0.739640s]  INFO(satman): SED spreading disabled by default
[     0.825818s]  INFO(satman): uplink is up, switching to recovered clock
[     0.966749s]  INFO(libboard_artiq::si549::wrpll): WRPLL interrupt enabled
[     0.993526s]  INFO(libboard_artiq::si549::wrpll): testing the skew of SYS CLK...
[     1.005912s]  INFO(libboard_artiq::si549::wrpll): the skew of SYS CLK met the timing constraint
[     1.266050s]  INFO(satman): TSC loaded from uplink
[     1.279000s]  INFO(satman): rank: 1
[     1.282472s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~/workspace ❯

Attempting to flash satellite 2:

cd to the appropriate directory.
artiq_coremgmt -D 192.168.1.30 -s 2 flash . completes without error.

Afterwards:

~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[    62.327674s]  INFO(runtime::mgmt): received connection
[    65.090870s]  INFO(runtime::mgmt): received connection
[    77.640481s]  INFO(runtime::mgmt): received connection
[    79.344857s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link is down
[    79.351150s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is down
[    87.439046s]  INFO(runtime::mgmt): received connection
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~/workspace ❯

I will check the UART logs next.

dimitsev · 2025-02-04T08:12:12Z

@occheung

Actually, first I tested rebooting all 3 Kasli SoCs, one after the other, and just rebooting any of the devices causes errors on other devices, see issue #2668 .

I think there might be a problem with the programming of artiq_coremgmt reboot or of the communication following it.

I will check the UART logs now.

occheung · 2025-02-04T09:00:23Z

@dimitsev

What likely happened is a CRC error due to corrupted image.
I tried remote coremgmt writing a large lipsum text (100 paragraphs) as a file to a satellite built using your JSON in the forum.
I do find discrepancy on the lipsum file on the satellite device to the original.

dimitsev · 2025-02-04T09:12:13Z

@occheung Thank you so much for looking into it!

If I understand you correctly, then the SFP connection between master and satellites implements the error detection method cyclic redundancy check, but it does react with appropriate log messages to CRC errors and it does not implement any kind of error correction scheme. When a transmission error happens, it simply happens and one has to try transmitting again.

Did I understand you correctly?

P.S. How do I write arbitrary files to the SD card's storage using artiq_coremgmt?

occheung · 2025-02-04T09:21:28Z

Auxillary transmission has a CRC check per packet (< 1KB in size). It does not look triggered. We will investigate this further.

You may use artiq_coremgmt config write -f <key> <file> to write file. -s argument is supported.
Meanwhile we do find CRC passes and able to reflash in the same variant without WRPLL, perhaps related.

dimitsev · 2025-02-04T09:58:46Z

@occheung Great! I am re-compiling with "enable_wrpll" : false now.

Meanwhile we do find CRC passes

Does that mean that no errors occurred or that errors are corrected? Could you please clarify if there is an error correction scheme in place?

dimitsev · 2025-02-04T13:38:18Z

I set "enable_wrpll" : false in our 3 system description files and re-compiled the 3 firmwares based on artiq-zynq commit 63157588bb8ed214bc9f24ade2d09b4ddefada03 from 15 January 2025.
Then, I removed the SD cards from our 3 Kasli SoCs, put them into a computer, replaced boot.bin by the new ones and put the SD cards back into the Kasli SoCs.
Then, I changed "variant" in all 3 system description files to "dio_new", "ao_new" and "dds2_new" respectively and re-compiled the exact same firmware again. I hope that this changed the "variant" field in the actual boot.bin files?
Finally, I tried to flash these firmwares with the new "variant"-values to our 3 Kasli SoCs via Ethernet.

No errors were thrown (at least not in the beginning), but the variant names in the logs did not change:

~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000066s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005247s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020333s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171031s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.179885s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189086s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.579770s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.821356s]  INFO(libboard_artiq::si5324):   ...locked
[     7.876985s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.888956s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.926323s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.953866s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.964118s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.976935s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.988538s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     8.027523s]  INFO(runtime::comms): no idle kernel found
[    12.988080s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    13.199974s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 26 packets
[    13.233256s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[    13.240985s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[    13.447974s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 1 packets
[    13.481170s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[    13.488896s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[    13.495705s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[    13.502393s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[    13.509276s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[    13.515960s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[    24.824376s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_ao_kasli_soc_v1.1_artiq_zynq_master_6315758
[     0.020946s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534694s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     3.168833s]  INFO(libboard_artiq::si5324):   ...locked
[     3.274045s]  INFO(satman): Switching SYS clocks...
[     3.328909s]  INFO(satman): SYS CLK switched successfully
[     3.354526s]  INFO(satman): SED spreading disabled by default
[     8.016445s]  INFO(satman): uplink is up, switching to recovered clock
[     8.062459s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     9.872650s]  INFO(libboard_artiq::si5324):   ...locked
[    12.926362s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 140, width: 441 (354deg)
[    13.253797s]  INFO(satman): TSC loaded from uplink
[    13.266820s]  INFO(satman): rank: 1
[    13.270289s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.005951s]  INFO(satman): gateware ident 9+unknown;crate_mqva_dds2_kasli_soc_v1.1_artiq_zynq_master_63157
[     0.020932s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534756s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.476197s]  INFO(libboard_artiq::si5324):   ...locked
[     2.581329s]  INFO(satman): Switching SYS clocks...
[     2.636193s]  INFO(satman): SYS CLK switched successfully
[     2.661519s]  INFO(satman): SED spreading disabled by default
[     8.166696s]  INFO(satman): uplink is up, switching to recovered clock
[     8.212716s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    10.000691s]  INFO(libboard_artiq::si5324):   ...locked
[    13.076421s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 144, width: 441 (354deg)
[    13.710264s]  INFO(satman): TSC loaded from uplink
[    13.723292s]  INFO(satman): rank: 1
[    13.726762s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 flash dds2_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 flash ao_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 flash dio_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005169s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020244s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171450s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.180311s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189511s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.580199s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.568880s]  INFO(libboard_artiq::si5324):   ...locked
[     7.624985s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.636956s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.674704s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.701987s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.712242s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.725181s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.737235s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.776353s]  INFO(runtime::comms): no idle kernel found
[    12.737078s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    14.506972s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 34 packets
[    14.540234s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[    14.547959s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[    14.754973s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 1 packets
[    14.788092s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[    14.795740s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[    14.802525s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[    14.809210s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[    14.816093s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[    14.822777s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[    48.948560s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_ao_kasli_soc_v1.1_artiq_zynq_master_6315758
[     0.020947s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534789s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.498207s]  INFO(libboard_artiq::si5324):   ...locked
[     2.603419s]  INFO(satman): Switching SYS clocks...
[     2.658283s]  INFO(satman): SYS CLK switched successfully
[     2.683522s]  INFO(satman): SED spreading disabled by default
[     2.701727s]  INFO(satman): uplink is up, switching to recovered clock
[     2.747718s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     4.513466s]  INFO(libboard_artiq::si5324):   ...locked
[     8.767220s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 360, width: 439 (352deg)
[     9.094250s]  INFO(satman): TSC loaded from uplink
[     9.107347s]  INFO(satman): rank: 1
[     9.110824s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[    19.586438s]  INFO(satman): uplink is down, switching to local oscillator clock
[    19.633161s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    21.432418s]  INFO(libboard_artiq::si5324):   ...locked
[    21.438749s] ERROR(satman): received packet of an unknown type
[    28.292275s]  INFO(satman): uplink is up, switching to recovered clock
[    28.338291s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    30.137449s]  INFO(libboard_artiq::si5324):   ...locked
[    33.482959s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 195, width: 439 (352deg)
[    33.810152s]  INFO(satman): TSC loaded from uplink
[    33.823189s]  INFO(satman): rank: 1
[    33.826586s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[    54.547162s]  INFO(satman): uplink is down, switching to local oscillator clock
[    54.593959s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    56.404393s]  INFO(libboard_artiq::si5324):   ...locked
[    56.410718s] ERROR(satman): received packet of an unknown type
[    63.223153s]  INFO(satman): uplink is up, switching to recovered clock
[    63.269088s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    65.123599s]  INFO(libboard_artiq::si5324):   ...locked
[    69.746254s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 426, width: 440 (353deg)
[    70.095054s]  INFO(satman): TSC loaded from uplink
[    70.108065s]  INFO(satman): rank: 1
[    70.111538s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_dds2_kasli_soc_v1.1_artiq_zynq_master_63157
[     0.020949s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534786s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.487267s]  INFO(libboard_artiq::si5324):   ...locked
[     2.592473s]  INFO(satman): Switching SYS clocks...
[     2.647337s]  INFO(satman): SYS CLK switched successfully
[     2.672819s]  INFO(satman): SED spreading disabled by default
[     2.678559s]  INFO(satman): uplink is up, switching to recovered clock
[     2.724567s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     4.501386s]  INFO(libboard_artiq::si5324):   ...locked
[     7.599083s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 148, width: 441 (354deg)
[     7.926982s]  INFO(satman): TSC loaded from uplink
[     7.940056s]  INFO(satman): rank: 1
[     7.943529s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[   138.444885s]  INFO(satman): uplink is down, switching to local oscillator clock
[   138.491683s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   140.302251s]  INFO(libboard_artiq::si5324):   ...locked
[   140.308503s] ERROR(satman): received packet of an unknown type
[   147.135472s]  INFO(satman): uplink is up, switching to recovered clock
[   147.181423s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   149.058245s]  INFO(libboard_artiq::si5324):   ...locked
[   152.508418s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 213, width: 440 (353deg)
[   152.937294s]  INFO(satman): TSC loaded from uplink
[   152.950339s]  INFO(satman): rank: 1
[   152.953815s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[   173.447188s]  INFO(satman): uplink is down, switching to local oscillator clock
[   173.494021s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   175.293776s]  INFO(libboard_artiq::si5324):   ...locked
[   175.300098s] ERROR(satman): received packet of an unknown type
[   182.512983s]  INFO(satman): uplink is up, switching to recovered clock
[   182.559015s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   184.479938s]  INFO(libboard_artiq::si5324):   ...locked
[   187.280482s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 95, width: 440 (353deg)
[   189.222224s]  INFO(satman): TSC loaded from uplink
[   189.235177s]  INFO(satman): rank: 1
[   189.238645s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 flash dio_new
~ ❯ artiq_coremgmt -D 192.168.1.30 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005256s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020341s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171488s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.180345s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189546s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.580232s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.568851s]  INFO(libboard_artiq::si5324):   ...locked
[     7.624986s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.636882s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.674776s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.702021s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.712267s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.725232s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.737247s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.767647s]  INFO(runtime::comms): no idle kernel found
[    12.123200s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~ ❯

dimitsev · 2025-02-04T14:18:50Z

I tried to re-compile our firmware with "enable_wrpll" : false based on an artiq-zynq commit from December 2024. The exact same compilation had succeeded on 9 January 2025. Now, it crashes with: https://git.m-labs.hk/M-Labs/artiq-zynq/issues/356

My goal here was to have a clearly different firmware version with "enable_wrpll" : false to test with artiq_coremgmt -D [IP-address] -s [dest] flash [directory]. If the variant name in artiq_coremgmt log does not change then, we know something is broken also for "enable_wrpll" : false.

It seems that I will have to wait for a fix to https://git.m-labs.hk/M-Labs/artiq-zynq/issues/356 before I can perform the test.

@occheung Are you absolutely sure that artiq_coremgmt -D [IP-address] -s [dest] flash [directory] has an effect on satellites? It does not only complete without errors, but it actually changes the firmware?

dimitsev · 2025-02-04T14:33:33Z

@occheung Btw, I noticed something when "enable_wrpll" : false:

Calling artiq_coremgmt -D 192.168.1.30 -s 2 reboot too soon after artiq_coremgmt -D 192.168.1.30 -s 2 flash dds2_new causes a crash with OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent).
Repeating the reboot command a few seconds later succeeds.
The reboot command even succeeds the first time, if one waits for at least 10 seconds to pass after the flash command.

Possible explanation:

If the SFP connection is slow for any reason (e.g. dirty fiber tips + error correction scheme or bad programming of smth or or or ...), then the transmission of a 5 Megabyte boot.bin file from master to satellite might take several seconds.
While the transmission is in progress, a log command might crash because the responsive will obviously not be a log.

Does this make sense?

occheung · 2025-02-05T02:11:01Z

Does that mean that no errors occurred or that errors are corrected? Could you please clarify if there is an error correction scheme in place?

Yes, there are no error corrections.

We only have error detection. DRTIO itself CRC check when a broken-down package is transmitted to remote.
When you use DRTIO to flash coredevice, it has an additional CRC check right before flashing. This is what appears to tripped the red ERR LED in your case (with WRPLL).

Are you absolutely sure that artiq_coremgmt -D [IP-address] -s [dest] flash [directory] has an effect on satellites?

Yes. The behavior of remote coremgmt flash is:

Transfer the binary via DRTIO. Error messages should be emitted to log if failed.
Turn off DRTIO for that coredevice.
Perform CRC check on the binary and rewrite the firmware if passed, turn on ERR LED if any failed.
The reflashed coredevice reboots itself.

Since there appears to have no errors nor panics from your log and descriptions, flash should indeed updated your firmware. As an additional note, there is an auto reboot step. The reboots you entered after flash should have no meaningful effects.

I hope that this changed the "variant" field in the actual boot.bin files?

Yes it will be reflected on the log.
I have tested your case by flashing another variant to a Kasli-SoC running your non-WRPLL satellite. I can observe the variant changed to the new one.

So it looks like your new issue is due to the new binary not being built? (Hence, flashing the old one. So the variant is the same.)

The reboot command even succeeds the first time, if one waits for at least 10 seconds to pass after the flash command.

Your explanations have already spotted a few sources of latency. There are 2 more sources I would like to point out:

Writing the SD card. SDIO of Zynq device runs on around a few MHz, and your binary is around a few MB, so it also takes a few seconds.
Reboot. You simply needs a few seconds to go from starting the reboot and re-establishing the DRTIO link.

Your log already shows you a rough estimate how much time it takes to get from the bootloader to establishing DRTIO, which is already 12 seconds.

For RISC-V Kasli users: Firmware is written to the flash instead. Writing the firmware to the flash likely takes way more than 10 seconds.

While the transmission is in progress, a log command might crash because the responsive will obviously not be a log.

Indeed. In fact, there won't even be a DRTIO link. We decided to take down the link during the reflashing stage because the satellite will not be able to respond to additional DRTIO messages, and the device is supposed to be rebooted anyway.

If you want the coredevice log during reflash, you can get it via UART. It looks roughly like such when successful.

[  1614.632433s]  INFO(satman::mgmt): CRC passed. Writing boot image to SD card...
[  1616.148990s]  INFO(satman): reboot imminent


                     __________   __
                    / ___/__  /  / /
                    \__ \  / /  / /
                   ___/ / / /__/ /___
                  /____/ /____/_____/

                 (C) 2020-2022 M-Labs

[     0.019988s]  INFO(szl): Simple Zynq Loader starting...
...

Then you will see the usual expected UART log from (re)booting the device.

occheung · 2025-02-06T02:39:14Z

Auxillary transmission has a CRC check per packet (< 1KB in size). It does not look triggered.

Some packets that passed packet CRC corrupted as soon as it de-serialized into Packet enums.

Seems only reproducible on Zynq satellites with WRPLL. No such issue observed on RISC-V WRPLL satellites.

dimitsev changed the title ~~Flashing satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent)~~ Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and corrupts its firmware Feb 3, 2025

dimitsev changed the title ~~Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and corrupts its firmware~~ Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and requires power cycle Feb 3, 2025

Spaqin assigned occheung Feb 3, 2025

dimitsev mentioned this issue Feb 4, 2025

artiq_coremgmt -s X reboot leads to ERROR : received packet of an unknown type #2668

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flashing a satellite Kasli SoC crashes with `Reply.Error (expected Reply.RebootImminent)` and requires power cycle #2667

Flashing a satellite Kasli SoC crashes with `Reply.Error (expected Reply.RebootImminent)` and requires power cycle #2667

dimitsev commented Feb 3, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 5, 2025 •

edited

Loading

occheung commented Feb 6, 2025

Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and requires power cycle #2667

Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and requires power cycle #2667

Comments

dimitsev commented Feb 3, 2025 • edited Loading

Bug Report

Expected Behavior

System info

Steps to reproduce

Further details from afterwards

occheung commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

occheung commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

occheung commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025

dimitsev commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

dimitsev commented Feb 4, 2025 • edited Loading

occheung commented Feb 5, 2025 • edited Loading

occheung commented Feb 6, 2025

Flashing a satellite Kasli SoC crashes with `Reply.Error (expected Reply.RebootImminent)` and requires power cycle #2667

Flashing a satellite Kasli SoC crashes with `Reply.Error (expected Reply.RebootImminent)` and requires power cycle #2667

dimitsev commented Feb 3, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

dimitsev commented Feb 4, 2025 •

edited

Loading

occheung commented Feb 5, 2025 •

edited

Loading