Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and requires power cycle #2667

Open
dimitsev opened this issue Feb 3, 2025 · 13 comments
Assignees

Comments

@dimitsev
Copy link

dimitsev commented Feb 3, 2025

Bug Report

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

crashed with

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 196, in main
    mgmt.flash(bins)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 221, in flash
    self._read_expect(Reply.RebootImminent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

Afterwards, all commands were crashing on that satellite, including artiq_coremgmt -D 192.168.1.30 -s 1 log. Everything worked again after a power-cycle but the log command revealed that the variant had not been upgraded, i.e. the flash had failed.

Expected Behavior

Artiq beta-manual: Writing the flash clearly states

artiq_coremgmt [-D IP_address] -s <destination_number> flash <afws_directory>

so I would expect

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

to succeed, provided that . contains a boot.bin file.

System info

Steps to reproduce

  • We compiled artiq-zynq commit 63157588bb8ed214bc9f24ade2d09b4ddefada03 from 15 January 2025 into new boot.bin files for all 3 Kasli SoCs.
  • Then, we entered the directory of one of the new boot.bin files and ran
artiq_coremgmt -D 192.168.1.30 -s 1 flash .

to flash the satellite at -s 1, which is connected to the master's SFP0.

Further details from afterwards

I tried to flash the master. These commands all succeeded and the log revealed the new variant name:

artiq_coremgmt -D 192.168.1.30 flash .
artiq_coremgmt -D 192.168.1.30 reboot
artiq_coremgmt -D 192.168.1.30 log

Afterwards, satellite 2 (untouched) still reacted correctly to artiq_coremgmt -D 192.168.1.30 -s 2 log but satellite 1 (which I had tried to flash) crashed artiq_coremgmt -D 192.168.1.30 -s 1 log with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)

I tried to reboot satellite 1 (which I had tried to flash) with artiq_coremgmt -D 192.168.1.30 -s 1 reboot but it crashed with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 199, in main
    mgmt.reboot()
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 199, in reboot
    self._read_expect(Reply.RebootImminent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

I tried satellite 1's log again, but artiq_coremgmt -D 192.168.1.30 -s 1 log still crashed with:

Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)

After a power cycle, satellite 1 worked again but the log command revealed the old variant name, not the new, desired one.

@dimitsev dimitsev changed the title Flashing satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and corrupts its firmware Feb 3, 2025
@dimitsev dimitsev changed the title Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and corrupts its firmware Flashing a satellite Kasli SoC crashes with Reply.Error (expected Reply.RebootImminent) and requires power cycle Feb 3, 2025
@occheung
Copy link
Contributor

occheung commented Feb 4, 2025

OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent)

If the binary is corrupted/incorrect, it should still give a RebootImminent message (with error LED after failing CRC). And,

satellite 1 (which I had tried to flash) crashed artiq_coremgmt -D 192.168.1.30 -s 1 log

It looks me that only satellite 1 does not recognize any remote coremgmt commands, as coremgmt commands report correctly for satellite 2. Could you check if flashing works for satellite 2?

It is also possible that the master did not acquire an DRTIO link with satellite 1, please provide us UART logs for all coredevices to check.

artiq_coremgmt -D 192.168.1.30 -s 1 flash .

This works for me when in the same setting, and an expected ValueError when both RISC-V and Zynq binaries are detected:
ValueError: both risc-v and zynq binaries were found, please clean up your build directory.

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

@occheung

If the binary is corrupted/incorrect, it should still give a RebootImminent message (with error LED after failing CRC).

The ERR LED on the satellite Kasli SoC's front panel indeed turns on (glowing red) as soon as I call artiq_coremgmt -D 192.168.1.30 -s 1 flash .. The ERR LED only turns off again after power cycle.

It looks me that only satellite 1 does not recognize any remote coremgmt commands, as coremgmt commands report correctly for satellite 2. Could you check if flashing works for satellite 2?

  • I have obviously tried artiq_coremgmt -D 192.168.1.30 -s 2 flash . as well (in the directory with the appropriate binary). I have tried the flash command on both satellites multiple times, always after one or several power cycles and after verifying that the log command was working and that the start-up log was free from errors.
  • For both satellites, the flash command either crashes with OSError: Incorrect reply from device: Reply.Error (expected Reply.RebootImminent) or it completes without error. Which of these 2 alternatives happens is non-deterministic.
  • But the "flashed" satellite's ERR LED always turns on immediately after the flash command and every single command after that (such as artiq_coremgmt -D 192.168.1.30 -s 1 log or artiq_coremgmt -D 192.168.1.30 -s 2 log) always crashes (e.g. with OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)).

It is also possible that the master did not acquire an DRTIO link with satellite 1, please provide us UART logs for all coredevices to check.

  • If this was a non-deterministic problem with the DRTIO link, I would expect the log commands to also fail sometimes non-deterministically. However, they have never once failed on either satellite after 1-2 power cycles, i.e. after they start working again. And I called them multiple times in succession, just to check.

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

After power cycle:

~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005248s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020339s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.191967s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.200821s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.209944s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.405688s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.461999s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     0.473963s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.511293s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     0.538850s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     0.549098s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     0.561875s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     0.573243s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[     0.603360s]  INFO(runtime::comms): no idle kernel found
[     0.979998s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 2 packets
[     1.013117s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[     1.020848s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[     1.027478s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[     1.034216s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[     4.256994s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     4.463992s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 1 packets
[     4.497206s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[     4.504959s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[     4.511649s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[     5.073092s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    46.995288s]  INFO(runtime::mgmt): received connection
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;kasli_soc_artiq_master_20250109_drtio_satellite_wrpll_
[     0.020948s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.361131s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.466864s]  INFO(satman): Switching SYS clocks...
[     0.521727s]  INFO(satman): SYS CLK switched successfully
[     0.714138s]  INFO(libboard_artiq::si549): Helper Si549 started
[     0.740633s]  INFO(satman): SED spreading disabled by default
[     0.771977s]  INFO(satman): uplink is up, switching to recovered clock
[     0.912916s]  INFO(libboard_artiq::si549::wrpll): WRPLL interrupt enabled
[     0.939693s]  INFO(libboard_artiq::si549::wrpll): testing the skew of SYS CLK...
[     0.952077s]  INFO(libboard_artiq::si549::wrpll): the skew of SYS CLK met the timing constraint
[     4.511209s]  INFO(satman): TSC loaded from uplink
[     4.524247s]  INFO(satman): rank: 1
[     4.527723s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;kasli_soc_artiq_master_20250109_drtio_satellite_wrpll_
[     0.020948s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.360959s]  INFO(libboard_artiq::si549): Main Si549 started
[     0.466690s]  INFO(satman): Switching SYS clocks...
[     0.521553s]  INFO(satman): SYS CLK switched successfully
[     0.714071s]  INFO(libboard_artiq::si549): Helper Si549 started
[     0.739640s]  INFO(satman): SED spreading disabled by default
[     0.825818s]  INFO(satman): uplink is up, switching to recovered clock
[     0.966749s]  INFO(libboard_artiq::si549::wrpll): WRPLL interrupt enabled
[     0.993526s]  INFO(libboard_artiq::si549::wrpll): testing the skew of SYS CLK...
[     1.005912s]  INFO(libboard_artiq::si549::wrpll): the skew of SYS CLK met the timing constraint
[     1.266050s]  INFO(satman): TSC loaded from uplink
[     1.279000s]  INFO(satman): rank: 1
[     1.282472s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~/workspace ❯ 

Attempting to flash satellite 2:

  • cd to the appropriate directory.
  • artiq_coremgmt -D 192.168.1.30 -s 2 flash . completes without error.

Afterwards:

~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[    62.327674s]  INFO(runtime::mgmt): received connection
[    65.090870s]  INFO(runtime::mgmt): received connection
[    77.640481s]  INFO(runtime::mgmt): received connection
[    79.344857s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link is down
[    79.351150s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is down
[    87.439046s]  INFO(runtime::mgmt): received connection
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
~/workspace ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~/workspace ❯ 

I will check the UART logs next.

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

@occheung

Actually, first I tested rebooting all 3 Kasli SoCs, one after the other, and just rebooting any of the devices causes errors on other devices, see issue #2668 .

I think there might be a problem with the programming of artiq_coremgmt reboot or of the communication following it.

I will check the UART logs now.

@occheung
Copy link
Contributor

occheung commented Feb 4, 2025

@dimitsev

What likely happened is a CRC error due to corrupted image.
I tried remote coremgmt writing a large lipsum text (100 paragraphs) as a file to a satellite built using your JSON in the forum.
I do find discrepancy on the lipsum file on the satellite device to the original.

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

@occheung Thank you so much for looking into it!

If I understand you correctly, then the SFP connection between master and satellites implements the error detection method cyclic redundancy check, but it does react with appropriate log messages to CRC errors and it does not implement any kind of error correction scheme. When a transmission error happens, it simply happens and one has to try transmitting again.

Did I understand you correctly?

P.S. How do I write arbitrary files to the SD card's storage using artiq_coremgmt?

@occheung
Copy link
Contributor

occheung commented Feb 4, 2025

Auxillary transmission has a CRC check per packet (< 1KB in size). It does not look triggered. We will investigate this further.

You may use artiq_coremgmt config write -f <key> <file> to write file. -s argument is supported.
Meanwhile we do find CRC passes and able to reflash in the same variant without WRPLL, perhaps related.

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

@occheung Great! I am re-compiling with "enable_wrpll" : false now.

Meanwhile we do find CRC passes

Does that mean that no errors occurred or that errors are corrected? Could you please clarify if there is an error correction scheme in place?

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

  • I set "enable_wrpll" : false in our 3 system description files and re-compiled the 3 firmwares based on artiq-zynq commit 63157588bb8ed214bc9f24ade2d09b4ddefada03 from 15 January 2025.
  • Then, I removed the SD cards from our 3 Kasli SoCs, put them into a computer, replaced boot.bin by the new ones and put the SD cards back into the Kasli SoCs.
  • Then, I changed "variant" in all 3 system description files to "dio_new", "ao_new" and "dds2_new" respectively and re-compiled the exact same firmware again. I hope that this changed the "variant" field in the actual boot.bin files?
  • Finally, I tried to flash these firmwares with the new "variant"-values to our 3 Kasli SoCs via Ethernet.

No errors were thrown (at least not in the beginning), but the variant names in the logs did not change:

~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000066s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005247s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020333s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171031s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.179885s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189086s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.579770s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.821356s]  INFO(libboard_artiq::si5324):   ...locked
[     7.876985s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.888956s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.926323s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.953866s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.964118s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.976935s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.988538s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     8.027523s]  INFO(runtime::comms): no idle kernel found
[    12.988080s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    13.199974s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 26 packets
[    13.233256s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[    13.240985s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[    13.447974s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 1 packets
[    13.481170s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[    13.488896s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[    13.495705s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[    13.502393s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[    13.509276s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[    13.515960s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[    24.824376s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_ao_kasli_soc_v1.1_artiq_zynq_master_6315758
[     0.020946s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534694s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     3.168833s]  INFO(libboard_artiq::si5324):   ...locked
[     3.274045s]  INFO(satman): Switching SYS clocks...
[     3.328909s]  INFO(satman): SYS CLK switched successfully
[     3.354526s]  INFO(satman): SED spreading disabled by default
[     8.016445s]  INFO(satman): uplink is up, switching to recovered clock
[     8.062459s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     9.872650s]  INFO(libboard_artiq::si5324):   ...locked
[    12.926362s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 140, width: 441 (354deg)
[    13.253797s]  INFO(satman): TSC loaded from uplink
[    13.266820s]  INFO(satman): rank: 1
[    13.270289s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.005951s]  INFO(satman): gateware ident 9+unknown;crate_mqva_dds2_kasli_soc_v1.1_artiq_zynq_master_63157
[     0.020932s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534756s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.476197s]  INFO(libboard_artiq::si5324):   ...locked
[     2.581329s]  INFO(satman): Switching SYS clocks...
[     2.636193s]  INFO(satman): SYS CLK switched successfully
[     2.661519s]  INFO(satman): SED spreading disabled by default
[     8.166696s]  INFO(satman): uplink is up, switching to recovered clock
[     8.212716s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    10.000691s]  INFO(libboard_artiq::si5324):   ...locked
[    13.076421s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 144, width: 441 (354deg)
[    13.710264s]  INFO(satman): TSC loaded from uplink
[    13.723292s]  INFO(satman): rank: 1
[    13.726762s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log set_level TRACE
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 flash dds2_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 flash ao_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 flash dio_new
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005169s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020244s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171450s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.180311s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189511s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.580199s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.568880s]  INFO(libboard_artiq::si5324):   ...locked
[     7.624985s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.636956s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.674704s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.701987s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.712242s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.725181s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.737235s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.776353s]  INFO(runtime::comms): no idle kernel found
[    12.737078s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    14.506972s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] remote replied after 34 packets
[    14.540234s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link initialization completed
[    14.547959s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link RX became up, pinging
[    14.754973s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] remote replied after 1 packets
[    14.788092s]  INFO(runtime::rtio_mgt::drtio): [LINK#1] link initialization completed
[    14.795740s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[    14.802525s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] destination is up
[    14.809210s]  INFO(runtime::rtio_mgt::drtio): [DEST#1] buffer space is 128
[    14.816093s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] destination is up
[    14.822777s]  INFO(runtime::rtio_mgt::drtio): [DEST#2] buffer space is 128
[    48.948560s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_ao_kasli_soc_v1.1_artiq_zynq_master_6315758
[     0.020947s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534789s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.498207s]  INFO(libboard_artiq::si5324):   ...locked
[     2.603419s]  INFO(satman): Switching SYS clocks...
[     2.658283s]  INFO(satman): SYS CLK switched successfully
[     2.683522s]  INFO(satman): SED spreading disabled by default
[     2.701727s]  INFO(satman): uplink is up, switching to recovered clock
[     2.747718s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     4.513466s]  INFO(libboard_artiq::si5324):   ...locked
[     8.767220s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 360, width: 439 (352deg)
[     9.094250s]  INFO(satman): TSC loaded from uplink
[     9.107347s]  INFO(satman): rank: 1
[     9.110824s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[    19.586438s]  INFO(satman): uplink is down, switching to local oscillator clock
[    19.633161s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    21.432418s]  INFO(libboard_artiq::si5324):   ...locked
[    21.438749s] ERROR(satman): received packet of an unknown type
[    28.292275s]  INFO(satman): uplink is up, switching to recovered clock
[    28.338291s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    30.137449s]  INFO(libboard_artiq::si5324):   ...locked
[    33.482959s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 195, width: 439 (352deg)
[    33.810152s]  INFO(satman): TSC loaded from uplink
[    33.823189s]  INFO(satman): rank: 1
[    33.826586s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[    54.547162s]  INFO(satman): uplink is down, switching to local oscillator clock
[    54.593959s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    56.404393s]  INFO(libboard_artiq::si5324):   ...locked
[    56.410718s] ERROR(satman): received packet of an unknown type
[    63.223153s]  INFO(satman): uplink is up, switching to recovered clock
[    63.269088s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    65.123599s]  INFO(libboard_artiq::si5324):   ...locked
[    69.746254s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 426, width: 440 (353deg)
[    70.095054s]  INFO(satman): TSC loaded from uplink
[    70.108065s]  INFO(satman): rank: 1
[    70.111538s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.006029s]  INFO(satman): gateware ident 9+unknown;crate_mqva_dds2_kasli_soc_v1.1_artiq_zynq_master_63157
[     0.020949s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.534786s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.487267s]  INFO(libboard_artiq::si5324):   ...locked
[     2.592473s]  INFO(satman): Switching SYS clocks...
[     2.647337s]  INFO(satman): SYS CLK switched successfully
[     2.672819s]  INFO(satman): SED spreading disabled by default
[     2.678559s]  INFO(satman): uplink is up, switching to recovered clock
[     2.724567s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     4.501386s]  INFO(libboard_artiq::si5324):   ...locked
[     7.599083s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 148, width: 441 (354deg)
[     7.926982s]  INFO(satman): TSC loaded from uplink
[     7.940056s]  INFO(satman): rank: 1
[     7.943529s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[   138.444885s]  INFO(satman): uplink is down, switching to local oscillator clock
[   138.491683s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   140.302251s]  INFO(libboard_artiq::si5324):   ...locked
[   140.308503s] ERROR(satman): received packet of an unknown type
[   147.135472s]  INFO(satman): uplink is up, switching to recovered clock
[   147.181423s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   149.058245s]  INFO(libboard_artiq::si5324):   ...locked
[   152.508418s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 213, width: 440 (353deg)
[   152.937294s]  INFO(satman): TSC loaded from uplink
[   152.950339s]  INFO(satman): rank: 1
[   152.953815s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[   173.447188s]  INFO(satman): uplink is down, switching to local oscillator clock
[   173.494021s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   175.293776s]  INFO(libboard_artiq::si5324):   ...locked
[   175.300098s] ERROR(satman): received packet of an unknown type
[   182.512983s]  INFO(satman): uplink is up, switching to recovered clock
[   182.559015s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[   184.479938s]  INFO(libboard_artiq::si5324):   ...locked
[   187.280482s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 95, width: 440 (353deg)
[   189.222224s]  INFO(satman): TSC loaded from uplink
[   189.235177s]  INFO(satman): rank: 1
[   189.238645s]  INFO(satman): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 2 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log clear
~ ❯ artiq_coremgmt -D 192.168.1.30 flash dio_new
~ ❯ artiq_coremgmt -D 192.168.1.30 reboot
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 0 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005256s]  INFO(runtime): gateware ident: 9+unknown;crate_mqva_dio_kasli_soc_v1.1_artiq_zynq_master_631575
[     0.020341s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.171488s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.180345s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.189546s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.580232s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.568851s]  INFO(libboard_artiq::si5324):   ...locked
[     7.624986s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     7.636882s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.674776s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-0f-4e-6f IPv4=192.168.1.30 IPv6-LL=fe80::fe0f:e7ff:fe0f:4e6f IPv6: no configured address
[     7.702021s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.712267s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.725232s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     7.737247s]  INFO(runtime::rtio_mgt::drtio): [LINK#0] link RX became up, pinging
[     7.767647s]  INFO(runtime::comms): no idle kernel found
[    12.123200s]  INFO(runtime::mgmt): received connection
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~ ❯ artiq_coremgmt -D 192.168.1.30 -s 1 log
Traceback (most recent call last):
  File "/home/dimitry/.pyenv/versions/artiq-py311/bin/artiq_coremgmt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/frontend/artiq_coremgmt.py", line 144, in main
    print(mgmt.get_log(), end="")
          ^^^^^^^^^^^^^^
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 138, in get_log
    self._read_expect(Reply.LogContent)
  File "/opt/m1/workspace/artiq-original-untouched/artiq/coredevice/comm_mgmt.py", line 121, in _read_expect
    raise IOError("Incorrect reply from device: {} (expected {})".
OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent)
~ ❯ 

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

I tried to re-compile our firmware with "enable_wrpll" : false based on an artiq-zynq commit from December 2024. The exact same compilation had succeeded on 9 January 2025. Now, it crashes with: https://git.m-labs.hk/M-Labs/artiq-zynq/issues/356

My goal here was to have a clearly different firmware version with "enable_wrpll" : false to test with artiq_coremgmt -D [IP-address] -s [dest] flash [directory]. If the variant name in artiq_coremgmt log does not change then, we know something is broken also for "enable_wrpll" : false.

It seems that I will have to wait for a fix to https://git.m-labs.hk/M-Labs/artiq-zynq/issues/356 before I can perform the test.

@occheung Are you absolutely sure that artiq_coremgmt -D [IP-address] -s [dest] flash [directory] has an effect on satellites? It does not only complete without errors, but it actually changes the firmware?

@dimitsev
Copy link
Author

dimitsev commented Feb 4, 2025

@occheung Btw, I noticed something when "enable_wrpll" : false:

  • Calling artiq_coremgmt -D 192.168.1.30 -s 2 reboot too soon after artiq_coremgmt -D 192.168.1.30 -s 2 flash dds2_new causes a crash with OSError: Incorrect reply from device: Reply.Error (expected Reply.LogContent).
  • Repeating the reboot command a few seconds later succeeds.
  • The reboot command even succeeds the first time, if one waits for at least 10 seconds to pass after the flash command.

Possible explanation:

  • If the SFP connection is slow for any reason (e.g. dirty fiber tips + error correction scheme or bad programming of smth or or or ...), then the transmission of a 5 Megabyte boot.bin file from master to satellite might take several seconds.
  • While the transmission is in progress, a log command might crash because the responsive will obviously not be a log.

Does this make sense?

@occheung
Copy link
Contributor

occheung commented Feb 5, 2025

Does that mean that no errors occurred or that errors are corrected? Could you please clarify if there is an error correction scheme in place?

Yes, there are no error corrections.

We only have error detection. DRTIO itself CRC check when a broken-down package is transmitted to remote.
When you use DRTIO to flash coredevice, it has an additional CRC check right before flashing. This is what appears to tripped the red ERR LED in your case (with WRPLL).

Are you absolutely sure that artiq_coremgmt -D [IP-address] -s [dest] flash [directory] has an effect on satellites?

Yes. The behavior of remote coremgmt flash is:

  1. Transfer the binary via DRTIO. Error messages should be emitted to log if failed.
  2. Turn off DRTIO for that coredevice.
  3. Perform CRC check on the binary and rewrite the firmware if passed, turn on ERR LED if any failed.
  4. The reflashed coredevice reboots itself.

Since there appears to have no errors nor panics from your log and descriptions, flash should indeed updated your firmware. As an additional note, there is an auto reboot step. The reboots you entered after flash should have no meaningful effects.

I hope that this changed the "variant" field in the actual boot.bin files?

Yes it will be reflected on the log.
I have tested your case by flashing another variant to a Kasli-SoC running your non-WRPLL satellite. I can observe the variant changed to the new one.

So it looks like your new issue is due to the new binary not being built? (Hence, flashing the old one. So the variant is the same.)

The reboot command even succeeds the first time, if one waits for at least 10 seconds to pass after the flash command.

Your explanations have already spotted a few sources of latency. There are 2 more sources I would like to point out:

  1. Writing the SD card. SDIO of Zynq device runs on around a few MHz, and your binary is around a few MB, so it also takes a few seconds.
  2. Reboot. You simply needs a few seconds to go from starting the reboot and re-establishing the DRTIO link.

Your log already shows you a rough estimate how much time it takes to get from the bootloader to establishing DRTIO, which is already 12 seconds.

For RISC-V Kasli users: Firmware is written to the flash instead. Writing the firmware to the flash likely takes way more than 10 seconds.

While the transmission is in progress, a log command might crash because the responsive will obviously not be a log.

Indeed. In fact, there won't even be a DRTIO link. We decided to take down the link during the reflashing stage because the satellite will not be able to respond to additional DRTIO messages, and the device is supposed to be rebooted anyway.

If you want the coredevice log during reflash, you can get it via UART. It looks roughly like such when successful.

[  1614.632433s]  INFO(satman::mgmt): CRC passed. Writing boot image to SD card...
[  1616.148990s]  INFO(satman): reboot imminent


                     __________   __
                    / ___/__  /  / /
                    \__ \  / /  / /
                   ___/ / / /__/ /___
                  /____/ /____/_____/

                 (C) 2020-2022 M-Labs

[     0.019988s]  INFO(szl): Simple Zynq Loader starting...
...

Then you will see the usual expected UART log from (re)booting the device.

@occheung
Copy link
Contributor

occheung commented Feb 6, 2025

Auxillary transmission has a CRC check per packet (< 1KB in size). It does not look triggered.

Some packets that passed packet CRC corrupted as soon as it de-serialized into Packet enums.

Seems only reproducible on Zynq satellites with WRPLL. No such issue observed on RISC-V WRPLL satellites.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants