
Externals Survey

John Freeman edited this page Dec 31, 2024 · 43 revisions

Spack Packages

  • Current: what's in ext-v2.1, i.e. the externals used by our current nightlies
  • NFDT_DEV_24<MMDD>_A9: what's in the still-in-development ext-v2.2 and used by that particular test nightly
  • Preferred (no checksum): the version labeled as "Preferred" in the latest package.py
  • Latest from checksum: the latest version reported when you run spack checksum <package>
Package Current NFDT_DEV_241129_A9 NFDT_DEV_241214_A9 NFDT_DEV_241216_A9 NFDT_DEV_241230_A9 Preferred (no checksum) Latest from checksum
abseil-cpp 20240116.2 " " " " " 20240722.0
boost 1.77.0 1.85.0 " " " " 1_87_0_b1
cetlib 3.18.01 " " " " " "
cli11 2.3.2 " " " " " 2.4.2
cmake 3.26.3 " " " " 3.27.9 3.31.1
cppzmq 4.8.1 4.10.0 " " " " "
cyrus-sasl 2.1.27 X X X X X X
dpdk 22.11 " " " " 23.03 "
felix-software dunedaq-v4.2.0 " fddaq-v5.3.0 " " " "
fmt* 8.1.1 10.2.1 " " 8.1.1 10.2.1 11.0.2
folly* 2021.12.13.00 " 2024.12.02.00 " " 2021.05.24.00 2024.12.02.00
gcc 12.1.0 " 13.2.0 " " "
gdb 13.1 " 14.1 " " "
grpc 1.65.1 " " " " "
highfive 2.7.1 " " " " 2.9.0 3.0.0-beta1
intel-tbb 2020.3 " 2021.9.0 " " " 2022.0.0
krb5 1.19.2 X X X X X X
librdkafka 1.7.0 " " " " 2.2.0 2.6.1
msgpack-c 3.3.0 " " " " " 7.0.0
ninja 1.10.0 " 1.11.1? 1.10.0 "
nlohmann-json 3.9.0 3.11.2 " " " " 3.11.3
numactl 2.0.14 " " " " "
openssh 8.7p1 9.7p1 " " " "
openssl 1.1.1t " " " " 3.3.0
pistache* dunedaq-v2.8.0 " fddaq-v5.3.0 dunedaq-v2.8.0* fddaq-v5.3.0* " "
pkgconf 2.2.0 " " " " "
protobuf 4.24.4 " " " " " 29.0
pugixml 1.12.1 " " " " 1.13
py-moo 0.6.7 " " " " " "
py-pybind11 2.6.2 2.12.0 " " " " 2.13.6
python* 3.10.10 " " " " 3.11.7
qt 5.15.12 " " " " "
trace 3.17.14 " " " " "
uhal 2.8.1 " " " " " "

n.b. Pistache version dunedaq-v2.8.0 between NFDT_DEV_241129_A9 and NFDT_DEV_241216_A9 had a patch added so it would build in gcc 13.2.0. Perhaps a bit confusingly, for NFDT_DEV_241230_A9 this patched dunedaq-v2.8.0 got rechristened fddaq-v5.3.0. I made this decision in order to be consistent with my treatment of felix-software, where I'd already bumped the version due to a simple gcc 13.2.0 compatibility patch.

n.b. Python 3.11.7 was used for the not-shown-above test build NFDT_DEV_241213_A9, but caused an immediate failure of drunc; see later in this document for more.

n.b. Between NFDT_DEV_241216_A9 and NFDT_DEV_241230_A9, folly 2024.12.02.00 was rebuilt with the -mavx2 option removed and the FOLLY_F14_FORCE_FALLBACK preprocessor #define added.

n.b. fmt was reverted to its original version since there was a compatibility issue with dpdklibs, and its further use is now deprecated anyway

Python packages

Package            Current  NFDT_DEV_241129_A9  NFDT_DEV_241214_A9  NFDT_DEV_241216_A9  NFDT_DEV_241230_A9  Latest
anytree            2.8.0    2.12.1              "                   "                   "                   "
click              8.1.7    "                   "                   "                   "                   "
click-didyoumean   0.3.0    0.3.1               "                   "                   "                   "
click-shell        2.1      "                   "                   "                   "                   "
colorama           0.4.4    0.4.6               "                   "                   "                   "
deepdiff           6.3.1    8.0.1               "                   "                   "                   "
Flask              2.1.1    3.1.0               "                   "                   "                   "
Flask-Cors         3.0.10   5.0.0               "                   "                   "                   "
Flask-Caching      X        2.3.0               "                   "                   "                   "
Flask-HTTPAuth     4.6.0    4.8.0               "                   "                   "                   "
Flask-RESTful      0.3.9    0.3.10              "                   "                   "                   "
Flask-SQLAlchemy   X        3.1.1               "                   "                   "                   "
graphviz           0.16     0.20.3              "                   "                   "                   "
gunicorn           20.1.0   23.0.0              "                   "                   "                   "
h5py               3.7.0    3.12.1              "                   "                   "                   "
httpx              0.23.3   0.27.2              "                   "                   "                   0.28.0
kubernetes         23.6.0   31.0.0              "                   "                   "                   "
matplotlib         X        3.9.2               "                   "                   "                   3.9.3
numpy              1.24     2.1.3               "                   "                   "                   "
pandas             X        2.2.3               "                   "                   "                   "
pexpect            4.8.0    4.9.0               "                   "                   "                   "
psutil             5.9.0    6.1.0               "                   "                   "                   "
py                 1.10.0   1.11.0              "                   "                   "                   "
pytest             8.3.3    "                   "                   "                   "                   8.3.4
python-ipmi        0.5.1    0.5.7               "                   "                   "                   "
rsa                4.8      4.9                 "                   "                   "                   "
sh                 1.14.1   2.1.0               "                   "                   "                   "
textual            0.83.0   0.87.1              "                   "                   "                   0.88.1
transitions        0.8.10   0.9.2               "                   "                   "                   "
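
The page does not define the table notation, so the following is only a sketch under the assumption that a `"` cell means "same value as the cell to its left". Under that reading, a row can be expanded into concrete versions per column. The function name expand_dittos is made up for this illustration; the two rows are copied from the Python packages table above.

```python
DITTO = '"'


def expand_dittos(cells):
    """Replace each ditto mark with the nearest concrete value to its left."""
    expanded = []
    for cell in cells:
        if cell == DITTO:
            if not expanded:
                raise ValueError("a row cannot start with a ditto mark")
            expanded.append(expanded[-1])
        else:
            expanded.append(cell)
    return expanded


# Columns: Current, the four NFDT_DEV test nightlies, Latest.
httpx_row = expand_dittos(["0.23.3", "0.27.2", '"', '"', '"', "0.28.0"])
pytest_row = expand_dittos(["8.3.3", '"', '"', '"', '"', "8.3.4"])

print(httpx_row)
print(pytest_row)
```

Cells marked X are passed through unchanged by this sketch, since the page does not say how they should be interpreted.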

Notes on vendored package.pys

These notes cover the package.py files that have been vendored into daq-release over the years, as found in spack-repos/externals/packages at the head of develop of daq-release (d55d8cf3af4):

catch2:

Vendoring occurred in commit cf73df3e2 from July 3, 2024, and appears related to the update to Spack 0.22.0. It's unclear why this bump required vendoring. catch2 doesn't depend on anything.

cetlib-except depends on a pinned version of catch2

cetlib depends on catch2

hep-concurrency depends on catch2

cetmodules depends on a pinned version of catch2 (build-only)

cetlib, cetlib-except, cetmodules:

Vendored because we have no choice

Only fixed dependency is, as described above, the pinned version of catch2

cmake:

Vendored so cmake-findprotobuf.patch can be applied to CMake versions 3.23.1 and up

Having said that, there's considerable difference beyond that between what's in daq-release and what's in builtin

cpr: Can be dropped, no longer needed in DUNE DAQ

cyrus-sasl: Can be dropped, apparently superfluous dependency

dpdk: The vendored package.py literally dates from 2021

A commit from May 11, 2022: "JCF: Issue #163: add support for a Spack installation of dpdk"

Can this be dropped? Need to keep in mind things like commit 8eb8dd7d from earlier this year, where I deal with a libarchive dependency.

felix-software: Obviously has to be vendored. Need to modify so it works with gcc 13.2

fftw: Drop, was only used by dqm

folly: Needs to remain vendored since, incredibly, May 2021 is the latest version in builtin. Chesterton's fence? I was able to at least get it to December 2021. Furthermore, the December 2021 version doesn't build under gcc 13.1.0.

More details: whereas its dependency, glog, goes up to 0.7.0, it can only build against 0.6.0 and no later because of a complaint about how it includes headers. The December 2024 version also depends on fast-float, which isn't builtin in Spack, so I've vendored this as well.

grpc: Has to be vendored, builtin only goes up to 1.55.0 and the default build is C++11

hep-concurrency: Not even used. cetlib used to depend on this. Or, it depends on it, but only if there's a ~lite build. Are we ever going to have that?

highfive: builtin goes to 2.9.0 while vendored goes to 2.7.1. OTOH, Pengfei added a patch. Also note you needed to add +threadsafe to hdf5 dependency. Edit this to include the later versions.

lcov: needed for my work

librdkafka: vendored in March 2022 for unknown reasons (the classic merge proto-spack commit); openssl dependency added a year later (commit info is "add dependency of openssl"). builtin doesn't have the openssl dependency but does go up to 2.2.0; vendored only goes to 1.7.0.

libtorrent: vendored goes to 2.0.9; builtin to 0.13.8

libzmq: vendored goes to 4.3.4, builtin to 4.3.5. Added entirely in one commit back in August 2022. Builtin has a patch, "Fix static assertion failure with gcc-13", which is not in vendored. Candidate for removal?

msgpack-c: vendored goes to 3.3.0, builtin to 3.1.1. Obvious keep, the only question is what further versions spack checksum would give us

openssh: For externals v2.2 this will only be a dependency of a dependency of git, a dependency of go which is a build-only dependency of rclone. But its history (on Slack, etc.) needs revisiting. Chesterton's Fence.

openssl: You vendored this in the switch to spack-0.22.0 back in the summer but it's unclear why.

perl-timedate: needed for my lcov work

pistache: not in builtin, obviously needs to be vendored. Current version used doesn't build with gcc 13.2.0, however, because of missing headers (<cstdint>, IIRC)

protobuf: Vendored latest is 4.24.4, builtin latest is 3.25.3. Also some bespoke abseil-cpp version logic.

pugixml: Added by Pengfei in 2022; it may not have existed in builtin. It does now, though note that builtin has 1.3, 1.11.4 and vendored has 1.12.1, 1.12, 1.11.4. Whether or not to remove it from being vendored depends on whether spack spec picks out 1.3 or not.

py-anyconfig: Added by Pengfei in the original March 2022 commit; doesn't exist in builtin

py-jsonnet: Added by Pengfei in the original March 2022 commit; doesn't exist in builtin

py-fastjsonschema: Added by Pengfei in the original March 2022 commit; identical in builtin

py-sphinxcontrib-moderncmakedomain: Added by Pengfei in the original March 2022 commit; identical in builtin

rclone: The vendoring of this package appears to be related to the set of rcloneConfig.cmake-and-related files created for it. Pengfei, September 2023.

trace: Obviously needed

uhal: Not in builtin, obviously needed. Untouched since September 2022.

Notes on specific nightlies

NFDT_DEV_241129_A9

The minimal_system_quick_test.py works. However, there are two new messages which you don't see when you run using the regular nightly:

  1. WARNING: All log messages before absl::InitializeLog() is called are written to STDERR (really more informational than a warning, in my opinion)
  2. The snippet below is a subset of what you see; the "skipping fork() handlers" message actually appears for all controller and DAQ applications:
I0000 00:00:1734375559.316994 4109005 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted df-controller uid:                              
                    7500052e-7df4-4b65-a8fa-4d97662cb84f                                                                           
'df-controller' (7500052e-7df4-4b65-a8fa-4d97662cb84f) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "jofreema"                               
                    session: "minimal"                                                                                             
                    name: "dfo-01"                                                                                                 
                    tree_id: "0.0.0"                                                                                               
                                                                                                                                   
I0000 00:00:1734375559.333264 4109005 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
           INFO     ssh_process_manager.py:299      ssh-process-manager:    Booted dfo-01 uid: 08b9cc5b-4d3b-4690-b8b3-18191e3d2075
'dfo-01' (08b9cc5b-4d3b-4690-b8b3-18191e3d2075) process started
           INFO     ssh_process_manager.py:220      ssh-process-manager:    Booting user: "jofreema"                               
                    session: "minimal"                                                                                             
                    name: "df-01"                                                                                                  
                    tree_id: "0.0.1"                                                                                      

Despite the messages, integration test performance is very much comparable with the regular nightlies: https://github.com/DUNE-DAQ/daq-release/actions/runs/12088527284 . One thing, however, that's appeared in the output of listrev_test.py (but not in the tests from integrationtest) is SIGHUP messages (reproduced below). It's not clear this is actually a problem, since it always occurs after data taking is complete and the system has been wound down.

Problem(s) found in logfile /tmp/pytest-of-dunedaq/pytest-1998/run4/log_dunedaq_lr-session_local-connection-server.txt:
Error: 2-16 19:35:37 -0600] [4080069] [ERROR] Worker (pid:4080077) was sent SIGHUP!

NFDT_DEV_241213_A9

Here, Python 3.11.7 was used. However, running minimal_system_quick_test.py fails immediately during test collection, with a ValueError raised by the dataclasses module:

(dbt) [jofreema@np04-srv-019 /nfs/sw/work_dirs/jcfree/NFDT_DEV_241213_A9]$ pytest -s -v $DAQSYSTEMTEST_SHARE/integtest/minimal_system_quick_test.py
======================================================= test session starts =======================================================
platform linux -- Python 3.11.7, pytest-8.3.3, pluggy-1.5.0 -- /nfs/sw/work_dirs/jcfree/NFDT_DEV_241213_A9/.venv/bin/python
cachedir: .pytest_cache
rootdir: /cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0
configfile: pytest.ini
plugins: anyio-4.6.2.post1, integrationtest-3.1.0
collected 0 items / 1 error                                                                                                       

============================================================= ERRORS ==============================================================
_ ERROR collecting opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py _
/cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py:6: in <module>
    import integrationtest.data_classes as data_classes
<frozen importlib._bootstrap>:1176: in _find_and_load
    ???
<frozen importlib._bootstrap>:1147: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:690: in _load_unlocked
    ???
.venv/lib/python3.11/site-packages/_pytest/assertion/rewrite.py:184: in exec_module
    exec(co, module.__dict__)
.venv/lib/python3.11/site-packages/integrationtest/data_classes.py:21: in <module>
    @dataclass
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:1230: in dataclass
    return wrap(cls)
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:1220: in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:958: in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:815: in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
E   ValueError: mutable default <class 'integrationtest.data_classes.DROMap_config'> for field dro_map_config is not allowed: use default_factory
===================================================== short test summary info =====================================================
ERROR ../../../../../cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py - ValueError: mutable default <class 'integrationtest.data_classes.DROMap_config'> for field dro_map_config is not allowed: use ...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================== 1 error in 4.93s =========================================================
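
The error message in the traceback matches a documented change in Python 3.11: dataclasses now rejects any unhashable object as a field default, whereas earlier versions only rejected list, dict and set. An instance of an ordinary dataclass is unhashable, so using one as a default starts failing at class-definition time. Below is a minimal sketch of the pattern and the default_factory fix the message suggests. DROMapConfig, BrokenConfig and FixedConfig are hypothetical stand-ins; the actual definitions in integrationtest/data_classes.py aren't shown on this page.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class DROMapConfig:
    """Hypothetical stand-in for integrationtest.data_classes.DROMap_config."""
    n_streams: int = 1


def instance_default_is_rejected() -> bool:
    """Return True if a dataclass instance is refused as a field default."""
    try:
        @dataclass
        class BrokenConfig:
            # A regular dataclass instance is unhashable, which Python 3.11+
            # treats as mutable and rejects as a default value.
            dro_map_config: DROMapConfig = DROMapConfig()
    except ValueError:
        return True
    return False


@dataclass
class FixedConfig:
    # default_factory builds a fresh DROMapConfig for every FixedConfig.
    dro_map_config: DROMapConfig = field(default_factory=DROMapConfig)


print(sys.version_info[:2], instance_default_is_rejected())
print(FixedConfig())
```

The default_factory form works on both Python 3.10 and 3.11, and it also avoids sharing one default object between instances.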

NFDT_DEV_241214_A9

The major difference wrt the NFDT_DEV_241213_A9 nightly is that Python gets reverted back to the "classic" 3.10.10 rather than the 3.11.7 given the failures from dataclasses. The main sticking point this time is that when running minimal_system_quick_test.py, while the configuration transition goes off without a hitch, the start transition reliably hangs for every process, whether it's a controller process or DAQ process (e.g., mlt, df-01, etc.):

Running transition 'start' on controller 'root-controller'
[16:03:03] ERROR    shell_utils.py:138      controller_driver:      Command 'execute_fsm_command' failed
                    on 'mlt' (response flag 'DRUNC_EXCEPTION_THROWN')

and if you looked at an individual process log, whereas you'd see something like this for the configuration transition:

2024-Dec-14 15:51:47,687 LOG [void dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&) at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:102] Sending POST request to daq.fnal.gov:59469/response
2024-Dec-14 15:51:47,781 LOG [dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&)::<lambda(Pistache::Http::Response)> at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:109] Response code = OK

you'd always see a hang after Sending POST request ... for the start transition:

2024-Dec-14 15:51:49,809 LOG [void dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&) at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:102] Sending POST request to daq.fnal.gov:59469/response
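
To spot this pattern across many process logs, one option is to look for "Sending POST request" lines that are never followed by a "Response code" line. The sketch below relies only on the two message fragments visible in the excerpts above; the function name and the assumption that responses arrive in request order are mine, and the sample lines are abbreviated versions of the ones shown.

```python
def unanswered_posts(log_lines):
    """Return the 'Sending POST request' lines with no later 'Response code' line."""
    pending = []
    for line in log_lines:
        if "Sending POST request to" in line:
            pending.append(line)
        elif "Response code =" in line and pending:
            # Assumption: responses come back in the order requests were sent.
            pending.pop(0)
    return pending


# Abbreviated sample: the conf request is answered, the start request is not.
sample_log = [
    "2024-Dec-14 15:51:47,687 LOG [...] Sending POST request to daq.fnal.gov:59469/response",
    "2024-Dec-14 15:51:47,781 LOG [...] Response code = OK",
    "2024-Dec-14 15:51:49,809 LOG [...] Sending POST request to daq.fnal.gov:59469/response",
]

for line in unanswered_posts(sample_log):
    print("no response seen for:", line)
```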

This is what led me to hypothesize that the problem had to do with the Pistache package which restcmd depends on and which I'd bumped up from its classic, October 2020 commit to a December 2024 commit between NFDT_DEV_241129_A9 and NFDT_DEV_241214_A9.

Now, another important point: whether it's because of the bump from gcc 12.1.0 to gcc 13.2.0 or the bump of folly from 2021.12.13.00 to 2024.12.02.00, it seems folly refuses to link to a translation unit if it can tell that the translation unit was built with a different set of flags (it uses F14LinkCheck for this). As a result, since fdreadoutlibs uses the -mavx2 option to access Advanced Vector Extensions 2, I needed to move this flag into daq-cmake's daq_setup_environment so that all code would build with it.

NFDT_DEV_241216_A9

As just mentioned, the major change here was to revert Pistache to its October 2020 commit, which we've been using for years. With one catch: I had to add a patch in order for it to build against gcc 13.2.0 rather than 12.1.0. It's the same change I've made in other packages, namely, adding <cXXXXXX> includes where needed (e.g., <cstdlib>).

With this change made, the hang went away. E.g., integration tests worked about as well as you'd expect, especially given the recent woes of daq.fnal.gov: https://github.com/DUNE-DAQ/daq-release/actions/runs/12364412829

Having said that, a couple of things to note:

  1. Specifically on protodune-daq01.fnal.gov, the integration tests (and builds, etc.) don't seem to work. Will be investigated but I wonder if it's related to (1) the jump from gcc 12.1.0 to gcc 13.2.0 and/or (2) the expanded instruction set from the -mavx2 option passed to gcc. On the np04 cluster and daq.fnal.gov, this doesn't show up.

NFDT_DEV_241230_A9

The main change here is that now, rather than building folly and all the DUNE DAQ packages with the -mavx2 option as was the case for NFDT_DEV_241214_A9 and NFDT_DEV_241216_A9, I build folly with the FOLLY_F14_FORCE_FALLBACK preprocessor macro #defined and set to 1. I also build the DUNE DAQ packages this way. This allows folly to link against translation units with different build flags.

Note that with this change, it's possible to build the full DUNE DAQ stack on protodune-daq01.fnal.gov, whereas it wasn't before (in fact, this was done to create NFDU_DEV_241230_A9, the same as NFDT_DEV_241230_A9 except for the choice of build machine). The integration tests continue to fail, but get further than under NFDT_DEV_241216_A9. Whereas with the NFDT_DEV_241216_A9 integration tests on protodune-daq01 nothing even booted (https://github.com/DUNE-DAQ/daq-release/actions/runs/12376009563), now with NFDU_DEV_241230_A9 things boot, but all the applications then crash (https://github.com/DUNE-DAQ/daq-release/actions/runs/12376009563).

Digging down, what I know so far is that if, in an NFDU_DEV_241230_A9-based work area on protodune-daq01, I try this directly from the command line:

daq_application -s minimal --name mlt -c rest://localhost:0 --configurationService oksconflibs:/tmp/pytest-of-dunedaq/pytest-18/config0/integtest-session-resolved.data.xml

then there's a crash with a complaint about an Illegal Instruction (this is an example of a daq_application call done during integration tests). Will investigate some more, though it's important to note that the integration tests run fine on np04 and daq.fnal.gov.
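
An Illegal Instruction crash generally means the CPU was asked to execute an instruction it doesn't implement, so one quick check is whether the host advertises a given instruction-set extension. The sketch below parses the "flags" lines of /proc/cpuinfo on Linux; avx2 is used only because -mavx2 is the candidate discussed earlier, not because it's known to be the culprit. The function names and the sample text are illustrative.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Collect the feature flags listed on the 'flags' lines of /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports(cpuinfo_text, flag="avx2"):
    """Return True if the given flag appears among the CPU feature flags."""
    return flag in cpu_flags(cpuinfo_text)


# Made-up sample of a CPU without AVX2, for illustration only.
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx\n"
print("avx2 in sample:", supports(sample))

# On a real Linux host, read the actual file if it is there.
cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print("avx2 on this host:", supports(cpuinfo.read_text()))
```

Running this on protodune-daq01 and on an np04 node would show whether the two differ in the extensions they advertise.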