Externals Survey
- `Current`: what's in ext-v2.1, i.e. the externals used by our current nightlies
- `NFDT_DEV_24<MMDD>_A9`: what's in the still-in-development ext-v2.2 and used by that particular test nightly
- `Preferred (no checksum)`: the version labeled as "Preferred" in the latest `package.py`
- `Latest from checksum`: the latest version if you run `spack checksum <package>`
Package | Current | NFDT_DEV_241129_A9 | NFDT_DEV_241214_A9 | NFDT_DEV_241216_A9 | NFDT_DEV_241230_A9 | Preferred (no checksum) | Latest from checksum |
---|---|---|---|---|---|---|---|
abseil-cpp | 20240116.2 | " | " | " | " | " | 20240722.0 |
boost | 1.77.0 | 1.85.0 | " | " | " | " | 1_87_0_b1 |
cetlib | 3.18.01 | " | " | " | " | " | " |
cli11 | 2.3.2 | " | " | " | " | " | 2.4.2 |
cmake | 3.26.3 | " | " | " | " | 3.27.9 | 3.31.1 |
cppzmq | 4.8.1 | 4.10.0 | " | " | " | " | " |
cyrus-sasl | 2.1.27 | X | X | X | X | X | X |
dpdk | 22.11 | " | " | " | " | 23.03 | " |
felix-software | dunedaq-v4.2.0 | " | fddaq-v5.3.0 | " | " | " | " |
fmt* | 8.1.1 | 10.2.1 | " | " | 8.1.1 | 10.2.1 | 11.0.2 |
folly* | 2021.12.13.00 | " | 2024.12.02.00 | " | " | 2021.05.24.00 | 2024.12.02.00 |
gcc | 12.1.0 | " | 13.2.0 | " | " | " | |
gdb | 13.1 | " | 14.1 | " | " | " | |
grpc | 1.65.1 | " | " | " | " | " | |
highfive | 2.7.1 | " | " | " | " | 2.9.0 | 3.0.0-beta1 |
intel-tbb | 2020.3 | " | 2021.9.0 | " | " | " | 2022.0.0 |
krb5 | 1.19.2 | X | X | X | X | X | X |
librdkafka | 1.7.0 | " | " | " | " | 2.2.0 | 2.6.1 |
msgpack-c | 3.3.0 | " | " | " | " | " | 7.0.0 |
ninja | 1.10.0 | " | 1.11.1? | 1.10.0 | " | | |
nlohmann-json | 3.9.0 | 3.11.2 | " | " | " | " | 3.11.3 |
numactl | 2.0.14 | " | " | " | " | " | |
openssh | 8.7p1 | 9.7p1 | " | " | " | " | |
openssl | 1.1.1t | " | " | " | " | 3.3.0 | |
pistache* | dunedaq-v2.8.0 | " | fddaq-v5.3.0 | dunedaq-v2.8.0* | fddaq-v5.3.0* | " | " |
pkgconf | 2.2.0 | " | " | " | " | " | |
protobuf | 4.24.4 | " | " | " | " | " | 29.0 |
pugixml | 1.12.1 | " | " | " | " | 1.13 | |
py-moo | 0.6.7 | " | " | " | " | " | " |
py-pybind11 | 2.6.2 | 2.12.0 | " | " | " | " | 2.13.6 |
python* | 3.10.10 | " | " | " | " | 3.11.7 | |
qt | 5.15.12 | " | " | " | " | " | |
trace | 3.17.14 | " | " | " | " | " | |
uhal | 2.8.1 | " | " | " | " | " | " |
n.b. Pistache version `dunedaq-v2.8.0` between `NFDT_DEV_241129_A9` and `NFDT_DEV_241216_A9` had a patch added so it would build in gcc 13.2.0. Perhaps a bit confusingly, for `NFDT_DEV_241230_A9` this patched `dunedaq-v2.8.0` got rechristened `fddaq-v5.3.0`. I made this decision in order to be consistent with my treatment of `felix-software`, where I'd already bumped the version due to a simple gcc 13.2.0 compatibility patch.
n.b. Python `3.11.7` was used for the not-shown-above test build `NFDT_DEV_241213_A9`, but caused an immediate failure of `drunc`; see later in this document for more.
n.b. Between `NFDT_DEV_241216_A9` and `NFDT_DEV_241230_A9`, folly `2024.12.02.00` was rebuilt so that the `-mavx2` option was removed and the `FOLLY_F14_FORCE_FALLBACK` preprocessor `#define` was added.
n.b. `fmt` was reverted to its original version since there was a compatibility issue with `dpdklibs`, and its further use is now deprecated anyway.
Package | Current | NFDT_DEV_241129_A9 | NFDT_DEV_241214_A9 | NFDT_DEV_241216_A9 | NFDT_DEV_241230_A9 | Latest |
---|---|---|---|---|---|---|
anytree | 2.8.0 | 2.12.1 | " | " | " | " |
click | 8.1.7 | " | " | " | " | " |
click-didyoumean | 0.3.0 | 0.3.1 | " | " | " | " |
click-shell | 2.1 | " | " | " | " | " |
colorama | 0.4.4 | 0.4.6 | " | " | " | " |
deepdiff | 6.3.1 | 8.0.1 | " | " | " | " |
Flask | 2.1.1 | 3.1.0 | " | " | " | " |
Flask-Cors | 3.0.10 | 5.0.0 | " | " | " | " |
Flask-Caching | X | 2.3.0 | " | " | " | " |
Flask-HTTPAuth | 4.6.0 | 4.8.0 | " | " | " | " |
Flask-RESTful | 0.3.9 | 0.3.10 | " | " | " | " |
Flask-SQLAlchemy | X | 3.1.1 | " | " | " | " |
graphviz | 0.16 | 0.20.3 | " | " | " | " |
gunicorn | 20.1.0 | 23.0.0 | " | " | " | " |
h5py | 3.7.0 | 3.12.1 | " | " | " | " |
httpx | 0.23.3 | 0.27.2 | " | " | " | 0.28.0 |
kubernetes | 23.6.0 | 31.0.0 | " | " | " | " |
matplotlib | X | 3.9.2 | " | " | " | 3.9.3 |
numpy | 1.24 | 2.1.3 | " | " | " | " |
pandas | X | 2.2.3 | " | " | " | " |
pexpect | 4.8.0 | 4.9.0 | " | " | " | " |
psutil | 5.9.0 | 6.1.0 | " | " | " | " |
py | 1.10.0 | 1.11.0 | " | " | " | " |
pytest | 8.3.3 | " | " | " | " | 8.3.4 |
python-ipmi | 0.5.1 | 0.5.7 | " | " | " | " |
rsa | 4.8 | 4.9 | " | " | " | " |
sh | 1.14.1 | 2.1.0 | " | " | " | " |
textual | 0.83.0 | 0.87.1 | " | " | " | 0.88.1 |
transitions | 0.8.10 | 0.9.2 | " | " | " | " |
Notes on the `package.py` files which have been vendored into daq-release over the years. Looking at the head of `develop` of daq-release, `d55d8cf3af4`, in `spack-repos/externals/packages`:
catch2:
Vendoring occurred in commit `cf73df3e2` from July 3 this year, and appears related to the update to Spack 0.22.0.
It's unclear why this bump required vendoring.
catch2 doesn't depend on anything.
cetlib-except depends on [email protected]
cetlib depends on catch2
hep-concurrency depends on catch2
cetmodules depends on [email protected] (build-only)
cetlib, cetlib-except, cetmodules:
Vendored because we have no choice
Only fixed dependency is, as described above, [email protected]
cmake:
Vendored so `cmake-findprotobuf.patch` can be applied to CMake versions 3.23.1 and up
Having said that, there's considerable difference beyond that between what's in daq-release and what's in builtin
cpr: Can be dropped, no longer needed in DUNE DAQ
cyrus-sasl: Can be dropped, apparently superfluous dependency
dpdk: The vendored package.py literally dates from 2021
A commit from May 11, 2022: "JCF: Issue #163: add support for a Spack installation of dpdk"
Can this be dropped? Need to keep in mind things like commit 8eb8dd7d
from earlier this year, where I deal with a libarchive dependency.
felix-software: Obviously has to be vendored. Need to modify so it works with gcc 13.2
fftw: Drop, was only used by dqm
folly: Needs to remain vendored since, incredibly, May 2021 is the latest version in builtin. Chesterton's fence? I was able to at least get it to December 2021. Furthermore, the December 2021 version doesn't build under gcc 13.1.0.
More details: whereas its dependency, glog, goes up to 0.7.0, it can only build against 0.6.0 and no later because of a complaint about how it includes headers. The December 2024 version also depends on `fast-float`, which isn't builtin in Spack, so I've vendored this as well.
grpc: Has to be vendored, builtin only goes up to 1.55.0 and the default build is C++11
hep-concurrency:
Not even used. cetlib used to depend on this. Or, it depends on it, but only if there's a `~lite` build. Are we ever going to have that?
highfive:
builtin goes to 2.9.0 while vendored goes to 2.7.1. OTOH, Pengfei added a patch. Also note you needed to add `+threadsafe` to the `hdf5` dependency. Edit this to include the later versions.
lcov: needed for my work
librdkafka:
vendored in March 2022 for unknown reasons (the classic `merge proto-spack` commit); openssl dependency added a year later (commit info is `add dependency of openssl`). builtin doesn't have the openssl dependency but does go up to 2.2.0; vendored only goes to 1.7.0.
libtorrent: vendored goes to 2.0.9; builtin to 0.13.8
libzmq:
vendored goes to 4.3.4, builtin to 4.3.5. Added entirely in one commit back in August 2022. Builtin has a patch `Fix static assertion failure with gcc-13`, not in vendored. Candidate for removal?
msgpack-c:
vendored goes to 3.3.0, builtin to 3.1.1. Obvious keep, the only question is what further versions `spack checksum` would give us
openssh:
For externals v2.2 this will only be a dependency of a dependency of `git`, a dependency of `go` which is a build-only dependency of rclone. But its history (on Slack, etc.) needs revisiting. Chesterton's Fence.
openssl: You vendored this in the switch to spack-0.22.0 back in the summer but it's unclear why.
perl-timedate: needed for my lcov work
pistache:
not a builtin, obviously needs to be vendored. Current version used doesn't build in gcc 13.2.0, however, because of missing headers (`<cstdint>`, IIRC)
protobuf: Vendored latest is 4.24.4, builtin latest is 3.25.3. Also some bespoke abseil-cpp version logic.
pugixml:
Added by Pengfei in 2022; it may not have existed in builtin. It does now, though note that builtin has 1.3, 1.11.4 and vendored has 1.12.1, 1.12, 1.11.4. Whether or not to remove it from being vendored depends on whether `spack spec` picks out 1.3 or not.
py-anyconfig: Added by Pengfei in the original March 2022 commit; doesn't exist in builtin
py-jsonnet: Added by Pengfei in the original March 2022 commit; doesn't exist in builtin
py-fastjsonschema: Added by Pengfei in the original March 2022 commit; identical in builtin
py-sphinxcontrib-moderncmakedomain: Added by Pengfei in the original March 2022 commit; identical in builtin
rclone:
The vendoring of this package appears to be related to the set of `rcloneConfig.cmake`-and-related files created for it. Pengfei, September 2023.
trace: Obviously needed
uhal: Not in builtin, obviously needed. Untouched since September 2022.
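
To make the vendored-versus-builtin comparisons in these notes easier to scan, here is a minimal Python sketch that compares the highest versions quoted above component by component. The helper names (`version_key`, `compare`) are hypothetical, the parsing only handles plain dotted numeric versions, and this is not Spack's own version-ordering logic; the version numbers are copied from the notes.

```python
def version_key(version):
    """Turn a dotted numeric version such as '1.11.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# (highest vendored version, highest builtin version), as quoted in the notes.
HIGHEST = {
    "highfive": ("2.7.1", "2.9.0"),
    "librdkafka": ("1.7.0", "2.2.0"),
    "libtorrent": ("2.0.9", "0.13.8"),
    "libzmq": ("4.3.4", "4.3.5"),
    "msgpack-c": ("3.3.0", "3.1.1"),
    "protobuf": ("4.24.4", "3.25.3"),
}


def compare(vendored, builtin):
    """Say which side offers the newer version."""
    v, b = version_key(vendored), version_key(builtin)
    if v == b:
        return "same"
    return "vendored ahead" if v > b else "builtin ahead"


summary = {name: compare(*pair) for name, pair in HIGHEST.items()}
for name, verdict in summary.items():
    print(f"{name:12s} {verdict}")

# Ordering matters for pugixml: compared as strings, "1.3" sorts after
# "1.11.4", but component-wise 1.11.4 is the newer builtin version.
newest_as_strings = max(["1.3", "1.11.4"])
newest_numerically = max(["1.3", "1.11.4"], key=version_key)
```

Packages where builtin is ahead (highfive, librdkafka, libzmq in this data) are the ones where dropping or refreshing the vendored `package.py` is worth a look.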
The `minimal_system_quick_test.py` works. However, there are two new messages which you don't see when you run using the regular nightly:
- `WARNING: All log messages before absl::InitializeLog() is called are written to STDERR` (really more informational than a warning, in my opinion)
- The snippet below is a subset of what you see; this `skipping fork() handlers` message actually appears for all controller and DAQ applications:
I0000 00:00:1734375559.316994 4109005 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
INFO ssh_process_manager.py:299 ssh-process-manager: Booted df-controller uid:
7500052e-7df4-4b65-a8fa-4d97662cb84f
'df-controller' (7500052e-7df4-4b65-a8fa-4d97662cb84f) process started
INFO ssh_process_manager.py:220 ssh-process-manager: Booting user: "jofreema"
session: "minimal"
name: "dfo-01"
tree_id: "0.0.0"
I0000 00:00:1734375559.333264 4109005 fork_posix.cc:75] Other threads are currently calling into gRPC, skipping fork() handlers
INFO ssh_process_manager.py:299 ssh-process-manager: Booted dfo-01 uid: 08b9cc5b-4d3b-4690-b8b3-18191e3d2075
'dfo-01' (08b9cc5b-4d3b-4690-b8b3-18191e3d2075) process started
INFO ssh_process_manager.py:220 ssh-process-manager: Booting user: "jofreema"
session: "minimal"
name: "df-01"
tree_id: "0.0.1"
Despite these messages, integration test performance is very much comparable with the regular nightlies: https://github.com/DUNE-DAQ/daq-release/actions/runs/12088527284 . One thing, however, that's appeared in the output of `listrev_test.py` (but not the tests from `integrationtest`) are SIGHUP messages (reproduced below). It's not clear this is actually a problem since this always occurs after the data taking is complete and the system has been wound down.
Problem(s) found in logfile /tmp/pytest-of-dunedaq/pytest-1998/run4/log_dunedaq_lr-session_local-connection-server.txt:
Error: 2-16 19:35:37 -0600] [4080069] [ERROR] Worker (pid:4080077) was sent SIGHUP!
For the `NFDT_DEV_241213_A9` test build, Python 3.11.7 was used. However, when running `minimal_system_quick_test.py` there's an immediate failure thanks to the `dataclasses` module:
(dbt) [jofreema@np04-srv-019 /nfs/sw/work_dirs/jcfree/NFDT_DEV_241213_A9]$ pytest -s -v $DAQSYSTEMTEST_SHARE/integtest/minimal_system_quick_test.py
======================================================= test session starts =======================================================
platform linux -- Python 3.11.7, pytest-8.3.3, pluggy-1.5.0 -- /nfs/sw/work_dirs/jcfree/NFDT_DEV_241213_A9/.venv/bin/python
cachedir: .pytest_cache
rootdir: /cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0
configfile: pytest.ini
plugins: anyio-4.6.2.post1, integrationtest-3.1.0
collected 0 items / 1 error
============================================================= ERRORS ==============================================================
_ ERROR collecting opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py _
/cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py:6: in <module>
import integrationtest.data_classes as data_classes
<frozen importlib._bootstrap>:1176: in _find_and_load
???
<frozen importlib._bootstrap>:1147: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:690: in _load_unlocked
???
.venv/lib/python3.11/site-packages/_pytest/assertion/rewrite.py:184: in exec_module
exec(co, module.__dict__)
.venv/lib/python3.11/site-packages/integrationtest/data_classes.py:21: in <module>
@dataclass
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:1230: in dataclass
return wrap(cls)
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:1220: in wrap
return _process_class(cls, init, repr, eq, order, unsafe_hash,
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:958: in _process_class
cls_fields.append(_get_field(cls, name, type, kw_only))
/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.2/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/python-3.11.7-svtllxgwlzx3niutdi32fxz7wbaapnbs/lib/python3.11/dataclasses.py:815: in _get_field
raise ValueError(f'mutable default {type(f.default)} for field '
E ValueError: mutable default <class 'integrationtest.data_classes.DROMap_config'> for field dro_map_config is not allowed: use default_factory
===================================================== short test summary info =====================================================
ERROR ../../../../../cvmfs/dunedaq-development.opensciencegrid.org/nightly/NFDT_DEV_241213_A9/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-13.2.0/daqsystemtest-NFDT_DEV_241213_A9-5mfgaaoiauk5tv2cp65bc2fiwiy3vvtc/share/integtest/minimal_system_quick_test.py - ValueError: mutable default <class 'integrationtest.data_classes.DROMap_config'> for field dro_map_config is not allowed: use ...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
======================================================== 1 error in 4.93s =========================================================
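
The `ValueError` above comes from a check in `dataclasses`: as of Python 3.11, a field default whose class is unhashable is rejected, and an instance of a plain (non-frozen, `eq=True`) dataclass is unhashable, whereas Python 3.10 only rejected `list`, `dict` and `set` defaults. The sketch below reproduces this with hypothetical stand-in classes (`DROMapConfig`, `Config`, `FixedConfig`; these are not the real `integrationtest.data_classes` definitions) and shows the `default_factory` form that the error message suggests.

```python
import sys
from dataclasses import dataclass, field


@dataclass
class DROMapConfig:
    """Hypothetical stand-in for a nested configuration dataclass."""
    n_streams: int = 1


def define_with_instance_default():
    """Define a dataclass whose field default is a dataclass instance."""
    @dataclass
    class Config:
        dro_map_config: DROMapConfig = DROMapConfig()
    return Config


@dataclass
class FixedConfig:
    # default_factory builds a fresh DROMapConfig for every FixedConfig.
    dro_map_config: DROMapConfig = field(default_factory=DROMapConfig)


try:
    define_with_instance_default()
    rejected = False
except ValueError as err:
    rejected = True
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {err}")

first, second = FixedConfig(), FixedConfig()
```

The `default_factory` form works on both 3.10 and 3.11, and it also avoids sharing one default instance between all objects of the class.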
The major difference of `NFDT_DEV_241214_A9` with respect to the `NFDT_DEV_241213_A9` nightly is that Python gets reverted back to the "classic" `3.10.10` rather than the `3.11.7`, given the failures from `dataclasses`. The main sticking point this time is that when running `minimal_system_quick_test.py`, while the configuration transition goes off without a hitch, the start transition reliably hangs for every process, whether it's a controller process or DAQ process (e.g., mlt, df-01, etc.):
Running transition 'start' on controller 'root-controller'
[16:03:03] ERROR shell_utils.py:138 controller_driver: Command 'execute_fsm_command' failed
on 'mlt' (response flag 'DRUNC_EXCEPTION_THROWN')
and if you looked at an individual process log, whereas you'd see something like this for the configuration transition:
2024-Dec-14 15:51:47,687 LOG [void dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&) at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:102] Sending POST request to daq.fnal.gov:59469/response
2024-Dec-14 15:51:47,781 LOG [dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&)::<lambda(Pistache::Http::Response)> at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:109] Response code = OK
you'd always see a hang after `Sending POST request ...` for the start transition:
2024-Dec-14 15:51:49,809 LOG [void dunedaq::restcmd::RestEndpoint::handleResponseCommand(const dunedaq::restcmd::cmdobj_t&, dunedaq::cmdlib::cmd::CommandReply&) at /tmp/root/spack-stage/spack-stage-restcmd-NBT_DEV_241214_A9-oz47s7aon65lh4f2ncc6ku4dhn4msxgl/spack-src/src/RestEndpoint.cpp:102] Sending POST request to daq.fnal.gov:59469/response
This is what led me to hypothesize that the problem had to do with the Pistache package which `restcmd` depends on and which I'd bumped up from its classic, October 2020 commit to a December 2024 commit between `NFDT_DEV_241129_A9` and `NFDT_DEV_241214_A9`.
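
The hang signature described above (a `Sending POST request` line with no `Response code` line after it) can be checked across many process logs with a few lines of Python. This is a minimal sketch: `unanswered_posts` is a hypothetical helper, not part of any DUNE DAQ tool, and it assumes that each request is answered before the next one is sent and that the two marker strings appear exactly as in the log excerpts above.

```python
def unanswered_posts(lines):
    """Return the 'Sending POST request' lines that never got a response.

    Assumes requests and responses strictly alternate within one log.
    """
    pending = None
    unanswered = []
    for line in lines:
        if "Sending POST request to" in line:
            if pending is not None:
                unanswered.append(pending)
            pending = line.rstrip("\n")
        elif "Response code =" in line and pending is not None:
            pending = None
    if pending is not None:
        unanswered.append(pending)
    return unanswered


# Shortened versions of the log lines quoted above ("[...]" marks the cut).
sample_log = [
    "2024-Dec-14 15:51:47,687 LOG [...] Sending POST request to daq.fnal.gov:59469/response",
    "2024-Dec-14 15:51:47,781 LOG [...] Response code = OK",
    "2024-Dec-14 15:51:49,809 LOG [...] Sending POST request to daq.fnal.gov:59469/response",
]

for line in unanswered_posts(sample_log):
    print("no response seen for:", line)
```

On the sample, only the 15:51:49 request (the start transition) is reported.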
Now, another important point: whether it's because of the bump from gcc 12.1.0 to gcc 13.2.0 or the bump of folly from 2021.12.13.00 to 2024.12.02.00, it seems folly refuses to link to a translation unit if it can tell that the translation unit was built with a different set of flags (it uses `F14LinkCheck` for this). As a result, since `fdreadoutlibs` uses the `-mavx2` option to access Advanced Vector Extensions 2, I needed to move this flag into daq-cmake's `daq_setup_environment` so that all code would build with it.
The major change for `NFDT_DEV_241216_A9` was to revert Pistache to its October 2020 commit, which we've been using for years. With one catch: I had to add a patch in order for it to build against gcc 13.2.0 rather than 12.1.0. It's the same change I've made in other packages, namely, adding `<cXXXXXX>` includes where needed (e.g., `<cstdlib>`).
With this change made, the hang went away. E.g., integration tests worked about as well as you'd expect, especially given the recent woes of `daq.fnal.gov`: https://github.com/DUNE-DAQ/daq-release/actions/runs/12364412829
Having said that, a couple of things to note:
- Specifically on `protodune-daq01.fnal.gov`, the integration tests (and builds, etc.) don't seem to work. Will be investigated but I wonder if it's related to (1) the jump from gcc 12.1.0 to gcc 13.2.0 and/or (2) the expanded instruction set from the `-mavx2` option passed to gcc. On the np04 cluster and `daq.fnal.gov`, this doesn't show up.
The main change for `NFDT_DEV_241230_A9` is that now, rather than building `folly` + all the DUNE DAQ packages with the `-mavx2` option as was the case for `NFDT_DEV_241214_A9` and `NFDT_DEV_241216_A9`, I build it with the `FOLLY_F14_FORCE_FALLBACK` preprocessor macro `#define`d and set to `1`. I also build the DUNE DAQ packages this way. This allows `folly` to link against translation units with different build flags.
Note that with this change, it's possible to build the full DUNE DAQ stack on `protodune-daq01.fnal.gov` whereas it wasn't before (in fact, this was done to create `NFDU_DEV_241230_A9`, same as `NFDT_DEV_241230_A9` except the choice of build machine). The integration tests continue to fail, but get further than under `NFDT_DEV_241216_A9`. Whereas with the `NFDT_DEV_241216_A9` integration tests on `protodune-daq01` nothing even booted (https://github.com/DUNE-DAQ/daq-release/actions/runs/12376009563), now with `NFDU_DEV_241230_A9` things boot, but all the applications then crash (https://github.com/DUNE-DAQ/daq-release/actions/runs/12376009563).
Digging down, what I know so far is that if, in a `NFDU_DEV_241230_A9`-based work area on `protodune-daq01`, I try this directly from the command line:
daq_application -s minimal --name mlt -c rest://localhost:0 --configurationService oksconflibs:/tmp/pytest-of-dunedaq/pytest-18/config0/integtest-session-resolved.data.xml
then there's a crash with a complaint about an Illegal Instruction (this is an example of a `daq_application` call done during integration tests). Will investigate some more, though it's important to note that the integration tests run fine on `np04` and `daq.fnal.gov`.
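
An Illegal Instruction generally means the binary contains instructions the host CPU does not implement, so comparing the CPU feature flags of `protodune-daq01` with those of a working host is a cheap first check. The following is a minimal sketch, assuming a Linux x86 host where `/proc/cpuinfo` lists CPU features on a `flags` line; the function name `cpu_flags` and the choice of extensions to print are hypothetical, and the offending instruction may well belong to some other extension.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = cpu_flags(cpuinfo.read_text())
        for wanted in ("avx", "avx2", "avx512f"):
            print(f"{wanted:8s} {'yes' if wanted in flags else 'no'}")
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Running it on both hosts and comparing the output shows which of the listed extensions are missing on the failing one.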