meeting 2025 01 09

Kenneth Hoste edited this page Jan 14, 2025 · 2 revisions

Notes for 2025-01-09 meeting

  • date & time: Thu 9 Jan 2025 - 14:00 CET (13:00 UTC)
    • (every first Thursday of the month)
  • venue: (online, see mail for meeting link, or ask in Slack)
  • agenda:
    • Quick introduction by new people
    • EESSI-related meetings and events in last month(s)
    • Progress update per EESSI layer
    • Update on EESSI production repository software.eessi.io
    • Update on EESSI test suite + build-and-deploy bot
    • Integration of EESSI in EuroHPC Federation Platform
    • AWS/Azure sponsorship update
    • Upcoming/recent events
    • Q&A

Slides

Meeting notes

(by Bob/Pedro)

Quick introduction by new people

  • Christian Kniep
    • "One of the original gangsters"
    • Used to work at AWS, Docker, and others
    • Now working for MemVerge
    • Containers for AI factories
    • Interested in packaging parts of EESSI into containers
  • E4 company
    • HPC system integrator company from Italy
    • Interested in EESSI because it allows them to provide customers with a software stack
  • Jannetta Steyn
    • Newcastle University
    • Interested in running the EESSI stack on Raspberry Pis

HPCwire 2024 Readers' Choice Awards

(see slides)

  • Thank you very much for voting for EESSI! 🙏

Progress update per EESSI layer

Filesystem layer

(see slides)

  • A pull request added the capability to handle subdirectories for installed software, which is needed for dev.eessi.io

    • Check dev.eessi.io documentation for more details
  • New (minor) CernVM-FS release, major release expected soon

    • Should/can we update it? Usually the procedure for upgrading to a new major release is carefully described.
    • Clients wouldn't need to update, as older versions are still supported (up to a point). Something to look into: how old a version of the CVMFS client can we support?

Compatibility layer

(see slides)

Upcoming new EESSI version, which includes a new version of the compatibility layer.

  • Newer toolchains are tricky to install with the current compatibility layer, e.g. due to the OpenSSL version
  • Bot can build new compatibility layer 🎉
  • Probably will wait until Feb to include newer glibc
  • Test new compat layer by building some software with it and the newer toolchain.
  • Discussion on whether and when EESSI versions should be archived.

Software layer

(see slides)

  • Bunch of new software has been added to the repository

    • Slightly less than before due to the holidays and end-of-year deadlines
  • Limited CPU + GPU combinations plus large reservations for GPU nodes make AWS and Azure trickier for GPU builds.

    • Moving GPU builds to service accounts on partner systems, well received by site system administrators. Set-up and configuration in progress.
    • Service account now available for A64FX in Deucalion (Portugal). This means multiple people can work on A64FX builds.
  • BSC (Spain) RISC-V service account for RISC-V builds. Necessary to access hardware needed for builds.

software.eessi.io repository

(see slides)

  • Zen4 support is nearly ready/complete
    • Older toolchains cannot be installed, and warnings will be printed when you try to load modules depending on these toolchains
    • Other than that, it's on par with other micro-architectures

Build-and-deploy bot

(see slides)

  • Support for setting environment variables before the build starts. The main use case is to set whitelisted EasyBuild parameters that might be necessary in a given build. Variable-value pairs need to be whitelisted in the bot configuration for the bot to respond to them.

  • Discussion on handling build job submissions by bots: more concurrent bot instances are now configured for many architectures, so care is needed not to start/inject duplicate builds by accident.
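
The whitelisting of variable-value pairs described above can be pictured with a minimal Python sketch. This is not the bot's actual implementation or configuration format: the `WHITELIST` mapping, the `filter_env_vars` function, and the regular expressions are hypothetical, and the two `EASYBUILD_*` variables are only illustrative choices.

```python
import re

# Hypothetical whitelist: variable name -> regular expression that the
# requested value must match in full. The real bot configuration may
# look entirely different.
WHITELIST = {
    "EASYBUILD_PARALLEL": r"[1-9][0-9]?",  # 1 to 99 parallel build processes
    "EASYBUILD_DEBUG": r"0|1",
}


def filter_env_vars(requested, whitelist=WHITELIST):
    """Split requested variable-value pairs into accepted and rejected.

    A pair is accepted only if the variable name is whitelisted and the
    value matches the whitelisted pattern for that name.
    """
    accepted, rejected = {}, {}
    for name, value in requested.items():
        pattern = whitelist.get(name)
        if pattern is not None and re.fullmatch(pattern, value):
            accepted[name] = value
        else:
            rejected[name] = value
    return accepted, rejected


accepted, rejected = filter_env_vars({
    "EASYBUILD_PARALLEL": "16",
    "EASYBUILD_DEBUG": "yes",      # name is whitelisted, value is not
    "LD_PRELOAD": "/tmp/x.so",     # name is not whitelisted
})
print("accepted:", accepted)
print("rejected:", rejected)
```

Checking the value as well as the name reflects the note that pairs, not just variable names, need to be whitelisted.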

Monitoring

(see slides)

  • CVMFS infrastructure overview Grafana dashboard that gives a more zoomed-in view of the service status -> useful to trace back after Slack monitoring alerts
  • Network traffic information
  • Monitoring has already picked up short network outages in the Stratum 0 data center
  • Dead Man's Snitch already warned once about the alert manager not working, which was quickly acted upon.
  • Suggestion: monitoring of Stratum 1 -> Stratum 0 connection. Might already be in place (the status page may already check for this)

EESSI documentation

(see slides)

  • Please let us know (by email / Slack / opening a PR) if you know about additional systems where EESSI is available!

EESSI test suite

(see slides)

  • The eessi_mixin class should make it easy to write portable tests; documentation is now available
  • It should be even easier now to run the EESSI test suite on top of a local module stack

Integration of EESSI in EuroHPC Federation Platform (EFP)

(see slides)

  • Started 1 Jan 2025. The Ghent University HPC team is a partner and responsible for integrating EESSI in the Federation Platform. Other projects/tools are used for other components of the platform (see slides).
  • Updates when relevant in these meetings and in other channels.
  • Note: The EuroHPC Federation Platform is not funding EESSI or its development

AWS/Azure sponsored credits

(see slides)

  • AWS block storage (EBS) is a big component of the consumed credits; this can probably be optimised.
    • We should look into this and keep track of the disk usage
  • The Azure December graph is misleading: we redeployed the Slurm cluster for zen4, and the accounting is now done in a different way. The graph needs to be corrected; results should be close to those in November.

Governance

(see slides)

  • Suggestions / comments / ideas are welcome. Contact one of the members of the interim Steering Committee.

Events

(see slides)

Q&A

  • next meetings: March 6, May 8 (!)