
Sync meeting 2024-05-14


MultiXscale WP1+WP5 sync meetings


Next meetings

  • Tue 11 June 2024 10:00 CEST
  • Tue 9 July 2024 10:00 CEST
    • who is not on vacation then?

Agenda/notes 2024-05-14

attending:

  • Neja
  • Alan
  • Richard
  • Bob
  • Thomas
  • Pedro
  • Satish
  • Xin
  • Julián
  • Caspar
  • Nadia

project planning overview: https://github.com/orgs/multixscale/projects/1

Notes

  • overview of MultiXscale planning
  • WP status updates
    • [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
      • [UGent] T1.1 Stable (EESSI) - due M12+M24
        • Need to start working on proper monitoring of the CVMFS infrastructure (see the sketch after this task)
          • Prometheus + Grafana dashboard + alerting?
          • healthy state of infrastructure (mostly server-side)
          • bandwidth tests
          • should start with a list of metrics to collect
          • check what Terje (UiO) has done (status.eessi.io)
          • one page for users (notifications about incidents)
            • changelog on documentation?
              • use a YAML file for known issues?
            • integrate into init script?
          • one page for admins (CERN has some detection of sites that don't use a proxy)
        • set up a dedicated meeting
          • be clear about what is important to whom (us, EuroHPC, ...)
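As a starting point for the server-side checks mentioned above, a minimal sketch of a health probe is shown below. It assumes the standard CVMFS HTTP layout (the repository manifest is served at `/cvmfs/<repo>/.cvmfspublished`); the host name is a placeholder, not the real Stratum 1 inventory.

```python
# Minimal sketch of a server-side CVMFS health probe whose output could
# feed a Prometheus/Grafana setup; the host name below is a placeholder.
import time
import urllib.request

STRATUM1_HOSTS = ["stratum1.example.org"]  # placeholder, not the real inventory
REPOSITORY = "software.eessi.io"

def probe(host: str, repo: str) -> dict:
    """Fetch the repository manifest and derive a few basic health metrics."""
    url = f"http://{host}/cvmfs/{repo}/.cvmfspublished"
    start = time.time()
    with urllib.request.urlopen(url, timeout=10) as resp:
        raw = resp.read()
    latency = time.time() - start
    # the manifest header ends at the "--" separator, followed by a signature
    header = raw.split(b"\n--\n")[0].decode("ascii", errors="replace")
    fields = {line[0]: line[1:] for line in header.splitlines() if line}
    return {
        "latency_seconds": latency,
        "revision": int(fields.get("S", "0")),
        # 'T' is the publication timestamp; a large age may mean a stuck snapshot
        "snapshot_age_seconds": int(time.time()) - int(fields.get("T", "0")),
    }

for host in STRATUM1_HOSTS:
    print(host, probe(host, REPOSITORY))
```

The metrics this returns (latency, revision, snapshot age) map directly onto the "list of metrics to collect" action item; alerting thresholds would be layered on top in Prometheus.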
      • [RUG] T1.2 Extending support (starts M9, due M30)
        • Our Arm Neoverse V1 builds revealed a bug (and, apparently, another one while the developers were trying to fix it) in GROMACS: https://gitlab.com/gromacs/gromacs/-/issues/5057
        • started building for zen4
        • may look into AMD GPUs, Neoverse V2, ...
        • may also look into Clang and MPICH
      • [SURF] T1.3 Test suite - due M12+M24
        • ESPResSo test for MultiXscale added (WIP); milestone deadline: end of June. #144
        • CP2K #133, LAMMPS #131, PyTorch #130 and QE #128.
        • Fixed process binding within the test suite, which was not really compact. #137
        • A few small fixes:
          • Renaming of 1_cpn_2_nodes tags #140
          • set SRUN_CPUS_PER_TASK (needed on Slurm >= 22.05, < 23.11) #141
          • Temporary fix for libfabric problems on Karolina #142.
        • OpenFOAM may no longer be relevant for MultiXscale, but it is still relevant within EESSI, and development of a test is ongoing.
          • A repository is needed for storing large input files, such as meshes, required by the test.
          • Kenneth's suggestion: an AWS S3 bucket?
        • Skip certain tests to save time, particularly in build jobs (see the sketch after this task)
          • use some lookup table
          • analyse the contents of the tarball
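One way the tarball analysis could work is sketched below: list the installation directories inside a build tarball and map them to test-suite tags, so a build job only runs the relevant tests. The path layout and the name-to-tag mapping are illustrative assumptions, not the actual test-suite conventions.

```python
# Hedged sketch of the "analyse the tarball" idea: map the software
# installation directories found in a build tarball to test-suite tags.
# The path layout and tag mapping below are illustrative assumptions.
import tarfile

# hypothetical mapping from installation directory name to test tag
APP_TO_TEST_TAG = {
    "GROMACS": "gromacs",
    "ESPResSo": "espresso",
    "OpenFOAM": "openfoam",
}

def tests_for_tarball(path: str) -> set:
    """Return the set of test tags relevant to the software in the tarball."""
    tags = set()
    with tarfile.open(path) as tar:
        for member in tar.getnames():
            parts = member.split("/")
            # assume paths like .../software/<Name>/<version>/...
            if "software" in parts:
                idx = parts.index("software")
                if idx + 1 < len(parts) and parts[idx + 1] in APP_TO_TEST_TAG:
                    tags.add(APP_TO_TEST_TAG[parts[idx + 1]])
    return tags

# e.g. pass each resulting tag to ReFrame via its --tag option
print(tests_for_tarball("eessi-build-output.tar.gz"))
```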
      • [BSC] T1.4 RISC-V (starts M13)
        • Development repository riscv.eessi.io
        • Documentation: https://www.eessi.io/docs/repositories/riscv.eessi.io/
        • Prerequisites have been made available: CernVM-FS client, build containers, RISC-V support in the compatibility layer installation scripts, etc.
        • Compatibility layer available in /cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/
        • Working on software layer (manually, no bot involved yet):
          • Notes in https://github.com/EESSI/software-layer/issues/552
          • Only riscv64/generic for now
          • Solving lots of issues with easyconfigs, mostly adding/enabling/backporting RISC-V support
          • Managed to install foss/2023b toolchain, now trying real software on top of it:
            • Successfully built R 4.3.3 and dlb 3.4
            • Currently trying GROMACS, which compiles, but fails in the test step (1 of 91 tests fails with segmentation fault)
          • Clang is needed, as it provides better support for RISC-V (BSC, SiPearl)
      • [SURF] T1.5 Consolidation (starts M25)
    • [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
      • FINISHED [UGent] T5.1 Support portal - due M12
      • [SURF] T5.2 Monitoring/testing (starts M9)
        • UiB: ongoing work to use the test suite on national HPC systems in Norway, plus low-level CVMFS availability tests (likely two stages: first a simple test, then adding a node feature to Slurm that is only set when EESSI is available on the node, so that jobs can request that feature; see the sketch after this task)
          • or even better, only start CVMFS if it is requested by the job
        • Initial meeting to discuss public dashboard: https://github.com/EESSI/meetings/wiki/meeting-public-dashboard-2024-05-03
        • next meeting planned for mid-June
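A minimal sketch of the second stage is shown below: a periodic node health check (for example run via Slurm's HealthCheckProgram) that advertises an `eessi` node feature only while the EESSI CVMFS repository is accessible, so jobs can request it with `--constraint=eessi`. The feature name and the exact scontrol invocation are assumptions that would need to be validated against the Slurm version in use.

```python
# Hedged sketch of a node health check that toggles an "eessi" feature;
# the feature name and scontrol option names are assumptions (they vary
# across Slurm versions), not a tested configuration.
import os
import socket
import subprocess

EESSI_PATH = "/cvmfs/software.eessi.io"
FEATURE = "eessi"

def eessi_available() -> bool:
    # autofs mounts the repository on first access, so listing it
    # doubles as an availability probe
    try:
        return bool(os.listdir(EESSI_PATH))
    except OSError:
        return False

node = socket.gethostname().split(".")[0]
features = FEATURE if eessi_available() else ""
subprocess.run(
    ["scontrol", "update", f"NodeName={node}", f"ActiveFeatures={features}"],
    check=True,
)
```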
      • FINISHED [UiB] T5.3 community contributions (bot) - due M12
      • [UGent] T5.4 support/maintenance (starts M13)
        • support rotation: anything noteworthy?
          • rotation schedule until October agreed
        • bot release around the corner
    • [UB] WP6 Community outreach, education, and training
      • Lots of EESSI/MultiXscale activity at ISC as we speak
      • UiB: preparing the presentation "Making it EESSI to run bioinformatics workflows" for the Norwegian Bioinformatics Days (a workshop about data management), May 29
      • UiB: preparing a webinar introducing EESSI/NESSI to users of the national HPC systems, date: TBD
        • also market this to the NCCs (ask CASTIEL2 for budget if in-person)
      • discussion within scientific WPs about trainings to offer
        • series of webinars with CECAM
        • application to CECAM for a flagship course
      • should we look into repeating tutorials focused solely on EESSI?
        • Alan is finalising dates with two NCCs (Austria/Slovenia) and two CECAM nodes (running MPI applications, running GPU applications)
        • instructor training with NCC Sweden (about how to prepare and deliver a lecture/tutorial)
      • one NCC Slovenia event planned in December, Slovenia Supercomputer Days
    • [HPCNow] WP7 Dissemination, Exploitation & Communication
      • Task 7.1 - Scientific applications provisioned on demand
        • Initial discussion with HPCNow, but a dedicated meeting is needed
      • Task 7.2 - Dissemination and communication activities
        • Overlap with previous discussion in WP6
        • ASHPC in June (Matej is Program Chair)
          • MultiXscale poster
        • ESPResSo workshop currently being disseminated
          • Includes waLBerla
          • Can disseminate in CASTIEL2 newsletter (used to be NCC only but now includes CoEs)
        • Website needs some updating based on review feedback
      • Task 7.3 - Sustainability (NIC + HPCNow!)
        • due to start in June
        • Legal entity for EESSI needs to be looked into
      • Task 7.4 - Industry-oriented training activities (HPCNow and Leonardo)
        • Subject of a meeting next week
    • [NIC] WP8 (Management and Coordination)
      • something about the review? 😱
      • Working on a response to the letter
      • 2 additional deliverables, one relevant to us on co-design
        • Could be good to focus on Clang and work with vendors to help them deliver/test their customisations
        • Should also start looking at Neoverse V2 (NVIDIA Grace has this)
        • Connect with

CASTIEL2

  • Decision from EuroHPC for CI/CD call is out
  • Requested to collaborate more with CASTIEL2
    • Can substitute in a technical collaboration task focussing on CI/CD

Overview progress per WP

WP1 (Developing a Central Platform for Scientific Software on Emerging Exascale Technologies)

  • The test suite is developing at a decent pace. Coverage could be better w.r.t. applications, e.g. mid-level software such as BLAS libraries.
  • Getting and displaying scaling information from the reported performance numbers (see the sketch after this list).
  • We had an initial meeting w.r.t. the dashboard; some urgent work is required (and already underway), since we are already 7 months into the task.
    • Next meeting planned for mid-June.
    • Working towards a prototype with already existing data.
    • Maksim is already testing various databases in which the performance logs can be collected.
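As an illustration of the scaling information mentioned in this list, the sketch below derives speedup and parallel efficiency from (core count, runtime) pairs; the numbers and the data layout are hypothetical stand-ins for the real performance logs.

```python
# Hedged sketch of deriving scaling information for the dashboard from
# reported performance numbers; the (cores -> runtime) records are a
# hypothetical stand-in for the real performance logs.
runs = {  # cores -> runtime in seconds (illustrative numbers)
    128: 1000.0,
    256: 520.0,
    512: 280.0,
}

base_cores = min(runs)
base_time = runs[base_cores]
for cores in sorted(runs):
    speedup = base_time / runs[cores]
    # efficiency relative to the smallest measured core count
    efficiency = speedup * base_cores / cores
    print(f"{cores:5d} cores: speedup {speedup:5.2f}, efficiency {efficiency:4.0%}")
```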

WP5 (Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations)

  • ...

WP6 (Community outreach, education, and training)

  • ...

WP7 (Dissemination, Exploitation & Communication)

  • ...