Sync meeting 2024 05 14
- Monthly, every 2nd Tuesday of the month at 10:00 CE(S)T
- Notes of previous meetings at https://github.com/multixscale/meetings/wiki
- Tue 11 June 2024 10:00 CEST
- Tue 9 July 2024 10:00 CEST
- who will not yet be on vacation then?
attending:
- Neja
- Alan
- Richard
- Bob
- Thomas
- Pedro
- Satish
- Xin
- Julián
- Caspar
- Nadia
project planning overview: https://github.com/orgs/multixscale/projects/1
- overview of MultiXscale planning
- WP status updates
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- [UGent] T1.1 Stable (EESSI) - due M12+M24
- Need to start working on proper monitoring of the CVMFS infrastructure
- Prometheus + Grafana dashboard + alerting? (see the first sketch after this list)
- healthy state of infrastructure (mostly server-side)
- bandwidth tests
- should start with a list of metrics to collect
- check what Terje (UiO) has done (status.eessi.io)
- one page for users (notifications about incidents)
- changelog in the documentation?
- use a YAML file for known issues? (see the second sketch after this list)
- integrate it into the init script?
- one page for admins (CERN has some detection of sites that don't use a proxy)
- set up a dedicated meeting
- be clear about what is important to whom (us, EuroHPC, ...)
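As input for the metrics discussion, here is a minimal sketch (first of two) of a probe that could feed such a Prometheus + Grafana setup: it fetches the `.cvmfspublished` manifest that every CVMFS repository serves over HTTP and exposes the published revision and manifest age as gauges. The server list, port, and metric names are illustrative assumptions, not agreed choices.

```python
# Minimal sketch of a CVMFS Stratum 1 health probe exposing metrics for
# Prometheus to scrape; server list, port and metric names are hypothetical.
import time
import urllib.request

from prometheus_client import Gauge, start_http_server

STRATUM1_SERVERS = ["http://aws-eu-central-s1.eessi.science"]  # example server
REPO = "software.eessi.io"

revision = Gauge("cvmfs_repo_revision", "Published repository revision", ["server"])
age = Gauge("cvmfs_manifest_age_seconds", "Seconds since last publish", ["server"])

def probe(server: str) -> None:
    # A CVMFS repository manifest is served at .cvmfspublished; its lines
    # are prefixed with a single letter (S = revision, T = publish timestamp).
    url = f"{server}/cvmfs/{REPO}/.cvmfspublished"
    with urllib.request.urlopen(url, timeout=10) as resp:
        for line in resp.read().split(b"\n"):
            if line == b"--":  # stop before the signature section
                break
            if line.startswith(b"S"):
                revision.labels(server=server).set(int(line[1:]))
            elif line.startswith(b"T"):
                age.labels(server=server).set(time.time() - int(line[1:]))

if __name__ == "__main__":
    start_http_server(9101)  # endpoint for Prometheus to scrape
    while True:
        for s in STRATUM1_SERVERS:
            probe(s)
        time.sleep(60)
```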
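And a second sketch, for the known-issues idea: a YAML file listing open issues per CPU target, checked at init time. The schema and file contents are hypothetical (the GROMACS issue is the one noted under T1.2 below), and the real init script would do this in bash rather than Python.

```python
# Sketch of the "YAML file for known issues" idea: warn users at init time
# about open issues that match their CPU target. The schema is hypothetical.
import yaml  # PyYAML

KNOWN_ISSUES_YML = """
- issue: https://gitlab.com/gromacs/gromacs/-/issues/5057
  software: GROMACS
  archs: [aarch64/neoverse_v1]
  description: bug revealed by the Arm Neoverse V1 builds
"""

def warn_known_issues(active_arch: str) -> None:
    for entry in yaml.safe_load(KNOWN_ISSUES_YML):
        if active_arch in entry["archs"]:
            print(f"WARNING: {entry['software']}: {entry['description']}")
            print(f"         see {entry['issue']}")

# In the real init script the active CPU target would come from archdetect;
# it is hard-coded here purely for illustration.
warn_known_issues("aarch64/neoverse_v1")
```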
- [RUG] T1.2 Extending support (starts M9, due M30)
- Our Arm Neoverse V1 builds revealed a bug (and, apparently, another one while the developers were trying to fix it) in GROMACS: https://gitlab.com/gromacs/gromacs/-/issues/5057
- started building for `zen4`
- may look into AMD GPUs, Neoverse V2, ...
- may also look into Clang and MPICH
- [SURF] T1.3 Test suite - due M12+M24
- ESPResSo test for MultiXscale added (WIP); milestone deadline: end of June. #144
- CP2K #133, LAMMPS #131, PyTorch #130 and QE #128.
- Fixed process binding within the test suite, which was not really compact. #137
- Some smaller items:
- OpenFOAM may no longer be relevant w.r.t. MultiXscale, but it is still relevant within EESSI, and development of a test is ongoing.
- A repository is needed for storing large input files, such as meshes, required by tests.
- Kenneth's suggestion: an S3 bucket on AWS?
- Skip certain tests to save time, particularly in build jobs (see the sketch after this list)
- use some lookup table
- analyse the contents of the tarball
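As a sketch of the tarball-analysis variant: list the members of the tarball a build job produced, recover the installed application names from EESSI's `.../software/<name>/<version>/...` path convention, and map them to tests via a lookup table. The table contents and test names are placeholders.

```python
# Sketch: decide which test-suite applications to run based on what a build
# job's tarball actually contains. Lookup-table entries are placeholders.
import tarfile

TEST_FOR_APP = {  # hypothetical mapping: installation name -> test name
    "GROMACS": "GROMACS_EESSI",
    "ESPResSo": "EESSI_ESPRESSO",
    "LAMMPS": "LAMMPS_EESSI",
}

def tests_to_run(tarball_path: str) -> set[str]:
    apps = set()
    with tarfile.open(tarball_path) as tar:
        for member in tar.getnames():
            parts = member.split("/")
            # installations live under .../software/<name>/<version>/...
            if "software" in parts:
                idx = parts.index("software")
                if len(parts) > idx + 1:
                    apps.add(parts[idx + 1])
    return {TEST_FOR_APP[app] for app in apps if app in TEST_FOR_APP}

# illustrative tarball name
print(tests_to_run("eessi-2023.06-software-linux-x86_64-zen4.tar.gz"))
```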
- [BSC] T1.4 RISC-V (starts M13)
- Development repository `riscv.eessi.io`
- Documentation: https://www.eessi.io/docs/repositories/riscv.eessi.io/
- Prerequisites have been made available: CernVM-FS client, build containers, RISC-V support in compatibility layer installation scripts, etc.
- Compatibility layer available in `/cvmfs/riscv.eessi.io/versions/20240402/compat/linux/riscv64/`
- Working on software layer (manually, no bot involved yet):
- Notes in https://github.com/EESSI/software-layer/issues/552
- Only `riscv64/generic` for now
- Solving lots of issues with easyconfigs, mostly adding/enabling/backporting RISC-V support
- Managed to install foss/2023b toolchain, now trying real software on top of it:
- Successfully built R 4.3.3 and dlb 3.4
- Currently trying GROMACS, which compiles, but fails in the test step (1 of 91 tests fails with segmentation fault)
- Clang is needed/provides better support for RISC-V (BSC, SiPearl)
- [SURF] T1.5 Consolidation (starts M25)
- [UGent] WP5 Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations
- FINISHED [UGent] T5.1 Support portal - due M12
- [SURF] T5.2 Monitoring/testing (starts M9)
- UiB: ongoing work to use the test suite on national HPC systems in Norway, plus low-level CVMFS availability tests (likely 2 stages: first a simple test, then adding a feature to Slurm which is only set when EESSI is available on the node, so that jobs can request that feature; see the sketch after this list)
- or even better, only start CVMFS if it is requested by the job
- Initial meeting to discuss public dashboard: https://github.com/EESSI/meetings/wiki/meeting-public-dashboard-2024-05-03
- next meeting planned for mid-June
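A minimal sketch of the second stage mentioned above: a node-side check that sets or clears a Slurm feature depending on whether EESSI is reachable, so jobs can request it with `--constraint=eessi`. The feature name and repository path are assumptions, and in practice this would likely hook into a node health checker rather than run standalone.

```python
# Sketch: (un)set a Slurm node feature depending on whether the EESSI CVMFS
# repository is reachable on this node. Feature name and path are assumed.
import os
import socket
import subprocess

REPO_PATH = "/cvmfs/software.eessi.io"
FEATURE = "eessi"

def eessi_available() -> bool:
    # Listing the repository both triggers the (autofs) mount and checks it.
    try:
        return bool(os.listdir(REPO_PATH))
    except OSError:
        return False

node = socket.gethostname().split(".")[0]
features = FEATURE if eessi_available() else ""
subprocess.run(
    ["scontrol", "update", f"NodeName={node}",
     f"AvailableFeatures={features}", f"ActiveFeatures={features}"],
    check=True,
)
```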
- FINISHED [UiB] T5.3 community contributions (bot) - due M12
- [UGent] T5.4 support/maintenance (starts M13)
- working rotation, something noteworthy?
- rotation schedule until October agreed
- bot release around the corner
- [UB] WP6 Community outreach, education, and training
- Lots of EESSI/MultiXscale activity at ISC as we speak
- UiB: preparing presentation "Making it EESSI to run bioinformatics workflows" at Norwegian Bioinformatics days (workshop about data management), May 29
- Nextflow example repository uses `.direnv` (see https://github.com/EESSI/eessi-nextflow-example)
- UiB: preparing a webinar introducing EESSI/NESSI to users on national HPC systems, date: TBD
- also market this to the NCCs (ask CASTIEL2 for budget if in person)
- discussion within scientific WPs about trainings to offer
- series of webinars with CECAM
- application to CECAM for a flagship course
- should we look into repeating EESSI-only tutorials?
- Alan finalising dates with two NCCs (Austria/Slovenia) and two CECAM nodes (running MPI application, running GPU application)
- instructor training with NCC Sweden (about how to prepare and deliver a lecture/tutorial)
- one NCC Slovenia event planned in December, Slovenia Supercomputer Days
- [HPCNow] WP7 Dissemination, Exploitation & Communication
- Task 7.1 Scientific applications provisioned on demand
- Initial discussion with HPCNow but we need a dedicated meeting
- Task 7.2 - Dissemination and communication activities
- Overlap with previous discussion in WP6
- ASHPC in June (Matej is Program Chair)
- MultiXscale poster
- ESPResSo workshop currently being disseminated
- Includes waLBerla
- Can disseminate in CASTIEL2 newsletter (used to be NCC only but now includes CoEs)
- Website needs some updating based on review feedback
- Task 7.3 - Sustainability (NIC + HPCNow!)
- due to start in June
- Legal entity for EESSI needs to be looked into
- Task 7.4 - Industry-oriented training activities (HPCNow and Leonardo)
- Subject of a meeting next week
- [NIC] WP8 (Management and Coordination)
- something about the review? 😱
- Working on a response to the letter
- 2 additional deliverables, one relevant to us on co-design
- Could be good to focus on Clang and work with vendors to help them deliver/test their customisations
- Should also start looking at Neoverse V2 (NVIDIA Grace has this)
- Connect with
- [SURF] WP1 Developing a Central Platform for Scientific Software on Emerging Exascale Technologies
- Decision from EuroHPC for CI/CD call is out
- Requested to collaborate more with CASTIEL2
- Can substitute in a technical collaboration task focussing on CI/CD
- The test suite is developing at a decent pace. Application coverage can be better, e.g. mid-level software such as BLAS libraries.
- Getting and displaying scaling information from the reported performance numbers (see the sketch below).
- We had an initial meeting w.r.t. the dashboard, but some urgent work is required (and ongoing), since we are already 7 months into the task.
- Next meeting planned mid-June.
- Working towards a prototype with already existing data.
- Maksim is already testing various databases where the performance logs can be collected.
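On the scaling point, a small sketch of the intended derivation: from runtimes reported per node count, compute speedup relative to the smallest run and parallel efficiency relative to ideal linear scaling. The numbers are made up; real input would come from the test suite's ReFrame performance logs.

```python
# Sketch: derive scaling information from reported performance numbers.
# node count -> runtime in seconds (made-up example data)
runtimes = {1: 1000.0, 2: 520.0, 4: 275.0, 8: 150.0}

base_nodes = min(runtimes)
base_time = runtimes[base_nodes]

for nodes, t in sorted(runtimes.items()):
    speedup = base_time / t
    # efficiency = achieved speedup relative to ideal linear scaling
    efficiency = speedup / (nodes / base_nodes)
    print(f"{nodes:>3} nodes: speedup {speedup:5.2f}, efficiency {efficiency:6.1%}")
```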
- WP5 (Building, Supporting and Maintaining a Central Shared Stack of Optimized Scientific Software Installations)
- ...
- ...
- ...