Magic Castle EESSI 2023 10 11

Kenneth Hoste edited this page Oct 12, 2023 · 1 revision

Magic Castle clusters for EESSI


Sync meeting 2023-10-11 (13:00 CEST)

  • attendees: Thomas, Alan, Kenneth
  • test Slurm cluster with Magic Castle running in AWS
    • notes in magic-castle-cluster issue #5
    • fully configured, working as expected
      • 5TB disk space in /project
    • EESSI bot is installed & configured in bot account by Thomas
      • works as expected, see test build in software-layer PR
      • contributors need to use repo:eessi-hpc.org-2023.06-software in build commands
      • need to replace credentials
    • can add nessibot account
    • need to update Slurm there as soon as security update is available
      • depends on Slurm RPMs that need to be built by Félix-Antoine
      • dnf update on login1 + mgmt1
      • update node images for x86_64 + aarch64
    • auto-update of packages is currently enabled (not using skip_upgrade)
      • updating of packages is only done on boot (so not very relevant for login nodes)
    • who should get sudo access?
      • add pubkey to public_keys in main.tf in right branch
    • use a DNS name so that moving to a new cluster (different IP) is less painful
    • not having protected branches is annoying
      • need a GitHub Team plan to have protected branches in private repos
      • Kenneth will ask Laura/Davide/Hugo if we can leverage Azure sponsored credits somehow
      • Kenneth can also ask the GitHub open source community people
      • requires a GitHub Team subscription (~$1k/year with the current 25 members in the EESSI org...)
        • could look into a separate EESSI-admins org to reduce cost
    • create accounts for active contributors once Slurm is updated
      • make sure to keep track of email addresses as well!
    • update README in branch with basic info, incl. IP address
    • should add scripts for tasks like installing extra packages on the login node, building node images, setting up the bot, etc.
  • we'll spin up a new cluster once Magic Castle 13.x is out
  • burn down test Magic Castle setup in AWS
    • OK for Alan & Thomas
    • need to check for Bob & Lara => empty accounts, so OK
    • nothing in bot account
    • good to burn down
  • terminate CitC cluster in AWS
    • start with disabling bot there (kill screen sessions)
    • remove all nodes so no new jobs can be started
    • set date to destroy cluster
    • inform everyone with an account
    • try and get confirmation from everyone that they're OK with having their data removed
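
The `dnf update on login1 + mgmt1` step from the Slurm update item above could be scripted along these lines. This is only a rough sketch: the hostnames come from the notes, while passwordless `ssh` and `sudo` access, the helper name `build_update_commands`, and the `slurm*` package pattern are assumptions made for illustration. It only prints the commands (dry run) rather than running them.

```python
import shlex

# Hostnames taken from the notes; everything else here is an assumption.
HOSTS = ["login1", "mgmt1"]


def build_update_commands(hosts, package=None):
    """Return one ssh command (as an argument list) per host.

    If `package` is given, only matching packages are updated; otherwise
    all packages are. Assumes passwordless ssh and sudo on each host.
    """
    remote = ["sudo", "dnf", "-y", "update"]
    if package:
        remote.append(package)
    # shlex.join quotes the remote command so globs reach dnf unexpanded.
    return [["ssh", host, shlex.join(remote)] for host in hosts]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # "slurm*" is a hypothetical package pattern, not taken from the notes.
    for cmd in build_update_commands(HOSTS, package="slurm*"):
        print(shlex.join(cmd))
```

To actually run the updates, each argument list could be passed to `subprocess.run(cmd, check=True)`; keeping the dry run as the default makes it easy to review what would be executed first.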

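The subscription estimate in the protected-branches item (~$1k/year for 25 members) is simple per-seat arithmetic, sketched below. The per-seat price of $4 per user per month is an assumption (it is not stated in the notes and should be checked against current GitHub pricing), and the seat count of 5 for a separate EESSI-admins org is purely hypothetical.

```python
# Assumed list price in USD per user per month; not taken from the notes.
ASSUMED_PRICE_PER_SEAT_PER_MONTH = 4


def yearly_cost(seats, price_per_seat_per_month=ASSUMED_PRICE_PER_SEAT_PER_MONTH):
    """Yearly subscription cost in USD for an org with `seats` members."""
    return seats * price_per_seat_per_month * 12


if __name__ == "__main__":
    # 25 members is the figure from the notes.
    print("EESSI org (25 seats):", yearly_cost(25))
    # 5 admins is a hypothetical size for a separate EESSI-admins org.
    print("EESSI-admins org (5 seats, hypothetical):", yearly_cost(5))
```

Under these assumptions the cost scales linearly with the number of seats, which is why a smaller admins-only org would reduce it.
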
Previous meetings
