Magic Castle EESSI 2023 10 11
- attendees: Thomas, Alan, Kenneth
- test Slurm cluster with Magic Castle running in AWS
- notes in magic-castle-cluster issue #5
- fully configured, working as expected
- 5TB disk space in /project
- EESSI bot is installed & configured in `bot` account by Thomas
  - works as expected, see test build in software-layer PR
- contributors need to use `repo:eessi-hpc.org-2023.06-software` in build commands (example below)
  - need to replace credentials
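A sketch of what such a build instruction in a PR comment could look like, assuming the bot's usual `bot: build` comment syntax; the `arch:` filter value is purely illustrative:

```
bot: build repo:eessi-hpc.org-2023.06-software arch:x86_64/generic
```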
- can add `nessibot` account
  - need to update Slurm there as soon as security update is available
    - depends on Slurm RPMs that need to be built by Félix-Antoine
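Once the rebuilt RPMs are available in the configured repository, one possible way to apply them manually is sketched below; Magic Castle's Puppet-managed setup may handle this differently, and the package glob and service restarts are assumptions:

```bash
# apply updated Slurm packages once the rebuilt RPMs are in the repo
sudo dnf update 'slurm*'
# restart the relevant daemons afterwards
sudo systemctl restart slurmctld    # on the controller node
sudo systemctl restart slurmd       # on compute nodes
```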
- `dnf update` on login1 + mgmt1
  - update node images for x86_64 + aarch64
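A minimal sketch of the manual update step, assuming SSH access to both nodes (hostnames taken from the notes):

```bash
# run on both login1 and mgmt1
sudo dnf update -y
# check whether a reboot is needed (e.g. after a kernel update),
# if the needs-restarting plugin is installed
sudo dnf needs-restarting -r
```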
- auto-update of packages is currently enabled (not using `skip_upgrade`)
  - updating of packages is only done on boot (so not very relevant for login nodes)
- who should get `sudo` access?
  - add pubkey to `public_keys` in `main.tf` in the right branch (sketch below)
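A sketch of the relevant part of `main.tf`, assuming Magic Castle's usual `public_keys` list; the key values and file path are placeholders:

```hcl
# main.tf excerpt: one entry per admin who should be able to log in
public_keys = [
  file("~/.ssh/id_ed25519.pub"),             # existing admin key (placeholder path)
  "ssh-ed25519 AAAAC3Nza... new-admin-key",  # newly added admin's public key (placeholder)
]
```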
- use DNS to make the move to a new cluster less painful (different IP), see example record below
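For illustration, a hypothetical zone entry that lets the cluster hostname follow the new login node; the hostname and IP below are made up:

```
; hypothetical DNS record: repoint the cluster name at the new login node's IP
login.example.org.   300   IN   A   203.0.113.42
```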
- not having protected branches is annoying
- need GitHub Teams to have protected branches in private repos
- Kenneth will ask Laura/Davide/Hugo if we can leverage Azure sponsored credits somehow
- Kenneth can also ask the GitHub open source community people
- requires GitHub Teams subscription (~$1k/year with current 25 members in EESSI org...)
- could look into a separate EESSI-admins org to reduce cost
- create accounts for active contributors once Slurm is updated
- make sure to keep track of email addresses as well!
- update README in branch with basic info, incl. IP address
- should add scripts for things like installing extra packages on the login node and in node images, setting up the bot, etc. (sketch below)
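As a starting point, a minimal sketch of such a helper script; the script name and package list are purely illustrative:

```bash
#!/bin/bash
# install-extra-packages.sh - hypothetical helper for the login node / node images
set -euo pipefail
EXTRA_PACKAGES=(vim tmux htop git)   # illustrative package list
sudo dnf install -y "${EXTRA_PACKAGES[@]}"
```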
- we'll spin up a new cluster once Magic Castle 13.x is out
- burn down test Magic Castle set up in AWS
  - OK for Alan & Thomas
  - need to check for Bob & Lara => empty accounts, so OK
  - nothing in `bot` account
    - good to burn down
- terminate CitC cluster in AWS
  - start with disabling bot there (kill screen sessions, see commands below)
  - remove all nodes so no new jobs can be started
  - set date to destroy cluster
  - inform everyone with an account
  - try and get confirmation from everyone that they're OK with having their data removed
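For the bot shutdown step above: assuming the bot runs in named `screen` sessions under the `bot` account, they can be listed and terminated as sketched here (the session name is illustrative):

```bash
# list running screen sessions of the bot account
screen -ls
# terminate a specific session by name (or use the PID shown by 'screen -ls')
screen -S eessi-bot -X quit
```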