ClearML Agent Setup
mshannon-sil edited this page Dec 17, 2024 · 5 revisions
On Linux w/Docker
- Log in as root
sudo -i
- Install the GPU driver
ubuntu-drivers install
- Install Docker
- Install NVIDIA Container Toolkit and configure NVIDIA Container Runtime for Docker
- If using MIG partitions:
- Install nvidia-mig-manager.service
- Configure
/etc/nvidia-mig-manager/config.yaml
with your MIG configuration, e.g.
version: v1
mig-configs:
  sil-config:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        3g.47gb: 2
    - devices: [1]
      mig-enabled: false
      mig-devices: {}
- Configure
/etc/systemd/system/nvidia-mig-manager.service.d/override.conf
to use your mig-config, e.g.
[Service]
Environment="MIG_PARTED_SELECTED_CONFIG=sil-config"
- Run
nvidia-mig-parted apply
or reboot the server
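The override file selects a configuration by name, so a typo in either file leaves the MIG manager with nothing to apply. The following is a minimal sketch of a sanity check, assuming each config name sits on its own indented line ending in a colon, as in the example configuration. The function name `mig_config_defined` and the temporary file are illustrative and are not part of nvidia-mig-parted.

```shell
#!/bin/sh
# Hypothetical helper: succeed if config name $2 appears as a key in file $1.
# Assumes the name sits on its own line, indented, followed by a colon.
mig_config_defined() {
    grep -Eq "^[[:space:]]+$2:[[:space:]]*\$" "$1"
}

# Demo against a throwaway copy of the example configuration.
demo_config=$(mktemp)
cat > "$demo_config" <<'EOF'
version: v1
mig-configs:
  sil-config:
    - devices: [0]
      mig-enabled: true
      mig-devices:
        3g.47gb: 2
    - devices: [1]
      mig-enabled: false
      mig-devices: {}
EOF

if mig_config_defined "$demo_config" "sil-config"; then
    echo "sil-config is defined"
else
    echo "sil-config is missing" >&2
fi
rm -f "$demo_config"
```

To check a real server, pass /etc/nvidia-mig-manager/config.yaml and the name used in MIG_PARTED_SELECTED_CONFIG.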
- Create clearml user
adduser clearml
- Add clearml user to docker group: https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user
- Log in as clearml user. IMPORTANT: Do not create/modify any files in the clearml user directory as root.
su - clearml
- Install clearml-agent
pip install clearml-agent
- Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under
scripts/clearml_agent/clearml.conf
and fill out the ClearML credentials, git credentials, worker id, and worker name sections. Also add the following lines to the extra_docker_arguments section and fill in your AWS access key and secret access key.
extra_docker_arguments: [
    "--env","SIL_NLP_DATA_PATH=/silnlp",
    "--env","AWS_REGION=us-east-1",
    "--env","AWS_ACCESS_KEY_ID=***your access key***",
    "--env","AWS_SECRET_ACCESS_KEY=***your secret key***",
    "--env","TOKENIZERS_PARALLELISM=false",
    "-v","/home/clearml/.clearml/hf-cache:/root/.cache/huggingface"
]
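Since the skeleton marks unfilled values with `***`, it is easy to leave one behind. The following is a small sketch that lists any remaining placeholder lines in a file. The helper name `find_placeholders` and the demo file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the numbered lines of file $1 that still contain "***".
# Exit status follows grep: 0 when placeholders remain, 1 when none are found.
find_placeholders() {
    grep -n -F '***' "$1"
}

# Demo on a throwaway file with one filled and one unfilled value.
demo_conf=$(mktemp)
cat > "$demo_conf" <<'EOF'
"--env","AWS_REGION=us-east-1",
"--env","AWS_ACCESS_KEY_ID=***your access key***",
EOF

if find_placeholders "$demo_conf"; then
    echo "Fill in the values listed above before starting the agent." >&2
fi
rm -f "$demo_conf"
```

Run it against your own clearml.conf by passing its path as the argument.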
- Create a startup script called start-agents.sh, e.g.
#!/bin/sh
# Kill all clearml-agents running (-r skips kill when nothing matches; GNU xargs)
ps -A | grep clearml-agent | awk '{print $1}' | xargs -r kill -9
# GPU 0
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:0 --queue 47gb_queue
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:1 --queue 47gb_queue
# GPU 1
/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 1 --queue 94gb_queue
- Start the agents
./start-agents.sh
- To have the agents restart automatically after a reboot:
- Become the root user again
exit
- Create a file called clearml-agent in /etc/init.d/ directory, e.g.
#!/bin/sh
set -e
### BEGIN INIT INFO
# Provides: clearml-agents
# Required-Start: $syslog $remote_fs $local_fs $syslog mountall
# Required-Stop: $syslog $remote_fs $local_fs $syslog
# Should-Start:
# Should-Stop:
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: ClearML Agents and queues to service GPUs
# Description:
# "ClearML is an open source platform that automates and simplifies
# developing and managing machine learning solutions. ClearML Agent
# is a virtual environment and execution manager for DL/ML solutions
# on GPU machines." --https://clear.ml
### END INIT INFO
export PATH=/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin
NAME="clearml-agents"
# Get lsb functions
. /lib/lsb/init-functions
fail_unless_root() {
    if [ "$(id -u)" != '0' ]; then
        log_failure_msg "$NAME must be run as root"
        exit 1
    fi
}
do_start_stop() {
    STOP=""
    if [ "$1" = "stop" ]; then
        STOP="--stop"
    fi
    # Half GPUs 0:0 and 0:1 and Full GPU 1
    su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:0 --queue cheetah_47gb ${STOP}" clearml
    su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 0:1 --queue cheetah_47gb ${STOP}" clearml
    su --login --command "/home/clearml/.local/bin/clearml-agent daemon --use-owner-token --detached --docker --force-current-version --create-queue --gpus 1 --queue cheetah_94gb ${STOP}" clearml
}
case "$1" in
    start)
        fail_unless_root
        log_begin_msg "Starting $NAME"
        do_start_stop
        log_end_msg $?
        ;;
    stop)
        fail_unless_root
        do_start_stop "stop"
        ;;
    restart)
        fail_unless_root
        do_start_stop "stop"
        do_start_stop
        ;;
    status)
        ps -ef | head -1
        ps -ef | grep clearml-agent | grep -v grep
        ;;
    *)
        echo "Usage: service clearml-agents {start|stop|restart|status}"
        exit 1
        ;;
esac
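Both the startup script and the init script repeat the same daemon command once per GPU, which makes it easy for the two to drift apart. One possible way to keep them aligned is to generate the commands from a short list of GPU and queue pairs. The sketch below only prints the commands (a dry run); the helper name `agent_command` is hypothetical, and the path, flags, and queue names are copied from the start-agents.sh example.

```shell
#!/bin/sh
# Path and flags copied from the start-agents.sh example.
AGENT=/home/clearml/.local/bin/clearml-agent
FLAGS="--use-owner-token --detached --docker --force-current-version --create-queue"

# Hypothetical helper: print the daemon command for one GPU (or MIG slice) and queue.
agent_command() {
    echo "$AGENT daemon $FLAGS --gpus $1 --queue $2"
}

# One line per agent: GPU or MIG slice, then the queue it should serve.
while read -r gpu queue; do
    agent_command "$gpu" "$queue"
done <<'EOF'
0:0 47gb_queue
0:1 47gb_queue
1 94gb_queue
EOF
```

Review the printed lines before pasting them into either script.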
On SLURM
- Install Miniconda
- Clone and enter the SILNLP repo
git clone https://github.com/sillsdev/silnlp.git
cd silnlp
- Create a new conda environment using the environment.yml file in the repo
conda env create --file environment.yml
- Activate the conda environment
conda activate silnlp
- Install Poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
- Install clearml-agent-slurm
pip3 install -U --extra-index-url https://*****@*****.allegro.ai/repository/clearml_agent_slurm/simple "clearml-agent-slurm==0.4.0"
- The credentials can be found by clicking on the question mark in the upper right corner of the ClearML dashboard, then clicking ClearML Python Package setup and copying the credentials in step 1.
- Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under
scripts/clearml_agent/clearml.conf
and fill out the ClearML credentials, git credentials, worker id, and worker name.
- Set environment variables in
.bashrc
export PYTHONPATH=
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="***your access key***"
export AWS_SECRET_ACCESS_KEY="***your secret key***"
export SIL_NLP_DATA_PATH="/silnlp"
export TOKENIZERS_PARALLELISM=false
- Create a batch template file called
slurm.clearml.template
- You'll need to update the --account and --partition parameters for your use case in the example below.
#!/bin/bash
# available template variables (default value separator ":")
# ${CLEARML_QUEUE_NAME}
# ${CLEARML_QUEUE_ID}
# ${CLEARML_WORKER_ID}.
# complex template variables (default value separator ":")
# ${CLEARML_TASK.id}
# ${CLEARML_TASK.name}
# ${CLEARML_TASK.project.id}
# ${CLEARML_TASK.hyperparams.properties.user_key.value}
# example
#SBATCH --job-name=clearml_task_${CLEARML_TASK.id} # Job name DO NOT CHANGE
#SBATCH --output=task-${CLEARML_TASK.id}-%j.log
#SBATCH --account ***your account name***
#SBATCH --partition ***partition to use***
#SBATCH --time=${CLEARML_TASK.hyperparams.properties.time_limit.value:18:00:00} # Time limit hrs:min:sec
#SBATCH --nodes=1
conda activate silnlp
${CLEARML_PRE_SETUP}
echo whoami $(whoami)
${CLEARML_AGENT_EXECUTE}
${CLEARML_POST_SETUP}
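The time limit line in the template uses the default-value syntax: the text after the first `:` appears to be the fallback, which is why `18:00:00` stays intact even though it contains colons itself. The following is a rough illustration of that reading in plain shell. It assumes the split happens at the first colon, which is inferred from the example and not taken from the clearml-agent-slurm source, and `split_template_var` is a made-up name.

```shell
#!/bin/sh
# Hypothetical illustration: split "name:default" at the FIRST colon only.
split_template_var() {
    expr_text="$1"
    case "$expr_text" in
        *:*)
            # %%:* strips from the first colon onward; #*: strips up to the first colon.
            echo "name=${expr_text%%:*}"
            echo "default=${expr_text#*:}"
            ;;
        *)
            # No colon means no default value was given.
            echo "name=$expr_text"
            echo "default="
            ;;
    esac
}

split_template_var "CLEARML_TASK.hyperparams.properties.time_limit.value:18:00:00"
```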
- Start the agent
nohup clearml-agent-slurm --template-files slurm.clearml.template --queue ***queue_name***
- Press Ctrl + Z to suspend the process
- Move the process to the background
bg
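Suspending with Ctrl + Z and then running bg has much the same effect as starting the command in the background with a trailing `&`. The following is a small sketch of that alternative, assuming you also want the output collected in a log file. `start_in_background` and the log path are illustrative names, and `sleep` stands in for the real clearml-agent-slurm command line.

```shell
#!/bin/sh
# Hypothetical helper: run a command under nohup in the background,
# send its output to log file $1, and record its PID in BG_PID.
start_in_background() {
    log_file="$1"
    shift
    nohup "$@" > "$log_file" 2>&1 &
    BG_PID=$!
}

# "sleep 2" is a stand-in; replace it with the clearml-agent-slurm command.
start_in_background "${TMPDIR:-/tmp}/clearml-agent-slurm.log" sleep 2
echo "started background process $BG_PID"
```

Keeping the PID makes it possible to stop the agent later with kill instead of searching the process list.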
On Linux w/o Docker
- Log in as root
sudo -i
- Create clearml user
adduser clearml
- Log in as clearml user
su - clearml
- Install and initialize Miniconda
- Clone and enter the SILNLP repo
git clone https://github.com/sillsdev/silnlp.git
cd silnlp
- Create a new conda environment using the environment.yml file in the repo
conda env create --file environment.yml
- Activate the conda environment
conda activate silnlp
- Install Poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
- Install clearml-agent
pip install clearml-agent
- Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under
scripts/clearml_agent/clearml.conf
and fill out the ClearML credentials, git credentials, worker id, worker name, and python binary (use the conda python path).
- Set environment variables in .bashrc
export PYTHONPATH=
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="***your access key***"
export AWS_SECRET_ACCESS_KEY="***your secret key***"
export SIL_NLP_DATA_PATH="/silnlp"
export TOKENIZERS_PARALLELISM=false
- Create a startup script called start-agents.sh, e.g.
#!/bin/sh
# Kill all clearml-agents running (-r skips kill when nothing matches; GNU xargs)
ps -A | grep clearml-agent | awk '{print $1}' | xargs -r kill -9
# GPU 0
/home/clearml/miniconda3/envs/silnlp/bin/clearml-agent daemon --use-owner-token --detached --create-queue --gpus 0 --queue 24gb_queue
- Start the agents
./start-agents.sh
- Configure agents to restart on reboot
- Follow the corresponding instructions to restart agents on reboot in the On Linux w/Docker section, and modify the clearml-agent commands enclosed in quotation marks in the script to match the clearml-agent command in your start-agents.sh script.
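The variables set in .bashrc only help if they are actually present in the shell that launches the agent. The following is a quick check that can be run as the clearml user before starting the agents. The helper name `missing_env_vars` is made up, and the variable list is taken from the .bashrc step (PYTHONPATH is left out because it is deliberately set to empty).

```shell
#!/bin/sh
# Hypothetical helper: print the names of expected variables that are unset or empty.
missing_env_vars() {
    for name in AWS_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY SIL_NLP_DATA_PATH TOKENIZERS_PARALLELISM; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}

missing=$(missing_env_vars)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not set in this shell:" $missing
else
    echo "All expected variables are set"
fi
```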
On Windows
- Install Miniconda
- Clone and enter the SILNLP repo
git clone https://github.com/sillsdev/silnlp.git
cd silnlp
- Create a new conda environment using the environment.yml file in the repo
conda env create --file environment.yml
- Activate the conda environment
conda activate silnlp
- Follow these instructions to disable Git Credential Manager for Windows
- Install clearml-agent
pip install clearml-agent
- Install pywin32
pip install pywin32
- Install poetry with the official installer, not pipx
- Make sure to install the version that matches the one listed at the top of the poetry.lock file in SILNLP.
- Poetry must be installed after the conda environment is activated so that it uses the correct Python version.
- Double check that Poetry has been added to the path. You may need to restart the terminal, but make sure to activate the silnlp conda environment again upon reentering the terminal.
- Add a clearml.conf file
- Either copy it from an existing setup, or use the skeleton provided in the SILNLP repo under
scripts/clearml_agent/clearml.conf
and fill out the ClearML credentials, git credentials, worker id, worker name, and python binary (use the conda python path).
- Set the following environment variables
setx AWS_REGION "us-east-1"
setx AWS_ACCESS_KEY_ID "***your access key***"
setx AWS_SECRET_ACCESS_KEY "***your secret key***"
setx SIL_NLP_DATA_PATH "/silnlp"
setx TOKENIZERS_PARALLELISM "false"
- Create a start-agents.bat script
- There is no --detached option since it's not supported on Windows.
- Replace <username> with the name of the user running the clearml agent.
@echo off
REM Kill all clearml-agent processes running
for /f "tokens=2" %%i in ('tasklist /FI "IMAGENAME eq python.exe" /FO LIST ^| findstr clearml-agent') do (
    echo Killing clearml-agent with PID %%i
    taskkill /PID %%i /F
)
REM GPU 0
C:\Users\<username>\miniconda3\envs\silnlp\Scripts\clearml-agent daemon --use-owner-token --create-queue --gpus 0 --queue 24gb_queue
- Run the script
start-agents.bat
- Troubleshooting
- If you get import errors such as "ImportError: cannot import name 'ssl' from 'urllib3.util.ssl_'" or "ImportError: DLL load failed while importing _sqlite3: The specified module could not be found.", you need to add the DLLs and Library/bin folders inside your conda environment folder to the Path and/or copy libcrypto-1_1-x64.dll, libssl-1_1-x64.dll, and sqlite3.dll from the Library/bin folder to the DLLs folder.