This repository contains all the information you need to get started running Python and Jupyter Lab on the CPUs and GPUs available on WashU's RIS Cluster
First, you will need to get access to the RIS Cluster. Each faculty gets 5TB of free storage on the RIS Storage Cluster and access to open CPUs and GPUs on the Compute Cluster. If you're a student, you will simply be assigned access to your advisor's resources. To request access, submit a ticket at https://ris.wustl.edu/support/service-desk/. This process will probably take a few days, and you'll need your advisor to oversee the process.
To access the RIS cluster, you will need to perform the following steps.
- If you are off-campus, you will need to be on the WashU VPN. Instructions on how to do so can be found at https://it.wustl.edu/items/connect/.
- You should open a terminal on your computer, and SSH into the compute environment with the command
ssh <wustl-key>@compute1-client-<N>.ris.wustl.edu
where<key>
is your official WUSTL key, and<N>
can be any number from 1 to 4- e.g.
[email protected]
- It should be noted that despite being called
compute1
, this is a "login" node, not a "compute" node. This means that you should not run any complicated code, you should instead use a "compute" node to run your code, using the steps either in Task 3, Task 11, or Task 12.
- e.g.
The only reason you needed to SSH into the compute environment was to set up your account for the first time. At this point, I will be recommending that most people use Open OnDemand to access their code, rather than SSH. If you want to use the terminal, skip ahead to Task 11.
Now that you have SSHed into your account, you have access to OpenOnDemand. This is a platform managed by RIS which gives you access to Jupyter Lab, RStudio, Matlab, and other computational tools you might use. To get started, do the following steps.
- Go to https://ood.ris.wustl.edu/
- To create a Jupyter Lab instance, click on the
Interactive Apps → Jupyter Notebook
button at the top of the page. - On the resulting page, fill out the following parameters:
Mounts
: You should paste in the following text:/scratch1/fs1/<faculty-id>:/scratch1/fs1/<faculty-id> /storage1/fs1/<faculty-id>/Active:/storage1/fs1/<faculty-id>/Active
.- When you first got access to RIS, you should have been told what your advisor's
<faculty-id>
is. This is not necessarily the same as their WUSTL Key. - If you are using the ArtSci compute conda instead of a faculty's storage, you might not have access to
/scratch1
. Instead, you should probably use this as your input:/storage1/fs1/artsci/Active/<wustl-key>:/storage1/fs1/artsci/Active/<wustl-key>
- When you first got access to RIS, you should have been told what your advisor's
Job Group
: You can leave the default option as is, which should be something like<wustl-key>/ood
User Group
: You should select the dropdown option which has your advisor's<faculty-key>
, which should be something likecompute-<faculty-key>
Queue
: You should usegeneral
.general-interactive
is also an option, but I haven't tested it with GPUs.SLA Name
: Leave this empty unless your advisor tells you to put something hereMemory
: Request as much memory as you think you'll need. If you request too much memory which isn't available on the cluster, your job will be stuck in a pending state.16GB
is a reasonable amount of memory to start with. If you find that your code keeps crashing due to lack of memory, you can increase this amount as needed.Number of Hours
: How long your Jupyter Instance will run before being automatically killed by the server. You can request up to672
hours (i.e. 28 days), but please don't just leave a job running if you're not making use of the resources, you're taking away the resources from somebody else.GPUs to Allocate
: If your code makes use of GPUs, put1
here. If your code doesn't use GPUs, put0
. You could request up to4
in this slot, but unless you are really smart with your code, you will probably not be able to make use of more than one GPU at a time.GPU Model Type
: You should useTesla V100 (32GB SXM2)
.NVIDIA A40
is potentially an option, but I have not tested this myself.Number of Processors
: This represents the number of CPUs assigned to your job. If your code doesn't make use of multiprocessing, you will not be able to make use of more than1
or2
. If you do make use of multiprocessing, you can put up to32
here, but the higher the number, the longer it may take for your job to land.Enable MPI
: Don't check this unless your code requiresMPI
Use JupyterLab
: Click this, Jupyter Lab is a much nicer interface than Jupyter NotebookCustom Jupyter Notebook Directory
: I leave this blank, you can change this if it's helpful for you.
- Once you've selected all your parameters, go ahead and launch your notebook by clicking the
Launch
button. This will take you to a new page with all your running sessions - At first, your session will have a grey background and say "Queued". If you didn't request too much RAM or GPUs or CPUs, your job should land within a minute or two
- Once your job lands, it will have a blue background and say "Pending". You will have to continue waiting for a few minutes while your environment sets up.
- Once your job finishes seting up, it will have a green background and be ready! You can open the environment by clicking on the blue
Connect to Jupyter
button on your job- If you leave this page, you can always get back to it by going to https://ood.ris.wustl.edu and clicking the
My Interactive Sessions
button at the top of the page.
- If you leave this page, you can always get back to it by going to https://ood.ris.wustl.edu and clicking the
If your job is stuck without landing, or your job lands and immediatly deletes itself, here are some commands you can try.
- SSH into the login node with
ssh <wustl-key>@compute1-client-<N>.ris.wustl.edu
- Use
bjobs -wa
to show all of your current and recent jobs - If your job hasn't landed yet (still grey), then run
bjobs -l <job-id>
, which will give you a reason for why your job hasn't landed yet - If your job has landed but is still blue after more than 10-15 minutes of waiting, then you can run
bpeek <job-id>
to view the current status of the job. - If your job exits right away, your job is crashing for some reason. You should run
cd ondemand/data/sys/dashboard/batch_connect/sys/jupyter/output/
, usels -lh
to find the most recently created folder, andcd
into that folder. You can then view the output withcat output.log
(or useless output.log
if your file is really long. Exitless
by typing in the letterq
)- One common scenario why jobs might crash immediatly is because storage1 is not set up correctly. If you try leaving the
Mounts
parameter empty and your new job doesn't crash, it means your RIS Storage hasn't been set up. Submit a ticket on RIS
- One common scenario why jobs might crash immediatly is because storage1 is not set up correctly. If you try leaving the
Once you've connected, you'll be presented with the Jupyter Lab interface. If you've never used Jupyter Lab before, here's a quick intro:
At first, you should see the Launcher taking up most of the screen. You should also see a file explorer sidebar at the left showing all your files (which may be empty if you haven't used RIS before), and a few action buttons at the top left: A blue plus, which opens new Launchers, a New Folder button, an Upload Button, and a Refresh Button.
Now that you're a little familiar with the Jupyter Lab interface, it's time to start running code.
- Open a new terminal using the Launcher. At the prompt, you should see something like
<wustl-key>@compute1-exec-216:~$
- Run the command
git clone https://github.com/saumikn/RIS-quickstart.git
. This will download all of the files needed for this tutorial, and save them into theRIS-quickstart
directory (i.e. folder). - Enter into this directory in your terminal with the command
cd RIS-quickstart
- Load the base Conda environment with
conda activate base
.- I will explain how Conda works in Task 6 - just go with it for now :)
- Once you do this, your prompt should say something like
(base) <wustl-key>@compute1-exec-216:~/RIS-quickstart$
- Open up the
task-04.py
Python file by navigating toRIS-quickstart
in the file browser, and double clicking on the file name- Nothing specific for you to do here, just good practice to look at what you're running before you run it
- In this case, you should see that this script is simply computing the first 30 Fibonacci numbers
- Back in your terminal, run the command
python task-04.py
- Once you do this, you'll get a progress bar in your terminal that shows you the status of your code, which should take about a second to run.
- Afterwards, you can read the output of the code by opening the
RIS-quickstart
directory in the file browser, and then opening the01-output.txt
file. - Because this output is so small, it's fine to save the output in the home directory (i.e. anywhere inside the
~
directory). If this output was expected to be very large, it would be better to save the output in RIS Storage. To do this, comment out line 12 and uncomment line 13. Then, rerun the script. - You can't open files saved in RIS Storage directly, but you can use the terminal to view the files. For example, you can do this with the command
cat /storage1/fs1/<faculty-key>/Active/01.txt
Running code in Jupyter is almost as easy.
- First, you should open the
task-05.ipynb
file, by double clicking on it in the file browser. - By default, the notebook will try to open in the
base
conda kernel. However, I'm currently encountering a bug and this kernel doesn't load. To fix this, you can click on thebase
orNo Kernel
button at the top left of the screen, and instead select thePython 3
kernel. The choice of which kernel you use determines which Conda environment your code runs with. We'll explain more about this in the next step. - Then, you can run your code by clicking the
Run → Run All Cells
button in the menu options. - Once the code finishes running, you can view the output saved to
02-output.txt
. - You can instead edit the code to save output file in RIS Storage instead of your home directory.
Both of these options worked easily and you were able to run the code which I've provided. However, these programs were very simple, and only required Python itself. Most interesting Data Science code you write will require additional packages such as Pandas
, Tensorflow
, or Matplotlib
. These need to be downloaded and installed on your system first before you can use them. In order to manage these packages, and anything else you might need, we're going to use a tool called Conda. Using Conda, we can specifiy in an environment.yml
file exactly which packages are required to run your code, making your code much more maintanable and replicable by others.
First, you need to run the following five commands. These commands are something specific to the RIS platform and allow you to save Conda environments directly to your storage cluster at /storage1/fs1/
which has a capacity of at least 5TB, rather than your home directory which can only store 10GB. You only need to run these commands a single time when you first set up RIS, not every time you create a new environment. You should edit the commands with your specific <faculty-id>
and then run the commands one at a time in your terminal.
mkdir -p /storage1/fs1/<faculty-id>/Active/.conda/pkgs
conda config --add pkgs_dirs /storage1/fs1/<faculty-id>/Active/.conda/pkgs
mkdir -p /storage1/fs1/<faculty-id>/Active/.conda/envs
conda config --add envs_dirs /storage1/fs1/<faculty-id>/Active/.conda/envs
conda config --set env_prompt '({name})'
- Use this last command exactly, you shouldn't replace
{name}
with anything else
- Use this last command exactly, you shouldn't replace
Commands to create a new Conda Environment (must repeat for every new Conda Environment you want to create)
Next, we're going to build a single environment which has already been predefined in the environment.yml
file included in the RIS-quickstart
directory. In order to do this, you must run the following commands.
- First, you should open the
environment.yml
file just to understand how the file is structured.name
refers to the name of the environment, how Conda distinguishes it from other environmentschannels
tells Conda where to look to download from. Almost all code you will ever need is available inconda-forge
, but you can specify a different channel here if you like.dependencies
defines which packages need to be installed from Conda. You can specify an version of a package with=
(e.g.python=3.10
) or with<
or>
. Otherwise, Conda will download the latest possible versionpip
defines which packages need to be installed from Pip. You can specify a version of a package with==
(e.g.chess==1.9.4
).- In general, if you can download a package from either Conda or Pip, you should choose Conda, unless you have a reason to choose Pip (either Conda somehow breaks things or the package itself recommends pip for some reason)
- Next, run
mamba env create -f environment.yml -v
. This will take around 10 minutes to run, as Conda will need to download a bunch of new Python packages totaling several GBs, and then install them into a new environment it's creating.- You could also run
conda env create -f environment.yml -v
, which will give you the same thing in the end, butmamba
is significantly faster thanconda
.
- You could also run
- Once it's finished installing, you can enter the environment with
conda activate <env-name>
, filling in<env-name>
- e.g.
conda activate quickstart-env
- e.g.
- To set up Jupyter, run
python -m ipykernel install --user --name <env-name> --env LD_LIBRARY_PATH $CONDA_PREFIX/lib
, filling in<env-name>
- e.g.
python -m ipykernel install --user --name quickstart-env --env LD_LIBRARY_PATH $CONDA_PREFIX/lib
- e.g.
- To set up your terminal, run
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib
.- When you do this, it will probably ask you to reactivate your environment with
conda activate <env-name>
, and you should do this as well
- When you do this, it will probably ask you to reactivate your environment with
Running a Python script in your custom Conda Environment is just as easy as running it in the base environment (Step 5). All you need to do ensure that your terminal is currently activated with the intended environment. You can tell which environment you're in based on the terminal prompt. If the prompt starts with (conda)
or (base)
, it means you're in the base
enviornment that ships with Conda, which contains almost no packages outside of Python itself. If the prompt starts with anything else (e.g. (quickstart-env)
), it means you're in the environment with that name. If your prompt does not contain anything like this, you are not in a Conda environment.
To run a script, run the following commands:
- Enter your Conda environment with
conda activate <env-name>
, filling in<env-name>
- e.g.
conda activate quickstart-env
- You will get a WARNING that you are overwriting environment variables set in the machine. You can ignore this, as we explicitly set up this behavior earlier when we ran the
conda env config vars set LD_LIBRARY_PATH=$CONDA_PREFIX/lib
command in the last step - Once you do this, your prompt should say something like
(quickstart-env) <wustl-key>@compute1-exec-216:~/RIS-quickstart$
- e.g.
- Run the Python script with
python task-07.py
What happens if you try running this script in the base environment? Go ahead and try it by running conda activate base
and python task-07.py
.
Running a Jupyter Notebook in your custom Conda Environment is also easy.
- Open the
task-08.py
notebook by double clicking on the file inside of theRIS-quickstart
directory in the file explorer - By default, Jupyter may open your notebook in the base environment instead of your custom environment. You can check which environment you're in with the button at the top right, which will either say
quickstart-env
orbase
orPython 3
. - Switch to the
quickstart-env
envrionment by clicking on the button at the top right, and selectingquickstart-env
in the dropdown- When you first select an environment, you will see a grey circle with a lightning bolt in the center. This means your environment is loading, and you can't run anything yet
- Once your environment finishes loading, the symbol will turn into a white circle instead. Whenever you see a white circle, this means your environment is ready to run code
- If your notebook is in the middle of running code, this symbol will be a grey circle
- You can run the entire notebook by clicking on the
Run → Run All Cells
button.- You can also run cells one at a time by either clicking on the grey triangle in the toolbar, or using the
Shift+Enter
keyboard shortcut. These will run whichever cell you have selected, highlighted with a blue bar
- You can also run cells one at a time by either clicking on the grey triangle in the toolbar, or using the
If you've already built a Conda environment and need to add additional dependencies to your project, you have two options.
The easiest option is to just edit your enviroment with the following commands
- Edit your
environment.yml
file with your intended changes - Run
mamba env update --f environment.yml -v --prune
- You can also use
conda
instead ofmamba
, and both will give you the same thing in the end, butmamba
is faster
- You can also use
Occassionally, the above approach will give you an error, for whatever reason. If this happens, the next easiest option is to delete your current environment and recreate it from scratch. To do this, run the following steps:
- Enter the environment you want to delete with
conda activate <env-name>
, filling in<env-name>
- e.g.
conda activate quickstart-env
- e.g.
- Unlink the environment from Jupyter with
jupyter kernelspec uninstall <env-name>
, filling in<env-name>
- e.g.
jupyter kernelspec uninstall quickstart-env
- This step essentially undoes the command entered earlier in Task 6 of
python -m ipykernel install --user --name <env-name> --env LD_LIBRARY_PATH $CONDA_PREFIX/lib
- e.g.
- Deactivate the environment you want to delete with
conda deactivate
- Delete the environment you want to delete with
conda remove --name <env-name> --all
, filling in<env-name>
- e.g.
conda remove --name quickstart-env --all
- e.g.
- Recreate the new environment by following all of the steps in Task 6, in the section on "Commands to create a new Conda Environment"
I've created a Jupyter Notebook file called task-10.ipynb
. Your task is to create a new Conda environment which can run the task-10.ipynb
file without any errors. In addition, you should try to make your environment file as small and self-contained as possible. You should try to only list Python packages that you explicitly need in your code, and remove all dependencies which aren't necessary for your code!
For this exercise, you should practice updating your existing environment.yml
files using the steps listed in Task 9. However, it should be noted that in general it is better practice to have entirely different environment.yml
for each seperate projects (each located in seperate directories), rather than a single Conda environment which can run multiple projects. Otherwise, you wouldn't be able to run e.g. task-08.ipynb
and task-10.ipynb
at the same time, since both require mostly different dependencies, and sharing your work becomes more difficult.
Sometimes, you will want to run the same Python script with many different parameters. You could do this inside Jupyter Lab, but this could get cumbersome if you have hundreds or thousands of parameters you want to run. Instead, we can submit this Python script as a seperate job on the RIS cluster. This will allow us to easily manage multiple parameters, and allow us potentially use more resources, simultaneously, allowing us to finish running our code faster
As an example, we will demonstrate how to submit the task-11.py
script as seperate jobs, so we can run run the code simultanously, with different parameters. If you take a look inside the task-11.py
file, you will see that this script takes in a single parameter, shape
, creates a random matrix in Pytorch with that shape, loads the matrix on either the CPU or GPU depending on availability, performs a number of matrix operations on it, saves the matrix to our RIS Storage, and prints the runtime.
Here are the steps for how to submit task-11.py
as jobs.
- Open the
task-11.py
script and edit the script to point to the correct RIS Storage path- Takes in a single parameter,
shape
- Creates a random matrix in Pytorch with that shape
- Loads the matrix on either the CPU or GPU depending on availability
- Performs 100 matrix operations on it
- Times how long the operations takes
- Saves the matrix to our RIS storage (make sure to edit the
output_folder
parameter)
- Takes in a single parameter,
- Open the
task-11-gpu.sh
file, and edit the parameters to point to the correct RIS Storage path and use your correct<wustl-key>
- You should also read the comments to understand what each of the other parameters represent
- Run
mkdir -p /storage1/fs1/<faculty-id>/Active/quickstart/job_output
, as the-o
parameter intask-11-gpu.sh
requires this directory to exist. - Run
bgadd -L 100 /<wustl-key>/limit100
. This will create a new fair share allocation group for you so you can run up to 100 jobs simultaneously.- Don't misuse this, or RIS can remove your access to the Compute Cluster
- In your terminal, run the command
unset LSF_DOCKER_PORTS
. Otherwise, you will not be able to submit the jobs. - Finally, you can submit the jobs with
bash task-11-gpu.sh
- Random note, submitting these jobs may create a file in your
RIS-quickstart
directory calledlib_check
, containing the ominous sounding textDie will Fire...
. I don't know what this file does, or what it means, but I don't think it's a problem.
- Random note, submitting these jobs may create a file in your
- While your jobs are running, you can use the command
bjobs -w | sort
to view all the jobs you've created, and their status (if it's pending or running) - Once your job is running, you can view the current output with
bpeek <job-id>
. This command will not work once your job finishes - When your jobs finish, you will get an email with a summary of your job and some statistics.
- To view the output of your jobs, enter the job output directory with
cd /storage1/fs1/<faculty-id>/Active/quickstart/job_output
. You can then see the names of all the files in this directory with thels -lh
command. You can then view a specific job's output with the commandcat <file-name>
- We can then view the data created by these commands by going to
cd /storage1/fs1/<faculty-id>/Active/quickstart/data_processed/gpu
and typing inls -lh
. The actual data is not in a human-readable format, but you can at least see how large each file is.- What do you notice about the file sizes?
- Next, we can try doing the same process, but submitting our code as CPU-only rather than running our code on a GPU. First, open the
task-11-cpu.sh
file, and edit the parameters as needed - Then, run the script with
bash task-11-cpu.sh
(making sure to go back to theRIS-quickstart
directory first withcd ~/RIS-quickstart
) - Once the jobs are done, you can view the data output at
/storage1/fs1/<faculty-id>/Active/quickstart/data_processed/cpu
- Are the file sizes different with the gpu-created files? Why or why not?
- You can also view the job outputs at
/storage1/fs1/<faculty-id>/Active/quickstart/job_output
- What do you notice about the runtimes for each file?
- Hint, try running
ls -v * | xargs -I{} grep -H "Function Runtime" {} | awk -F: '{printf "%-25s %s\n", $1":", $2":"$3}'
At times, you may encounter a scenario where Open OnDemand doesn't work for your code. Maybe you need a different operating system instead of Ubuntu, or you need additional terminal commands such as sudo
, or you need additional features in Jupyter such as Tensorboard or the NVIDIA GPU Dashboards. If any of these is true, you won't be able to use Open OnDemand, you'll have to build your own Docker image and run code for that.
In this section, I'm going to be a little more brief with my instructions, and I'm assuming that anybody that needs to build a Docker container is already proficient with the command line. As always, if you have any questions, you should consult the RIS Documentation or submit a ticket with the RIS Service Desk.
- Create an account on Docker Hub
- Create a new public repository on Docker Hub, named
<container-name>
- Download the Docker Desktop app on your personal computer and sign into your account
- Download the
Dockerfile
anddocker-environment.yml
files from this Github repo onto your personal computer - Customize the
Dockerfile
with whatever requirmenets you need for your setup. The file I've provided is the simplest possibleDockerfile
which will work on the RIS Cluster, but you can make any additional changes you need. - If you want, you can customize the
docker-environment.yml
with additional packages. In general, I would try to keep this environment as minimal as possible, and use the approach described in Task 6 to define custom conda environments for each project. If you need to install additional Jupyter extensions, you will need to do it here, rather than in your project-specific environment configs. - Build your Docker container with
docker build -t <docker-id>/<container-name>:<optional-tag>
, where<docker-id>
is your username on Docker Hub, and<container-name>
is the name of the public repository you created. Each repository can store multiple versions of a container, and each of these are distinguished by their<tag>
. - Push your Docker container to Docker Hub with
docker push <docker-id>/<container-name>:<optional-tag>
. Depending on the size of your container, this could take anywhere from several minutes to several hours. - Log into the compute cluster with
ssh <wustl-key>@compute1-client-<N>.ris.wustl.edu
- This will bring you to one of the RIS login nodes. This is essentially a server where you can launch and view jobs which run your code. However, the login node isn't intended as a computer to run intensive code directly, so you'll have to learn how to submit jobs.
- Download the
jupyter-gpu.sh
file from this Github repo- If you don't need to request a Jupyter Lab instance with a GPU, use
jupyter-cpu.sh
instead
- If you don't need to request a Jupyter Lab instance with a GPU, use
- Edit the
jupyter-gpu.sh
command to refer to your advisor's<faculty-id>
- Launch a new Jupyter Lab instance by running the script
bash RIS-quickstart/jupyter-gpu.sh
with whatever parameters you want to customize.- For example, if you want to launch a job with 100GB of memory, a time limit of 24 hours, and 8 CPUs, you would use
bash RIS-quickstart/jupyter-gpu.sh -m 100GB -t 24 -c 8
- If you don't need a GPU, you would use
bash RIS-quickstart/jupyter-cpu.sh -m 100GB -t 24 -c 8
- For example, if you want to launch a job with 100GB of memory, a time limit of 24 hours, and 8 CPUs, you would use
- In order to check if your job has landed, use the
bjobs -w
command. Once your job lands, the job status will change fromPEND
toRUN
- If your job is taking a really long time to land, you can use the
bjobs -l <job-id>
command to get a detailed description of your job, including the reason why it hasn't landed yet - I created a script called
cluster_status.py
, which allows you to view status of the cluster. You can run ith withpython3 cluster_status.py
(on the login nodes, Python 2 is still the default even though it's 2024). This script is very helpful to see how occupied the cluster is, and can tell you potentially why your jobs aren't landing, if there are open GPUs or CPUs, etc.
- If your job is taking a really long time to land, you can use the
- Once your job has landed, use the
bpeek <job-id>
command to view your Jupyter Notebook url, which will look something likehttp://compute1-exec-207.ris.wustl.edu:8401/lab?token=<token>
- If your job output is too long, you can use
bpeek <job-id> | grep http
as a shorthand to find the url - There are a bunch of additional compute cluster commands which mostly all start with the letter b, and many of them are useful. You can read more about them at https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=started-quick-reference
- If your job output is too long, you can use
- Finally, you can access your Jupyter Lab instance by going to the given URL.
- In order to access this url, you need to be on the WashU VPN - even if you're on campus and connected to WashU's network
- The first time you access the instance, you will need to copy paste the entire URL, including the token.
- Afterwards, you can drop the token, and just enter the shortened url (e.g.
http://compute1-exec-207.ris.wustl.edu:8401/lab
)
Now you're in! Until your job dies, you can continue to access this url to get to Jupyter. You can now continue with Tasks 4 through 11.