Today will be scattered across multiple topics, but hopefully this will be the last super-scattered lesson. I'll post a topics schedule for the rest of the quarter shortly, but most of the remaining lab sessions will focus on specific topics, with a few hands-on practice sessions in the middle.
Rough outline:
- logging into farm and running RStudio Server from Windows and Mac.
- working on multiuser systems - a brief intro
- UNIX commands more generally - cd, ls, rm, files and folders, and commands
- running the sourmash workflow on farm - what are all the steps?
- what is sourmash doing "underneath" - a brief introduction to k-mer comparisons and Jaccard similarity (if we have time)
Revisit week 2 notes on logging into farm
Revisit week 2 notes on running RStudio Server
Step by step - what is happening here?
- logging in to farm (first time)
- srun command
- modules command
- rserver-farm
- logging in again with ssh
- connecting to RStudio Server
- farm head node doesn't run big compute jobs, so you can't run RStudio there
- farm compute nodes are "hidden" from the Internet, so you can't connect directly to them
- so you tell ssh to "tunnel" from your laptop through farm head node to the farm compute nodes
It's actually kind of astonishing that it works... but it generally does!
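As a rough sketch of what the tunnel looks like: the general shape is `ssh -L`, which forwards a port on your laptop through the head node to a compute node. Every value below (username, head node address, compute node name, port) is a placeholder assumed for illustration; the real values come from the week 2 instructions. The sketch only prints the command instead of running it.

```shell
# All four values are hypothetical placeholders for illustration.
user="datalab-02"                  # your farm username
head_node="farm.cse.ucdavis.edu"   # assumed head node address
compute_node="cpu-node-1"          # made-up compute node name
port=8787                          # made-up port number

# -L local_port:target_host:target_port
# forwards local_port on your laptop, via the head node, to target_port
# on the compute node.
tunnel_cmd="ssh -L ${port}:${compute_node}:${port} ${user}@${head_node}"

# Print rather than run, since the placeholder node does not exist.
echo "$tunnel_cmd"
```

With a tunnel like this open, your browser talks to a port on your own laptop and ssh carries the traffic the rest of the way.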
Question: why do you need two ssh connections to farm?? Why can't we do all of the commands in one!?
Adapted from: Introduction to the UNIX Command Line
- visualize file/directory structures
- understand basic shell vocabulary
- gain exposure to the syntax of shell & shell scripting
- look at the contents of a directory
- commands: `pwd`, `ls`, `cd`
The shell is a computer program that uses a command line interface (CLI) to pass commands typed on your keyboard to your operating system. Most people are used to interacting with a graphical user interface (GUI), where you can use a combination of your mouse and keyboard to carry out commands on your computer. We can use the shell through a terminal program.
Everything we can do using our computer GUI, we can do in the shell. We can open programs, run analyses, create documents, delete files and create folders. We should note that folders are called directories at the command line. For all intents and purposes they can be used interchangeably but if you'd like more information please see "The folder metaphor" section of Wikipedia.
The ease of getting things done via the shell will increase with your exposure to the program.
To get started with this section,
- open a terminal window (in ssh or with RStudio)
- run:
cd
wget https://s3-us-west-1.amazonaws.com/dib-training.ucdavis.edu/shell-data.zip
unzip shell-data.zip
This will preload some data that we will explore.
When we open up the terminal and/or connect to farm with ssh, we will see a line of text. This is a prompt statement. It can tell us useful things such as the name of the directory we are currently in, our username, or what computer we are currently running terminal on.
Let's take a look around. First, we can use the print working directory command (`pwd`) to see what directory we are currently located in.
pwd
This gives us the absolute path to the directory where we are located. An absolute path shows the complete series of directories you need to locate either a directory or a file starting from the root directory of your computer.
What is the root? A useful way to start thinking about directories and files is through levels. At the highest level of your computer, you have the root directory. Everything that is contained in your computer is located in directories below your root directory.
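To make the "levels" idea concrete, here is a small sketch that starts from wherever you are and climbs one level at a time until it reaches the root directory `/`. It uses `dirname`, which is not covered in this lesson; it simply strips the last component off a path.

```shell
# Start from the current working directory.
dir="$(pwd)"
levels=0

# Keep stepping up to the parent directory until we hit the root.
while [ "$dir" != "/" ]; do
    echo "$dir"
    dir="$(dirname "$dir")"
    levels=$((levels + 1))
done

echo "/"
echo "climbed $levels level(s) to reach the root"
```

Each line printed is an absolute path, and each one is a prefix of the line before it.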
We can also look at the contents of the directory by using the `ls` ("list") command:
ls
This command prints out a list of files and directories that are located in our current working directory. Let's look at the subdirectory `data/`.
To change the working directory, we need to use the `cd` ("change directory") command. Let's move into the data directory.
cd data
Let's have a look around.
ls
We can see the following files:
MiSeq Slide1.jpg hello.sh nano1.png README.md gvng.jpg nano2.png
However, this directory contains more than the eye can see! To show hidden files we can use the `-a` option.
ls -a
We will see the following:
. MiSeq Slide1.jpg hello.sh nano1.png .. README.md gvng.jpg .hidden nano2.png
Three new items pop up: `.`, `..` and `.hidden`.
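Here is a self-contained sketch of the same idea in a scratch directory (made with `mktemp -d`, so it does not touch the lesson data). The file names are invented for illustration. Names that start with a dot are skipped by plain `ls` and shown by `ls -a`, along with `.` (this directory) and `..` (the directory above).

```shell
# Work in a throwaway directory so nothing in your real files changes.
demo="$(mktemp -d)"
touch "$demo/visible.txt" "$demo/.secret"   # made-up file names

# Plain ls skips names that begin with a dot.
plain="$(ls "$demo")"
echo "ls shows:"
echo "$plain"

# ls -a includes dot files, plus . and ..
all="$(ls -a "$demo")"
echo "ls -a shows:"
echo "$all"
```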
Using options with our commands allows us to do a lot! But how did we know to add `-a` after `ls`? Most commands offer a `--help` option. Let's look at the available options that `ls` has:
ls --help
Here we see a long list of options. Each option will allow us to do something different.
For example,
ls -F
will add an indicator to the end of each name showing what kind of thing it is: for example, `/` for a directory and `*` for an executable file.
::::info Let's check out explainshell.com together!! ::::
We can also combine options:
ls -aFl
This combination of options will list all the contents of the directory, including hidden files (`-a`), in long format (`-l`), and differentiate between file types (`-F`).
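A quick way to convince yourself that stacking single-letter options is just shorthand: the sketch below runs `ls` both ways on a scratch directory and compares the output. The directory and file names are made up for the example.

```shell
# Scratch directory with one subdirectory and one file (made-up names).
demo="$(mktemp -d)"
mkdir "$demo/subdir"
touch "$demo/file.txt"

# Options given one at a time...
separate="$(ls -a -F -l "$demo")"
# ...and the same options stacked behind a single dash.
combined="$(ls -aFl "$demo")"

echo "$combined"
if [ "$separate" = "$combined" ]; then
    echo "ls -a -F -l and ls -aFl printed the same listing"
fi
```

In the listing, the `-F` indicator shows up as a trailing `/` on `subdir/`.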
- paths
- look at the contents of files
- perform functions outside of the directory you are in
- intro to the wildcard expression:
*
- copy, move and remove files
- create and remove directories
- understand the structure of commands
- commands: `cat`, `cp`, `mv`, `rm`, `mkdir`
Now we have seen how to navigate around our computers and see what is located in the directory we are in. But some of the beauty of the shell is that we can execute activities in locations that we are not currently in. To do this we can either use an absolute path or a relative path. A relative path is the path to another directory from the one you are currently in.
Navigate into the `tmp1` directory located in the `.hidden` directory.
cd .hidden/tmp1
Here we see two files, `notit.txt` and `thisinnotit.txt`. We can see what is in the files using the `cat` command, which concatenates and prints the content of the files we list.
cat thisinnotit.txt
This is not the text file you're looking for
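The "concatenate" part of `cat` only shows up when you give it more than one file. A minimal sketch, using two made-up files in a scratch directory:

```shell
# Two small files with invented names and contents.
demo="$(mktemp -d)"
echo "first file" > "$demo/one.txt"
echo "second file" > "$demo/two.txt"

# cat prints the files back to back, in the order given.
joined="$(cat "$demo/one.txt" "$demo/two.txt")"
echo "$joined"
```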
NOTE - you can use TAB to do filename completion, so if you type `cat this` and then press your Tab key once, it will autocomplete if there is a unique match. If there is more than one match, the first Tab will do nothing, and the second will show all the possible matches.
Let's see what else is in the other tmp directories:
ls ../tmp2
and we can see the contents of tmp3
ls ../tmp3
So, even though we are in the `tmp1/` directory, we can see what is in other directories by using the relative path to the directory of interest. Note that we can use absolute paths too. You may have noticed the `../`; this is how to get to the directory above the one you are currently located in.
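To see relative and absolute paths side by side without touching the lesson data, here is a sketch on a scratch directory. The names `here`, `there` and `example.txt` are invented for the example.

```shell
# Build a tiny directory tree: base/here and base/there/example.txt
base="$(mktemp -d)"
mkdir "$base/here" "$base/there"
touch "$base/there/example.txt"

# Stand in "here" and look at "there" two different ways.
cd "$base/here"
relative="$(ls ../there)"        # path starting from where we are
absolute="$(ls "$base/there")"   # full path starting from the root

echo "relative path listing: $relative"
echo "absolute path listing: $absolute"
```

Both commands list the same directory. The relative one only works from `here`, while the absolute one works from anywhere.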
Note: in this case, we have access to the RStudio file browser, too, which is really nice. But in the future we won't. So we can use the file browser today, but on Farm we'll have to get by with just the command line interface and no other interface!
CHALLENGE: Use the absolute path to list the files in the tmp2 directory.
Wouldn't it be nice to see the contents of all the tmp directories at once? We can use a wildcard pattern to match a sequence of characters (like the numbers 1, 2 and 3 at the end of the tmp directories). The wildcard character `*` matches any number of characters, including none. (Shell wildcards look a bit like regular expressions, but they are a separate and simpler pattern language.)
ls ../tmp*
../tmp1:
notit.txt  thisinnotit.txt

../tmp2:
anotherfile.txt

../tmp3:
closebutnotit.txt  youfoundit.txt
So, even though we are in the `tmp1` directory we can use a relative path.
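One detail worth knowing: it is the shell, not `ls`, that expands the `*`. The sketch below (scratch directory, made-up names) uses `echo` to show what the shell turns the pattern into before any command sees it.

```shell
# Scratch directory with three matching names and one non-matching name.
demo="$(mktemp -d)"
cd "$demo"
mkdir tmp1 tmp2 tmp3 other

# The shell replaces tmp* with the matching names before echo runs.
matches="$(echo tmp*)"
echo "$matches"

# Quoting the pattern switches expansion off, so echo sees it literally.
literal="$(echo 'tmp*')"
echo "$literal"
```

So `ls ../tmp*` is really `ls ../tmp1 ../tmp2 ../tmp3` by the time `ls` starts.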
We are quite used to moving, copying and deleting files using a GUI. All of these functions can be carried out at the command line with the following commands:
Copy files with the `cp` command by specifying a file to copy and the location of the copied file. Here we will copy `thisinnotit.txt` into the file `thisisacopy.txt`.
cp thisinnotit.txt thisisacopy.txt
The syntax for the copy command is `cp <source_file> <destination_file>`. Using this syntax we can copy files to other directories as well:
cp thisinnotit.txt ../tmp2
If we navigate to the tmp2 directory and list the files that are in it, we will see that the `thisinnotit.txt` file has been copied to the tmp2 directory.
cd ../tmp2
ls -l
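The two forms of `cp` used above behave slightly differently, which this sketch tries to make visible on a scratch directory (all names invented): when the destination is a new file name you get a copy under that name, and when the destination is an existing directory the copy keeps its original name inside that directory.

```shell
# Scratch directory with one file and one empty subdirectory.
demo="$(mktemp -d)"
cd "$demo"
mkdir dest
echo "hello" > original.txt

# Destination is a file name: the copy gets the new name.
cp original.txt copy.txt

# Destination is an existing directory: the copy keeps its name.
cp original.txt dest

# The original is still there; cp never removes the source.
ls . dest
```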
CHALLENGE: Use the `mv` command to move the `thisinnotit.txt` file from tmp2 to tmp3.
Once we know how to copy and move files, we can also copy and move directories. We can create new directories with the command `mkdir`. Let's make a new directory called `tmp4`:
cd ../
mkdir tmp4
ls -l
The shell is quite powerful and can create multiple directories at once. It can create multiple directories in the current working directory:
mkdir tmp5 tmp6
ls -l
or it can create a series of directories nested inside one another:
mkdir -p how/deep/does/the/rabbit/hole/go
We can use tab completion to get to the `go` directory. Type `cd h` then hit Tab. If you hit Tab enough times your command will eventually read:
cd how/deep/does/the/rabbit/hole/go/
You can see that we've created a bit of a monster directory structure...
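Why the `-p`? Without it, `mkdir` refuses to create a directory whose parent does not exist yet. A small sketch on a scratch directory (the names `one/two/three` are made up):

```shell
demo="$(mktemp -d)"
cd "$demo"

# Plain mkdir fails here because one/ and one/two/ do not exist yet.
if mkdir one/two/three 2>/dev/null; then
    plain_status="created"
else
    plain_status="failed"
fi
echo "mkdir without -p: $plain_status"

# With -p, mkdir creates every missing parent along the way.
mkdir -p one/two/three
ls -R one
```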
CHALLENGE: Navigate to the data directory and use the `rm` command to remove the `how` directory and all its contents.
This nicely hints at the power of the shell - you can do certain things (in this case, create a nested hierarchy of directories) much more easily in the shell. But that power cuts both ways - you can also mess things up more easily in the shell!
This lesson focused on file and directory exploration because that's something everyone needs to know, and all these commands will work on pretty much any computer that is running a UNIX compatible shell (including Mac OS X and Windows Subsystem for Linux).
Google (and especially Stack Overflow) is your friend! Use Internet search whenever you have questions about what a command does, or what commands to use to achieve a particular task.
(Adapted from Running programs on remote computers and retrieving the results.)
You don't have to be logged in just once!
On Mac OS X, you can use Command-N to open a new Terminal window, and then ssh into farm from that window too.
On Windows, you can open a new connection from MobaXterm simply by double clicking your current session under "User sessions."
In RStudio Server, you can open a second Terminal.
What you'll end up with are different command-line prompts on the same underlying system.
They share:
- directory and file access (filesystem)
- access to run the same programs, potentially at the same time
They do not have the same:
- current working directory (`pwd`)
- running programs, and stdin and stdout (e.g. the output of `ls` in one will not go to the other)
- activated modules or conda environments (more on this later)
These are essentially different sessions on the same computer, much like you might have multiple folders or applications open on your Mac or Windows machine.
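You can get a feel for "separate working directories" inside a single terminal. This is only an analogy, not the same thing as two ssh logins: `$( ... )` runs its commands in a subshell, which is a separate shell process with its own current directory.

```shell
# Where are we now?
start="$(pwd)"

# Change directory inside a subshell and report where it ended up.
inner="$(cd / && pwd)"

# Back in the original shell, nothing has moved.
after="$(pwd)"

echo "outer shell before: $start"
echo "subshell after cd:  $inner"
echo "outer shell after:  $after"
```

Two terminal windows logged into farm behave the same way: a `cd` in one has no effect on the other.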
You can log out of one independently of the other, as well.
And you can have as many terminal connections as you want! You just have to figure out how to manage them :).
If you start using remote computers frequently, you may end up logging into several different computers and have several different sessions open at the same time. This can get ...confusing!
There are several ways to help track where you are and what you're doing.
One is via the command prompt. You'll notice that on farm, the command prompt contains three pieces of information by default: your username, the machine name ('farm'), and your current working directory! This is precisely so that you can look at a terminal window and have some idea of where you're running.
You might also find the following commands useful:
This command will give you your current username:
whoami
and this command will give you the name of the machine you're logged into:
hostname
These can be useful when you get confused about where you are and who you're logged in as :)
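As a sketch, you can glue these together with `pwd` to print the same three pieces of information that the farm prompt shows. The exact layout of the real prompt may differ; the `user@machine:directory` format here is just an assumption for the example.

```shell
# Build a "who, where, which directory" string from three commands.
where="$(whoami)@$(hostname):$(pwd)"
echo "$where"
```

Running this in each open terminal is a quick way to tell your sessions apart.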
You can use the `ps` command to see what your account, and other accounts, are running:
ps -u datalab-02
This lists all of the different programs being run by that user, across all their shell sessions.
The key column here is the last one, which tells you what program is running under that process.
You can also get a sort of "leaderboard" for what's going on on the shared computer by running
top
(use 'q' to exit).
This gives a lot of information about running processes, sorted by who is using the most CPU time. If the system is really slow, it may be because one or more people are running a lot of things, and `top` will help you figure out if that's the problem.
This is one of the consequences of having a shared system. You have access to extra compute, disk, and software that's managed by professionals (yay!), but you also have to deal with other users (boo!) who may be competing with you for resources. We'll talk more about this when we come to the Slurm class, where we talk about bigger analyses and the SLURM system for making use of compute clusters by reserving or scheduling compute. (You're already doing this with srun.)
What are the commands doing?
Step by step:
- git clone
- mamba and conda commands
- snakemake
- file transfer options
How to reconnect:
- change to the right directory
- activate the relevant conda environment
We'll just start doing this regularly in class, so it's worth getting it down to a routine :)
- close down everything (ssh, rstudio, terminal, etc.)
- start terminal & ssh connection
- srun, module load, rstudio server
- (separate window) ssh tunnel
- open RStudio Server in web browser
- conda activate
- change directory
- run snakemake
Command reference:
# allocate a compute node
srun -p high2 --time=3:00:00 --nodes=1 \
--cpus-per-task 1 --mem 5GB --pty /bin/bash
# load R and RStudio Server modules
module load spack/R/4.1.1
module load rstudio-server/2022.07.1
# run RStudio on node
rserver-farm
Inside RStudio:
# go to right directory/folder
cd
cd 298-compare
# activate the set of software
conda activate lab2
# run snakemake
snakemake -j 1
You're probably in the wrong working directory if:
- `ls` doesn't show you the files you expect to see
- `pwd` doesn't give the right directory
Solution: type `cd ~/`, then `ls`, and look for the right directory; then `cd` into it.
If you try to run snakemake and it says "command not found", you're probably not in the right conda environment.
The conda environment is listed in the first part of your prompt; see:
(base) datalab-02@farm:~$
^^^^
the 'base' here means you're in the conda environment named 'base'. You probably want to be in the conda environment called 'lab2':
conda activate lab2
If you run `module load rstudio-server/2022.07.1` and see `RStudio Server is not allowed on the head node.`, then you forgot to do the `srun` to allocate yourself a compute node!
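The directory and conda checks above can be rolled into a small pre-flight sketch to run before snakemake. It assumes the environment name `lab2` and directory name `298-compare` from the command reference, and it relies on `CONDA_DEFAULT_ENV`, the variable conda sets to the name of the active environment. Treat it as a starting point rather than an official part of the workflow.

```shell
expected_env="lab2"          # conda environment used in this class
expected_dir="298-compare"   # workflow directory used in this class
problems=0

# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
if [ "${CONDA_DEFAULT_ENV:-none}" != "$expected_env" ]; then
    echo "conda environment is '${CONDA_DEFAULT_ENV:-none}'; try: conda activate $expected_env"
    problems=$((problems + 1))
fi

# Compare the last component of the working directory to the expected name.
if [ "$(basename "$(pwd)")" != "$expected_dir" ]; then
    echo "working directory is $(pwd); try: cd ~/$expected_dir"
    problems=$((problems + 1))
fi

# command -v succeeds only if the shell can find the program.
if ! command -v snakemake >/dev/null 2>&1; then
    echo "snakemake is not on your PATH in this shell"
    problems=$((problems + 1))
fi

if [ "$problems" -eq 0 ]; then
    echo "ready to run: snakemake -j 1"
fi
```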
See rendered notebook.
Reminder, here is our sourmash workflow.
sourmash details to discuss from workflow -
- `sourmash sketch dna` - extracts k-mers
- `sourmash compare` - loads all k-mers, compares
- `sourmash plot` - plots relationship
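To give a feel for what "extract k-mers, then compare" means, here is a toy sketch of Jaccard similarity: the number of k-mers two sequences share, divided by the number of distinct k-mers in either. This is not how sourmash is implemented (it works with hashed, subsampled sketches of much longer k-mers); the two six-letter sequences, `k=3`, and the helper function `kmers` are all made up for illustration.

```shell
export LC_ALL=C          # make sort and comm agree on ordering
work="$(mktemp -d)"
k=3                      # toy k-mer size, far smaller than real use

# Hypothetical helper: print every distinct k-mer in a sequence, sorted.
kmers() {
    seq="$1"
    len=${#seq}
    start=1
    while [ $((start + k - 1)) -le "$len" ]; do
        printf '%s\n' "$seq" | cut -c "${start}-$((start + k - 1))"
        start=$((start + 1))
    done | sort -u
}

kmers "ATGGAC" > "$work/a.txt"   # ATG TGG GGA GAC
kmers "ATGGTC" > "$work/b.txt"   # ATG TGG GGT GTC

# comm -12 keeps only lines present in both sorted files (the intersection).
shared=$(comm -12 "$work/a.txt" "$work/b.txt" | wc -l)
# sort -u over both files gives the union.
total=$(sort -u "$work/a.txt" "$work/b.txt" | wc -l)
shared=$((shared))   # strip any padding wc adds
total=$((total))

echo "shared k-mers: $shared, total distinct k-mers: $total"
awk -v s="$shared" -v t="$total" 'BEGIN { printf "Jaccard similarity: %.3f\n", s / t }'
```

Here the sequences share 2 of 6 distinct 3-mers. Identical sequences would score 1, and sequences with no k-mers in common would score 0.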
sourmash uses a "subcommand" format for many things, e.g. `sketch`.
- options - long and short
- how to find out more:
  - documentation
  - `-h`/`--help`
- Ctrl-C to kill the RStudio Server process
- `exit` to exit the compute node srun shell