Commit e7067dd: Merge branch 'tickets/DM-44497'
leeskelvin committed Feb 6, 2025 (2 parents: d2c2dfd + cb1d5e0)
Showing 1 changed file (team/drp.rst) with 53 additions and 47 deletions.

Obtaining Accounts
------------------

Accounts are issued on demand at the request of an appropriate PI.
For our group, that means you should speak to either Robert or Yusra, and they will arrange one for you.
When your account has been created, check that you are a member of the groups ``astro``, ``hsc``, ``lsst``, and ``rubin`` (use the :command:`groups` command).
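
For example, to check your current membership:

.. code-block:: shell

   # Print the groups your account belongs to; the list should include
   # astro, hsc, lsst, and rubin (order may vary).
   groups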

.. note::

   The ``lsst`` group is a shared group that allows all Tiger3 users to access the shared stack.
   Being a member of this group does not provide access to shared Rubin data.
   Instead, the ``rubin`` group is used to control access to Rubin data repositories.
   On the previous Tiger2 cluster, the ``hsc`` group served both purposes.
   If you find that you need to be made a member of any of these groups, please contact Robert, Yusra, or Lee.

.. _drp-princeton-available-systems:


The ``/project`` filesystems are NFS-mounted on the Princeton clusters.
As a consequence, the performance of these filesystems will be limited by the network speed between our head node and the filesystem.
For anything beyond the most basic testing, it is therefore strongly recommended that batch processing take place in your ``/scratch/gpfs/RUBIN/user/${USER}`` space (see :ref:`drp-princeton-cluster-usage`).

.. _drp-princeton-shared-stack:

To initialize the stack in your shell, run:

.. code-block:: shell

   source /scratch/gpfs/LSST/stack/loadLSST.sh
   setup lsst_distrib

By default, the most recent Rubin Environment will be used, as provided by the ``LSST_CONDA_ENV_NAME`` variable within the ``loadLSST.sh`` script.
If you wish to use a different Rubin Environment, set the ``LSST_CONDA_ENV_NAME`` variable to the desired version before setting up the Science Pipelines:

.. code-block:: shell

   export LSST_CONDA_ENV_NAME="lsst-scipipe-9.0.0"
   source /scratch/gpfs/LSST/stack/loadLSST.sh
   setup lsst_distrib -t <old_version_tag>

   # To reset to the default, unset the variable before sourcing the script:
   # unset LSST_CONDA_ENV_NAME

A list of all currently installed Rubin Environments can be found by running: ``mamba env list``.
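
For example, to see only the Science Pipelines environments (assuming they follow the ``lsst-scipipe-*`` naming used above):

.. code-block:: shell

   # List all installed conda environments and keep the Rubin ones.
   mamba env list | grep lsst-scipipe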

.. note::

   The current default shared stack, described above, is a symbolic link to the latest build using the post-:jira:`RFC-584` Conda environment.
   Older builds, if any, are available in ``/scratch/gpfs/LSST/stacks`` with the syntax ``stack_YYYYMMDD``.

.. _drp-princeton-repositories:

Repositories
------------

We currently maintain a single data repository for general use on the Princeton clusters:

- ``/scratch/gpfs/RUBIN/repo/main``: The primary HSC/LSST butler data repository, containing raw HSC RC2 data.

For information on accessing repositories, including setting up required permissions, see the top-level ``/scratch/gpfs/RUBIN/repo/README.md`` file.

.. note::

   You will not be able to access the data within these repositories without first following the **Database Authentication** instructions in the above ``README.md`` file.

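Once database authentication is set up, a quick way to confirm that you can read the repository is to query it with the butler command line tool (the collection glob below is illustrative):

.. code-block:: shell

   # List matching collections in the shared repository as a basic
   # connectivity check; the glob pattern is illustrative.
   butler query-collections /scratch/gpfs/RUBIN/repo/main "HSC/RC2/*"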

.. _drp-princeton-storage:

Storage
-------

This space may also be used to store your results.
Note however that space is at a premium; please clean up any data you are not actively using.
Also, be sure to set :command:`umask 002` so that your colleagues can reorganize the shared space.
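
For example, this can be added to your shell startup file (a sketch; adapt to your shell):

.. code-block:: shell

   # Ensure files created in the shared space are group-writable.
   umask 002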

For long-term storage of user data, shared space is available in :file:`/projects/HSC/users/<YourNetID>` (you may need to make this directory yourself).
This space is backed up, but it is **not** visible to the compute nodes.

For temporary data processing storage, shared space is available in :file:`/scratch/gpfs/RUBIN/user/<YourNetID>` (you may need to make this directory yourself).
This General Parallel File System (GPFS) space is large and visible from all Princeton clusters; however, it is **not** backed up.
More information on `Princeton cluster data storage <https://researchcomputing.princeton.edu/support/knowledge-base/data-storage>`_ can be found online.
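
If your personal directory does not yet exist, it can be created as follows (assuming your NetID matches ``$USER`` on the cluster):

.. code-block:: shell

   # Create your personal scratch space for batch processing.
   mkdir -p /scratch/gpfs/RUBIN/user/${USER}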

Space is also available in your home directory, but note that it is not shared across clusters.

Use the :command:`checkquota` command to check your current storage and your storage limits.
More information on storage limits, including how to request a quota increase, can be found at `this link <https://researchcomputing.princeton.edu/support/knowledge-base/checkquota>`_.

.. _drp-princeton-cluster-usage:

Cluster Usage
-------------

Jobs are managed on cluster systems using `SLURM <https://slurm.schedmd.com>`_.
Batch processing functionality with the Science Pipelines is provided by the `LSST Batch Processing Service (BPS) <https://pipelines.lsst.io/modules/lsst.ctrl.bps>`_ module.
BPS on the Princeton clusters is configured to work with the `ctrl_bps_parsl plugin <https://github.com/lsst/ctrl_bps_parsl>`_, which uses the `Parsl <https://parsl-project.org>`_ workflow engine to submit jobs to SLURM.

To submit a job to the cluster, you will first need to create a YAML configuration file for BPS.
For convenience, two generic configuration files have been constructed on disk at ``/scratch/gpfs/RUBIN/bps/bps_tiger.yaml`` and ``/scratch/gpfs/RUBIN/bps/bps_tiger_clustering.yaml``.
The former is intended for general use, while the latter is intended for use with quantum clustering enabled.
These files may either be used directly when submitting a job or copied to your working directory and modified as needed.
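
For example, to work from a local copy of the generic configuration (the file name ``my_bps_config.yaml`` is illustrative):

.. code-block:: shell

   cd /scratch/gpfs/RUBIN/user/${USER}
   cp /scratch/gpfs/RUBIN/bps/bps_tiger.yaml my_bps_config.yaml
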
The following example shows how to submit a job using the generic configuration file:

.. code-block:: shell

   export NUMEXPR_MAX_THREADS=1

   # All submissions must be made from your /scratch/gpfs directory.
   cd /scratch/gpfs/RUBIN/user/${USER}

   # Save the output of the BPS submit command to a log file
   # (optional, but recommended).
   LOGFILE=$(realpath bps_log.txt)

   # Submit a job to the cluster.
   date | tee $LOGFILE; \
   $(which time) -f "Total runtime: %E" \
   bps submit /scratch/gpfs/RUBIN/bps/bps_tiger.yaml \
       --compute-site tiger_1n_112c_1h \
       -b /scratch/gpfs/RUBIN/repo/main \
       -i HSC/RC2/defaults \
       -o u/${USER}/scratch/bps_test \
       -p $DRP_PIPE_DIR/pipelines/HSC/DRP-RC2.yaml#step1 \
       -d "instrument='HSC' AND visit=1228" \
       2>&1 | tee -a $LOGFILE; \
   # --extra-qgraph-options "-c isr:doOverscan=False"

A number of different compute sites are available for use with BPS as defined in the generic configuration file.
Select a compute site using the syntax ``tiger_${NODES}n_${CORES}c_${TIME}h``, replacing the variables with the appropriate number of nodes, cores, and hours.
You can check the available compute sites defined in the generic configuration file using: ``grep "tiger_" /scratch/gpfs/RUBIN/bps/bps_tiger.yaml``.
The following table lists the available compute site dimensions and their associated options:

.. list-table::
   :header-rows: 1

   * - Dimension
     - Options
   * - Nodes
     - 1, 10
   * - Cores per Node
     - 1, 28, 112
   * - Walltime (Hours)
     - 1, 5, 24, 72
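
Compute-site names are assembled from these dimensions, for example ``tiger_1n_112c_1h`` or ``tiger_10n_112c_24h`` (illustrative; not every combination is necessarily defined). To confirm which sites are available:

.. code-block:: shell

   # Print the compute-site definitions in the generic BPS configuration.
   grep "tiger_" /scratch/gpfs/RUBIN/bps/bps_tiger.yaml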

A list of all available nodes can be obtained using the :command:`snodes` command, or alternatively using :command:`sinfo`, for example (the options shown below are illustrative):
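
.. code-block:: shell

   # Either command lists the cluster nodes; snodes is a Princeton
   # Research Computing wrapper, and the sinfo flags are illustrative.
   snodes
   sinfo -N -l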

To get an estimate of the start time for any submitted jobs, the :command:`squeue` command may be used:

.. code-block:: shell

   squeue -u ${USER} --start

To show detailed information about a given node, the :command:`scontrol` command may be used:

.. code-block:: shell

   scontrol show node <node_name>

It is occasionally useful to be able to log in directly to an interactive shell on a compute node.
Something like the following should work (the exact resource request is illustrative and can be adjusted):

.. code-block:: shell
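
   # Request an interactive session on a compute node via SLURM;
   # adjust the node, task, and time request to suit your needs.
   salloc --nodes=1 --ntasks=1 --time=1:00:00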

Access to all of the Princeton clusters is only available from within the Princeton network.
If you are connecting from the outside, you will need to bounce through another host on campus first.
Options include:

- Jumping through the Research Computing ``tigressgateway`` host;
- Bouncing your connection through a `host on the Peyton network <http://www.astro.princeton.edu/docs/Hardware>`_ (this is usually the easiest way to go);
- Making use of the `University's VPN service <https://www.net.princeton.edu/vpn/>`_.

If you choose the first or second option, you may find the ``ProxyCommand`` or ``ProxyJump`` options to SSH helpful.
For example, adding the following to :file:`~/.ssh/config` will automatically route your connection to the right place when you run :command:`ssh tiger`::

    Host tiger
        HostName tiger3.princeton.edu
        ProxyCommand ssh coma.astro.princeton.edu -W %h:%p

The following SSH configuration allows access via the Research Computing gateway::

    Host tiger* tigressdata*
        ProxyCommand ssh -q -W %h:%p tigressgateway.princeton.edu
    Host tiger
        Hostname tiger3.princeton.edu

or alternatively::

    Host tigressgateway
        HostName tigressgateway.princeton.edu
    Host tiger
        Hostname tiger3.princeton.edu
        ProxyJump tigressgateway

(It may also be necessary to add a ``User`` line under ``Host tigressgateway`` if there is a mismatch between your local and Princeton usernames.)
Entry to ``tigressgateway`` requires `2FA <https://www.princeton.edu/duoportal>`_;
we recommend using the ``ControlMaster`` feature of SSH to persist connections, e.g.::

    ControlMaster auto
    ControlPath ~/.ssh/cm/%r@%h:%p
    ControlPersist 5m

(It may be necessary to create the directory ``~/.ssh/cm``.)
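
For example:

.. code-block:: shell

   # Create the ControlMaster socket directory with restrictive permissions.
   mkdir -p ~/.ssh/cm && chmod 700 ~/.ssh/cm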

See also the `Peyton Hall tips on using SSH <http://www.astro.princeton.edu/docs/SSH>`_.

.. _drp-princeton-help-support:
