job_manager: Add htcondorVC3 job manager #251
base: maint-0.6
Changes from 74 commits
Dockerfile:
@@ -7,9 +7,11 @@
 FROM python:3.6-slim

 ENV TERM=xterm

 RUN apt-get update && \
     apt-get install -y vim-tiny && \
-    pip install --upgrade pip
+    pip install --upgrade pip && \
+    pip install htcondor==8.9.6 retrying

 RUN export DEBIAN_FRONTEND=noninteractive ;\
     apt-get -yq install krb5-user \
@@ -55,6 +57,7 @@ RUN update-ca-certificates

 COPY CHANGES.rst README.rst setup.py /code/
 COPY reana_job_controller/version.py /code/reana_job_controller/
+COPY reana_job_controller/htcondor_submit.py /code/htcondor_submit.py
 WORKDIR /code
 RUN pip install requirements-builder && \
     requirements-builder -l pypi setup.py | pip install -r /dev/stdin && \

@@ -77,5 +80,6 @@ EXPOSE 5000

 ENV COMPUTE_BACKENDS $COMPUTE_BACKENDS
 ENV FLASK_APP reana_job_controller/app.py
+ENV REANA_LOG_LEVEL DEBUG
Review comment: It should not be necessary to hard-code REANA_LOG_LEVEL here.

Review comment: Just to add a bit more information on this: the way of passing this configuration is a bit cumbersome right now, as one would have to go inside RWC, read the config value and pass it as an environment variable to RJC. In reanahub/reana#277 there is a description of the current process and possible solutions we could take to centralise these configuration variables.

Reply: Yes, this should have been taken out; I missed it when cleaning things up.

 CMD ["flask", "run", "-h", "0.0.0.0"]
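As a rough sketch of the alternative raised in the review thread above (this is not code from the PR; the INFO fallback below is an assumption), the job controller could read the log level from its environment instead of having it baked into the image:

    import logging
    import os

    # Sketch only: take the log level from the environment (which the
    # workflow controller could set when spawning the job controller)
    # and fall back to INFO if nothing is provided.
    REANA_LOG_LEVEL = os.getenv('REANA_LOG_LEVEL', 'INFO')
    logging.basicConfig(level=getattr(logging, REANA_LOG_LEVEL, logging.INFO))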
New file (bash job wrapper script for the HTCondor VC3 backend):
@@ -0,0 +1,215 @@
#!/bin/bash

# Replicate input files directory structure
# @TODO: This could be executed in +PreCmd as a separate script.
Review comment: We have: […]. It is not clear which files are for what. What about using: […] so that we can easily support several backends and distinguish between them? This would mean that we'd have to move quite a few CERN-specific files from […]. (An alternative would be to use the same directory but a strict "cern" and "vc" naming prefix everywhere.)

Reply: I like creating different subdirectories in etc: […]. As you said, handling several different backends would be easier to manage and visualize than prefixes or suffixes […]. If you're OK with the subdirectories in […]

# Expected arguments from htcondor_job_manager:
# $1: workflow_workspace
# $2: DOCKER_IMG
# $3 -> : cmd

# Defining inputs
DOCKER_IMG=$2
REANA_WORKFLOW_DIR=$1

# Get static version of parrot.
# Note: We depend on curl for this.
# Assumed to be available on HPC worker nodes (might need to transfer a static version otherwise).
get_parrot(){
    curl --retry 5 -o parrot_static_run http://download.virtualclusters.org/builder-files/parrot_static_run_v7.0.11 > /dev/null 2>&1
    if [ -e "parrot_static_run" ]; then
        chmod +x parrot_static_run
    else
        echo "[Error] Could not download parrot" >&2
        exit 210
    fi
}

populate(){
    if [ ! -x "$_CONDOR_SCRATCH_DIR/parrot_static_run" ]; then get_parrot; fi
    mkdir -p "$_CONDOR_SCRATCH_DIR/$REANA_WORKFLOW_DIR"
    local parent="$(dirname $REANA_WORKFLOW_DIR)"
    $_CONDOR_SCRATCH_DIR/parrot_static_run -T 4 cp --no-clobber -r "/chirp/CONDOR/$REANA_WORKFLOW_DIR" "$_CONDOR_SCRATCH_DIR/$parent"
}
find_module(){
    module > /dev/null 2>&1
    if [ $? == 0 ]; then
        return 0
    elif [ -e /etc/profile.d/modules.sh ]; then
        source /etc/profile.d/modules.sh
    fi
    module > /dev/null 2>&1
    return $?
}

# Discover the container technology available.
# Currently searching for: Singularity or Shifter.
# Returns 0: Successful discovery of a container
#         1: Couldn't find a container
find_container(){
    declare -a search_list=("singularity" "shifter")
    declare -a found_list=()
    local default="shifter"
    local cont_found=false

    for cntr in "${search_list[@]}"; do
        cntr_path="$(command -v $cntr)"
        if [[ -x "$cntr_path" ]]   # Checking binaries in path
        then
            if [ "$(basename "$cntr_path")" == "$default" ]; then
                CONTAINER_PATH="$cntr_path"
                return 0
            else
                found_list+=("$cntr_path")
                cont_found=true
            fi
        fi
    done
    # If VC3 didn't automatically load a module (fail-safe)
    if [ "$cont_found" = false ]; then
        for cntr in "${search_list[@]}"; do
            find_module
            module_found=$?
            if [ $module_found == 0 ]; then
                for var in ${search_list[*]}; do
                    module load $var 2>/dev/null
                    var_path="$(command -v $var 2>/dev/null)"
                    if [ "$(basename "$var_path")" == "$default" ]; then
                        CONTAINER_PATH="$var_path"
                        return 0
                    else
                        found_list+=("$var_path")
                        cont_found=true
                    fi
                done
            fi
        done
    fi

    # If default wasn't found but a container was found, use that
    if (( "${#found_list[@]}" >= 1 )); then
        CONTAINER_PATH=${found_list[0]}
        return 0
    else
        return 1   # No containers found
    fi
}

# Set up command line arguments for singularity.
# Fills in the CONTAINER_ENV and CNTR_ARGUMENTS globals used for the run.
setup_singularity(){
    # TODO: Cleanup calling of this function

    # Send cache to $SCRATCH, or to the condor scratch directory otherwise
    if [ -z "$SCRATCH" ]; then
        CONTAINER_ENV="SINGULARITY_CACHEDIR=\"\$_CONDOR_SCRATCH_DIR\""
    else
        CONTAINER_ENV="SINGULARITY_CACHEDIR=\"\$SCRATCH\""
    fi

    CNTR_ARGUMENTS="exec -B ./$REANA_WORKFLOW_DIR:$REANA_WORKFLOW_DIR docker://$DOCKER_IMG"
}

# Set up shifter. Pull the docker_img into the shifter image gateway
# and dump the required arguments to stdout to be collected by the caller.
setup_shifter(){
    # TODO: Cleanup calling of this function
    # Check for shifterimg
    if [[ ! $(command -v shifterimg 2>/dev/null) ]]; then
        echo "Error: shifterimg not found..." >&2
        exit 127
    fi

    # Attempt to pull image into image-gateway
    if ! shifterimg pull "$DOCKER_IMG" >/dev/null 2>&1; then
        echo "Error: Could not pull img: $DOCKER_IMG" >&2
        exit 127
    fi

    # Put arguments into stdout to collect.
    echo "--image=docker:${DOCKER_IMG} --volume=$(pwd -P)/reana:/reana -- "
}

# Set up the arguments to pass to a container technology.
# Currently able to set up: Singularity and Shifter.
# Creates the command line arguments for the container and pulls the image if needed (shifter).
# The global CNTR_ARGUMENTS is used as the arguments to the container.
setup_container(){
    # Need to cleanup to make more automated,
    # i.e. run through the same list as in find_container
    local container=$(basename "$CONTAINER_PATH")

    if [ "$container" == "singularity" ]; then
        setup_singularity
    elif [ "$container" == "shifter" ]; then
        CNTR_ARGUMENTS=$(setup_shifter)
    else
        echo "Error: Unrecognized container: $(basename $CONTAINER_PATH)" >&2
        exit 127
    fi
}

######## Setup environment #############
# @TODO: This should be done in a prologue
# in condor via +PreCmd, eventually.
#########################################

find_container
if [ $? != 0 ]; then
    echo "[Error]: Container technology could not be found in the system." >&2
    exit 127
fi
populate
setup_container

######## Execution ##########
# Note: Double-quoted arguments get broken up and passed as multiple
# arguments in bash for some reason; we work around that by dumping
# the command into a temporary wrapper file named tmpjob.
tmpjob=$(mktemp -p .)
chmod +x $tmpjob
if command -v aprun; then
    echo -n "aprun -b -n 1 -- " > $tmpjob
fi

echo "$CONTAINER_ENV" "$CONTAINER_PATH" "$CNTR_ARGUMENTS" "${@:3} " >> $tmpjob
bash $tmpjob
res=$?
rm $tmpjob

if [ $res != 0 ]; then
    echo "[Error] Execution failed with error code: $res" >&2
    exit $res
fi

###### Stageout ###########
# TODO: This should be done in an epilogue
# via +PostCmd, eventually.
# Not implemented yet.
# Read files from $reana_workflow_outputs
# and write them into $REANA_WORKFLOW_DIR.
# Stage out depending on the protocol.
# E.g.:
#   - file: will be transferred via condor_chirp
#   - xrootd://<redirector:port>//store/user/path:file: will be transferred via XRootD
# Only chirp transfer supported for now.
# Use vc3-builder to get a static version
# of parrot (eventually, a static version
# of the chirp client only).
if [ "x$REANA_WORKFLOW_DIR" == "x" ]; then
    echo "[Info]: Nothing to stage out"
    exit $res
fi

parent="$(dirname $REANA_WORKFLOW_DIR)"
# TODO: Check for parrot exit code and propagate it in case of errors.
./parrot_static_run -T 4 cp --no-clobber -r "$_CONDOR_SCRATCH_DIR/$REANA_WORKFLOW_DIR" "/chirp/CONDOR/$parent"

exit $res
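For reference, the wrapper's calling convention documented at the top of the script could be exercised from the submitting side roughly as follows; this is an illustrative sketch only, and the helper name and example values are invented, not taken from the PR:

    # Illustrative sketch: the wrapper expects $1 = workflow workspace,
    # $2 = Docker image, and everything from $3 onwards as the job command.
    def build_wrapper_args(workflow_workspace, docker_img, cmd):
        return [workflow_workspace, docker_img] + list(cmd)

    # Example with made-up values:
    args = build_wrapper_args('/var/reana/some-workspace', 'busybox:latest',
                              ['sh', '-c', 'echo hello'])
    print(' '.join(args))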
Job controller configuration module:
@@ -16,29 +16,32 @@
     HTCondorJobManagerCERN
 from reana_job_controller.job_monitor import (JobMonitorHTCondorCERN,
                                               JobMonitorKubernetes,
-                                              JobMonitorSlurmCERN)
+                                              JobMonitorSlurmCERN,
+                                              JobMonitorHTCondorVC3)
 from reana_job_controller.kubernetes_job_manager import KubernetesJobManager
 from reana_job_controller.slurmcern_job_manager import SlurmJobManagerCERN

-SHARED_VOLUME_PATH_ROOT = os.getenv('SHARED_VOLUME_PATH_ROOT', '/var/reana')
-"""Root path of the shared volume ."""
+from reana_job_controller.htcondorvc3_job_manager import \
+    HTCondorJobManagerVC3
+from reana_job_controller.variables import (MAX_JOB_RESTARTS,
+                                            SHARED_VOLUME_PATH_ROOT)

 COMPUTE_BACKENDS = {
     'kubernetes': KubernetesJobManager,
     'htcondorcern': HTCondorJobManagerCERN,
-    'slurmcern': SlurmJobManagerCERN
+    'slurmcern': SlurmJobManagerCERN,
+    'htcondorvc3': HTCondorJobManagerVC3
 }
 """Supported job compute backends and corresponding management class."""

 JOB_MONITORS = {
     'kubernetes': JobMonitorKubernetes,
     'htcondorcern': JobMonitorHTCondorCERN,
     'slurmcern': JobMonitorSlurmCERN,
+    'htcondorvc3': JobMonitorHTCondorVC3
 }
 """Classes responsible for monitoring specific backend jobs"""


-DEFAULT_COMPUTE_BACKEND = 'kubernetes'
+DEFAULT_COMPUTE_BACKEND = 'htcondorvc3'
 """Default job compute backend."""

 JOB_HOSTPATH_MOUNTS = []
@@ -66,8 +69,13 @@
 ``/usr/local/share/mydata`` in the host machine.
 """

-SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS',
-                                       DEFAULT_COMPUTE_BACKEND).split(",")
+# How is this set in the environment?
+# It is hardcoded in Dockerfile and there is no code
+# in workflow-controller to override that for the job-controller.
+# SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS',
+#                                        DEFAULT_COMPUTE_BACKEND).split(",")
Review comment: Very good point for improvement 👍. The way this is set is a bit shady right now, as one could pass it in REANA-Workflow-Controller with the process described in https://github.com/reanahub/reana-job-controller/pull/251/files#r416428666, but actually this is set at build time, because that knowledge is used to selectively install the correct packages. So when we build images for […]:

    BUILD_ARGUMENTS="COMPUTE_BACKENDS=htcondorcern,slurmcern,kubernetes" BUILD_TYPE=release make build

What is happening behind the scenes is that we pass these build args to REANA-Job-Controller […]. Side note, as it might be of interest here: then we take this image and we name it […].

Reply: Thanks for the note Diego, it's very helpful! We'll be sure to follow that from now on. Perhaps we've been "cheating" while building our images; we have simply been building them straight from the […]
+SUPPORTED_COMPUTE_BACKENDS = DEFAULT_COMPUTE_BACKEND.split(",")

 """List of supported compute backends provided as docker build arg."""

 KRB5_CONTAINER_IMAGE = os.getenv('KRB5_CONTAINER_IMAGE',
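To illustrate how the values above are meant to be consumed (an illustration only, not code from this PR; the function name and error handling are assumptions), a backend name such as 'htcondorvc3' would be validated against SUPPORTED_COMPUTE_BACKENDS and then resolved to its manager class through the COMPUTE_BACKENDS mapping:

    # Illustrative sketch, building on the definitions shown above.
    def resolve_job_manager(backend=DEFAULT_COMPUTE_BACKEND):
        if backend not in SUPPORTED_COMPUTE_BACKENDS:
            raise ValueError('Unsupported compute backend: {0} '
                             '(enabled backends: {1})'
                             .format(backend, SUPPORTED_COMPUTE_BACKENDS))
        return COMPUTE_BACKENDS[backend]

    manager_class = resolve_job_manager('htcondorvc3')  # -> HTCondorJobManagerVC3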
Review comment (on the Dockerfile change adding "pip install htcondor==8.9.6 retrying"): Do we need to have these libraries installed at this specific point in the Dockerfile, maybe because of the deb packages? Otherwise, we could just continue setting them in setup.py, so they will be installed at this point and later available in the code.

Reply: You're right, these should be handled in setup.py; I'll clean these up.
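As a sketch of the agreed clean-up (assumed, not part of this diff), the two Python dependencies added to the Dockerfile could instead be declared in setup.py, for example:

    # Sketch only: declare the new dependencies in setup.py instead of
    # pip-installing them in the Dockerfile. Version pin copied from the
    # Dockerfile change above; all other setup() arguments are omitted here.
    from setuptools import setup

    setup(
        name='reana-job-controller',
        install_requires=[
            'htcondor==8.9.6',
            'retrying',
        ],
    )

Since the set of backends is chosen at image build time, these would in practice more likely live behind a backend-specific extras_require entry than in the flat install_requires list shown here.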