job_manager: Add htcondorVC3 job manager #251

Open
wants to merge 89 commits into
base: maint-0.6
Changes from 74 commits
362d8ed
Add htcondor python bindings
khurtado Feb 19, 2019
5530ddb
Update Dockerfile
khurtado Feb 19, 2019
37f3cb9
Update Dockerfile
khurtado Feb 19, 2019
0385162
Update Dockerfile
khurtado Feb 19, 2019
7fc09fe
Test htcondor python bindings
khurtado Feb 19, 2019
50d4c8f
Add htcondor
khurtado Feb 19, 2019
9844e83
Set /tmp as the initialdir
khurtado Feb 20, 2019
8e363be
Initial changes for HTcondor job backend.
CodyKank Feb 21, 2019
94c61e9
Merge pull request #1 from CodyKank/master
khurtado Feb 27, 2019
9243471
Merge branch 'master' of https://github.com/reanahub/reana-job-contro…
CodyKank Mar 6, 2019
031417c
Merge pull request #2 from CodyKank/master
khurtado Mar 6, 2019
af37025
Execute condor submit as a user.
khurtado Mar 13, 2019
21ba4c3
Add first version of a singularity job wrapper.
khurtado Mar 14, 2019
e1646e9
Job wrapper changes.
khurtado Mar 26, 2019
889c952
Job wrapper improvements.
khurtado Mar 28, 2019
66da263
Add limited support for transferring output files via condor_chirp.
khurtado Apr 4, 2019
71d2ad3
Merge branch 'master' of https://github.com/reanahub/reana-job-contro…
CodyKank Apr 8, 2019
8238cab
Add htcondor_job_manger.py
CodyKank Apr 9, 2019
53d4af7
Adjust config and rest for htcondor
CodyKank Apr 9, 2019
7019dbb
Merge pull request #3 from CodyKank/cody_dev
khurtado Apr 10, 2019
448b2a4
Some fixes for the condor job manager.
khurtado Apr 10, 2019
38a5b65
- Transfer input/output files via parrot
khurtado Apr 17, 2019
40bd56e
First implementation to search for shifter
CodyKank Apr 18, 2019
e4991c7
Add shifter to module search list in job_wrapper
CodyKank Apr 22, 2019
3951c04
Remove bash associative arrays.
CodyKank Apr 22, 2019
8ffa28c
Merge pull request #4 from CodyKank/job_manager
khurtado Apr 22, 2019
b6af97d
Updates on finding module utility
khurtado Apr 22, 2019
9bc053f
Merge branch 'master' of https://github.com/reanahub/reana-job-contro…
Apr 24, 2019
70cd180
Merge branch 'reanahub-master' into job_manager
Apr 24, 2019
d783329
Fix schedd dependency.
Apr 24, 2019
0bea8d0
Fix typo when querying list of modules.
Apr 24, 2019
8540014
Send singularity cache dir to scratch areas.
khurtado Apr 24, 2019
dd08a86
Merge branch 'job_manager' of https://github.com/khurtado/reana-job-c…
khurtado Apr 24, 2019
556f32a
Add execution steps for shifter
CodyKank Apr 25, 2019
5e0a351
Merge pull request #6 from khurtado/job_manager
khurtado May 13, 2019
11b1c92
Change image pull method for shifter
CodyKank May 14, 2019
3f9b215
Remove debugging messages
CodyKank May 14, 2019
e5926c9
Rework directory binding and chdir for containers
CodyKank May 15, 2019
02f77a2
Modify container search methods
CodyKank May 30, 2019
e1db2d2
Cleanup comments. Move check for modules
CodyKank Jun 3, 2019
5b9e5f5
Reformated singularity and Shifter executions:job_wrapper.sh
CodyKank Jun 4, 2019
e0c70bf
Comment cleanup
CodyKank Jun 4, 2019
a34e8c5
Merge pull request #7 from CodyKank/job_manager
khurtado Jun 4, 2019
f5a69d8
Merge pull request #1 from scailfin/job_manager
CodyKank Jun 19, 2019
2e2ed50
Added check for aprun on cray / BW systems
CodyKank Jun 19, 2019
7d25d63
Changing parrot timeout
CodyKank Jun 20, 2019
b40d1ca
Merge remote-tracking branch 'reana/master' into code_merge
CodyKank Aug 19, 2019
594ac68
rename vc3 job manager
CodyKank Aug 20, 2019
46157f9
Rename vc3 RJC class name
CodyKank Aug 20, 2019
9ebfb15
Change SHARED_PATH_ROOT to obtain from flask app
CodyKank Aug 27, 2019
5614c24
Merge pull request #11 from scailfin/code_merge
khurtado Aug 29, 2019
1a11c04
Do not add a vc3user
khurtado Aug 29, 2019
c787230
Use WORKFLOW_RUNTIME variables.
khurtado Aug 29, 2019
1018720
Move some vars out of config
khurtado Aug 30, 2019
54611b9
Fix marshmallow version
khurtado Aug 30, 2019
a046739
Monitoring and VC3 variable changes.
khurtado Aug 30, 2019
341a500
Bug fix
khurtado Sep 4, 2019
c544033
Merge branch 'master' of https://github.com/reanahub/reana-job-contro…
khurtado Sep 6, 2019
0cce571
Fix VC3 support.
khurtado Sep 6, 2019
dd61407
Merge tag 'v0.6.0' into vc3v2p2
Jan 28, 2020
87ec2c2
Fix typo in config
Jan 29, 2020
e56d983
Change init parameters in VC3 job manager
Jan 29, 2020
1f68d9f
Revert back to HTCondor 8.9.1
Jan 29, 2020
8b2cfd0
Update VC3 watch monitor method.
Jan 31, 2020
fb893cd
Debug mode
Jan 31, 2020
1643750
Add more debugging lines
Jan 31, 2020
039c5e2
Sorting history results
Feb 1, 2020
da97df0
Working around kubernetes version issues
Apr 6, 2020
70bfbbc
Remove HTCONDOR address line from dockerfile
CodyKank Apr 13, 2020
5afab13
Change HTCondor version to 8.9.6
CodyKank Apr 13, 2020
0549154
Bump htcondor in setup.py
CodyKank Apr 14, 2020
189814b
Remove outdated files.
CodyKank Apr 14, 2020
49cb367
Correct function documentation from run_tests.sh
CodyKank Apr 27, 2020
0e76609
Add Kenyi and Cody to AUTHORS.rst.
CodyKank May 11, 2020
004d064
Clean up old variables no longer needed for VC3 implementation.
khurtado May 11, 2020
c5ef635
Move htcondorvc3's job_wrapper.
CodyKank May 22, 2020
48896bc
Remove extra htcondor installation in dockerfile.
CodyKank May 22, 2020
6c41e90
Remove comments from config.py. Reinstate k8s as default compute back…
CodyKank May 26, 2020
643d027
Update htcondorvc3_job_manager.py
khurtado May 13, 2021
1eed46c
Update Dockerfile
khurtado May 14, 2021
c10886f
Update job_wrapper.sh
khurtado May 14, 2021
a17a4b3
Fix typo in Dockerfile
May 14, 2021
3c23c31
Update Dockerfile
khurtado May 14, 2021
a651465
Update config.py
khurtado May 14, 2021
9f524b4
Update config.py using vc3 backend as default
khurtado May 14, 2021
b8684b5
Workaround to define input files on madminer workflow, rather than co…
Feb 8, 2022
37b865b
Fix bug while parsing command arguments
khurtado Feb 8, 2022
324425b
Strip double quotes when parsing arguments
khurtado Feb 8, 2022
1010e51
Update job_wrapper.sh
khurtado Oct 13, 2022
2 changes: 2 additions & 0 deletions AUTHORS.rst
@@ -6,7 +6,9 @@ The list of contributors in alphabetical order:
- `Anton Khodak <https://orcid.org/0000-0003-3263-4553>`_
- `Cody Kankel <https://github.com/CodyKank>`_
- `Diego Rodriguez <https://orcid.org/0000-0003-0649-2002>`_
- `Dinos Kousidis <https://orcid.org/0000-0002-4914-4289>`_
- `Jan Okraska <https://orcid.org/0000-0002-1416-3244>`_
- `Kenyi Hurtado-Anampa <https://orcid.org/0000-0002-9779-3566>`_
- `Rokas Maciulaitis <https://orcid.org/0000-0003-1064-6967>`_
- `Sinclert Perez <https://www.linkedin.com/in/sinclert>`_
- `Tibor Simko <https://orcid.org/0000-0001-7202-5803>`_
6 changes: 5 additions & 1 deletion Dockerfile
@@ -7,9 +7,11 @@
FROM python:3.6-slim

ENV TERM=xterm

RUN apt-get update && \
apt-get install -y vim-tiny && \
-pip install --upgrade pip
+pip install --upgrade pip && \
+pip install htcondor==8.9.6 retrying
Member:
Do we need to have these libraries installed at this specific point in the Dockerfile, maybe because of the deb packages? Otherwise, we could just continue setting them in setup.py so they will be installed at this point, and later available in the code.

Author:
You're right, these should be handled in setup.py, I'll clean these up.
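
A minimal sketch of what handling these pins in setup.py could look like. This is an assumption for illustration, not the repository's actual setup.py: `HTCONDOR_REQUIRES` and `build_install_requires` are hypothetical names, and the pins are copied from the Dockerfile line under review.

```python
# Hypothetical helper for setup.py; not taken from the REANA code base.

# Pins copied from the `pip install` line in the Dockerfile diff.
HTCONDOR_REQUIRES = ['htcondor==8.9.6', 'retrying']


def build_install_requires(base_requires, compute_backends):
    """Return base requirements plus HTCondor pins when an HTCondor backend is enabled."""
    requires = list(base_requires)
    if any(name.startswith('htcondor') for name in compute_backends):
        requires.extend(HTCONDOR_REQUIRES)
    return requires


if __name__ == '__main__':
    # 'Flask' stands in for the package's real base requirements.
    print(build_install_requires(['Flask'], ['kubernetes', 'htcondorvc3']))
```

The resulting list could then be passed to `install_requires` in `setup()`, so the existing `requirements-builder` step in the Dockerfile would pick the pins up without a separate `pip install` line.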


RUN export DEBIAN_FRONTEND=noninteractive ;\
apt-get -yq install krb5-user \
@@ -55,6 +57,7 @@ RUN update-ca-certificates

COPY CHANGES.rst README.rst setup.py /code/
COPY reana_job_controller/version.py /code/reana_job_controller/
COPY reana_job_controller/htcondor_submit.py /code/htcondor_submit.py
WORKDIR /code
RUN pip install requirements-builder && \
requirements-builder -l pypi setup.py | pip install -r /dev/stdin && \
@@ -77,5 +80,6 @@ EXPOSE 5000

ENV COMPUTE_BACKENDS $COMPUTE_BACKENDS
ENV FLASK_APP reana_job_controller/app.py
ENV REANA_LOG_LEVEL DEBUG
Member:
It should not be necessary to hard-code REANA_LOG_LEVEL in Dockerfile. One could pass it as environment variable...

Member:
Just to add a bit more information on this: the way of passing this configuration is a bit cumbersome right now as one would have to: go inside RWC, read the config value and pass it as an environment variable to RJC. In reanahub/reana#277 there is a description of the current process and possible solutions we could take to centralise these configuration variables.

Author:
Yes, this should have been taken out; I missed it when cleaning things up.
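
As a rough illustration of passing the level through the environment instead of hard-coding it, the application side could resolve `REANA_LOG_LEVEL` at start-up. This is a sketch under assumptions: `resolve_log_level` is a hypothetical helper, and the `INFO` fallback is chosen here for the example, not taken from REANA.

```python
import logging
import os

# Level names accepted by this sketch, mapped to the standard logging constants.
LOG_LEVELS = {
    'DEBUG': logging.DEBUG,
    'INFO': logging.INFO,
    'WARNING': logging.WARNING,
    'ERROR': logging.ERROR,
    'CRITICAL': logging.CRITICAL,
}


def resolve_log_level(environ=None, default='INFO'):
    """Return the numeric log level named by REANA_LOG_LEVEL, or the default."""
    environ = os.environ if environ is None else environ
    name = environ.get('REANA_LOG_LEVEL', default).upper()
    # Unknown names fall back to the default instead of raising.
    return LOG_LEVELS.get(name, LOG_LEVELS[default])


if __name__ == '__main__':
    logging.basicConfig(level=resolve_log_level())
    logging.getLogger(__name__).debug('Only shown when REANA_LOG_LEVEL=DEBUG')
```

With something like this in place, the image would not need an `ENV REANA_LOG_LEVEL` line; the variable can be supplied when the container is started.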


CMD ["flask", "run", "-h", "0.0.0.0"]
215 changes: 215 additions & 0 deletions files/job_wrapper.sh
@@ -0,0 +1,215 @@
#!/bin/bash

# Replicate input files directory structure
# @TODO: This could be executed
Member:
We have:

etc/job_wrapper.sh
files/job_wrapper.sh

It is not clear which files are for what.

What about using:

etc/htcondorcern/job_wrapper.sh
etc/htcondorvc3/job_wrapper.sh

so that we can easily support several backends and distinguish between them?

This would mean that we'd have to move quite a few CERN-specific files from etc/* to etc/htcondorcern/* ourselves... but merging this branch will be quite a lot of work anyway, so we could perhaps do it alongside?

(An alternative would be to use the same directory but strictly "cern" and "vc" naming prefix everywhere.)

Author:
I like creating different subdirectories in etc:

etc/htcondorvc3/job_wrapper.sh
etc/htcondorcern/job_wrapper.sh

As you said, handling several different backends would be easier to manage and visualize than prefixes or suffixes (job_wrapper_cern.sh / cern_job_wrapper.sh).

If you're OK with the subdirectories in etc/, I'd be happy to help move CERN specific files as well as the VC3 files.
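
The per-backend layout discussed in this thread could be resolved in code roughly as follows. This is only a sketch: `job_wrapper_path` is a hypothetical helper, and the `etc/<backend>/job_wrapper.sh` convention is the one proposed above, not something that exists in the repository yet.

```python
import posixpath

# HTCondor backend names as they appear in COMPUTE_BACKENDS in config.py.
HTCONDOR_BACKENDS = ('htcondorcern', 'htcondorvc3')


def job_wrapper_path(backend, root='etc'):
    """Return the wrapper script path for a backend under the proposed layout."""
    if backend not in HTCONDOR_BACKENDS:
        raise ValueError('No job wrapper defined for backend: {}'.format(backend))
    return posixpath.join(root, backend, 'job_wrapper.sh')


if __name__ == '__main__':
    for name in HTCONDOR_BACKENDS:
        print(job_wrapper_path(name))
```

Keeping the backend name as a directory component means a new backend only needs a new subdirectory, with no prefix or suffix scheme to agree on.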

# in +PreCmd as a separate script.

# Expected arguments from htcondor_job_manager:
# $1: workflow_workspace
# $2: DOCKER_IMG
# $3 -> : cmd

# Defining inputs
DOCKER_IMG=$2
REANA_WORKFLOW_DIR=$1

# Get static version of parrot.
# Note: We depend on curl for this.
# Assumed to be available on HPC worker nodes (might need to transfer a static version otherwise).
get_parrot(){
# --fail makes curl return non-zero on HTTP errors instead of saving the error page
if curl --fail --retry 5 -o parrot_static_run http://download.virtualclusters.org/builder-files/parrot_static_run_v7.0.11 > /dev/null 2>&1; then
chmod +x parrot_static_run
else
echo "[Error] Could not download parrot" >&2
exit 210
fi
}

populate(){
if [ ! -x "$_CONDOR_SCRATCH_DIR/parrot_static_run" ]; then get_parrot; fi
mkdir -p "$_CONDOR_SCRATCH_DIR/$REANA_WORKFLOW_DIR"
local parent="$(dirname "$REANA_WORKFLOW_DIR")"
"$_CONDOR_SCRATCH_DIR/parrot_static_run" -T 4 cp --no-clobber -r "/chirp/CONDOR/$REANA_WORKFLOW_DIR" "$_CONDOR_SCRATCH_DIR/$parent"
}

find_module(){
module > /dev/null 2>&1
if [ $? == 0 ]; then
return 0
elif [ -e /etc/profile.d/modules.sh ]; then
source /etc/profile.d/modules.sh
fi
module > /dev/null 2>&1
return $?
}

# Discover the container technology available.
# Currently searching for: Singularity or Shifter.
# Returns 0: Successful discovery of a container
# 1: Couldn't find a container
find_container(){
declare -a search_list=("singularity" "shifter")
declare -a found_list=()
local default="shifter"
local cont_found=false


for cntr in "${search_list[@]}"; do
cntr_path="$(command -v $cntr)"
if [[ -x "$cntr_path" ]] # Checking binaries in path
then
if [ "$(basename "$cntr_path")" == "$default" ]; then
CONTAINER_PATH="$cntr_path"
return 0
else
found_list+=("$cntr_path")
cont_found=true
fi
fi
done
# If VC3 didn't automatically load a module (fail-safe)
if [ "$cont_found" != true ] && find_module; then
for var in "${search_list[@]}"; do
module load "$var" 2>/dev/null
var_path="$(command -v "$var" 2>/dev/null)"
# Skip technologies whose module did not provide an executable
if [[ ! -x "$var_path" ]]; then
continue
fi
if [ "$(basename "$var_path")" == "$default" ]; then
CONTAINER_PATH="$var_path"
return 0
else
found_list+=("$var_path")
cont_found=true
fi
done
fi

# If default wasn't found but a container was found, use that
if (( "${#found_list[@]}" >= 1 )); then
CONTAINER_PATH=${found_list[0]}
return 0
else
return 1 # No containers found
fi
}

# Set up the command-line arguments for singularity.
# Sets the globals CONTAINER_ENV and CNTR_ARGUMENTS rather than printing to stdout.
setup_singularity(){
# TODO: Cleanup calling of this function

# Send cache to $SCRATCH or to the condor scratch directory
# otherwise
if [ -z "$SCRATCH" ]; then
CONTAINER_ENV="SINGULARITY_CACHEDIR=\"\$_CONDOR_SCRATCH_DIR\""
else
CONTAINER_ENV="SINGULARITY_CACHEDIR=\"\$SCRATCH\""
fi

CNTR_ARGUMENTS="exec -B ./$REANA_WORKFLOW_DIR:$REANA_WORKFLOW_DIR docker://$DOCKER_IMG"

}

# Setting up shifter. Pull the docker_img into the shifter image gateway
# and dump required arguments into stdout to be collected by a function call
setup_shifter(){
#TODO: Cleanup calling of this function
# Check for shifterimg
if [[ ! $(command -v shifterimg 2>/dev/null) ]]; then
echo "Error: shifterimg not found..." >&2
exit 127
fi

# Attempt to pull image into image-gateway
if ! shifterimg pull "$DOCKER_IMG" >/dev/null 2>&1; then
echo "Error: Could not pull img: $DOCKER_IMG" >&2
exit 127
fi

# Put arguments into stdout to collect.
echo "--image=docker:${DOCKER_IMG} --volume=$(pwd -P)/reana:/reana -- "
}

# Setting up the arguments to pass to a container technology.
# Currently able to set up: Singularity and Shifter.
# Creates command-line arguments for containers and pulls the image if needed (shifter).
# The global CNTR_ARGUMENTS holds the arguments passed to the container.
setup_container(){
# Need to cleanup to make more automated.
# i.e. run through the same list in find_container
local container=$(basename "$CONTAINER_PATH")

if [ "$container" == "singularity" ]; then
setup_singularity
elif [ "$container" == "shifter" ]; then
CNTR_ARGUMENTS=$(setup_shifter)
else
echo "Error: Unrecognized container: $(basename $CONTAINER_PATH)" >&2
exit 127
fi
}

######## Setup environment #############
# @TODO: This should be done in a prologue
# in condor via +PreCmd, eventually.
#############################

find_container
if [ $? != 0 ]; then
echo "[Error]: Container technology could not be found in the system." >&2
exit 127
fi
populate
setup_container

######## Execution ##########
# Note: Double-quoted arguments are split
# and passed as multiple arguments
# in bash for some reason; work around
# that by dumping the command to a
# temporary wrapper file named tmpjob.
tmpjob=$(mktemp -p .)
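chmod +x "$tmpjob"
# Silence command -v so the aprun path does not leak into the job's stdout
if command -v aprun > /dev/null 2>&1; then
echo -n "aprun -b -n 1 -- " > "$tmpjob"
fi

echo "$CONTAINER_ENV" "$CONTAINER_PATH" "$CNTR_ARGUMENTS" "${@:3} " >> "$tmpjob"
bash "$tmpjob"
res=$?
rm "$tmpjob"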

if [ $res != 0 ]; then
echo "[Error] Execution failed with error code: $res" >&2
exit $res
fi

###### Stageout ###########
# TODO: This should be done in an epilogue
# via +PostCmd, eventually.
# Not implemented yet.
# Read files from $reana_workflow_outputs
# and write them into $REANA_WORKFLOW_DIR
# Stage out depending on the protocol
# E.g.:
# - file: will be transferred via condor_chirp
# - xrootd://<redirector:port>//store/user/path:file: will be transferred via XRootD
# Only chirp transfer supported for now.
# Use vc3-builder to get a static version
# of parrot (eventually, a static version
# of the chirp client only).
if [ "x$REANA_WORKFLOW_DIR" == "x" ]; then
echo "[Info]: Nothing to stage out"
exit $res
fi

parent="$(dirname "$REANA_WORKFLOW_DIR")"
# TODO: Check for parrot exit code and propagate it in case of errors.
./parrot_static_run -T 4 cp --no-clobber -r "$_CONDOR_SCRATCH_DIR/$REANA_WORKFLOW_DIR" "/chirp/CONDOR/$parent"

exit $res
26 changes: 17 additions & 9 deletions reana_job_controller/config.py
@@ -16,29 +16,32 @@
HTCondorJobManagerCERN
from reana_job_controller.job_monitor import (JobMonitorHTCondorCERN,
JobMonitorKubernetes,
-JobMonitorSlurmCERN)
+JobMonitorSlurmCERN,
+JobMonitorHTCondorVC3)
from reana_job_controller.kubernetes_job_manager import KubernetesJobManager
from reana_job_controller.slurmcern_job_manager import SlurmJobManagerCERN

-SHARED_VOLUME_PATH_ROOT = os.getenv('SHARED_VOLUME_PATH_ROOT', '/var/reana')
-"""Root path of the shared volume ."""
+from reana_job_controller.htcondorvc3_job_manager import \
+HTCondorJobManagerVC3
+from reana_job_controller.variables import (MAX_JOB_RESTARTS,
+SHARED_VOLUME_PATH_ROOT)

COMPUTE_BACKENDS = {
'kubernetes': KubernetesJobManager,
'htcondorcern': HTCondorJobManagerCERN,
-'slurmcern': SlurmJobManagerCERN
+'slurmcern': SlurmJobManagerCERN,
+'htcondorvc3': HTCondorJobManagerVC3
}
"""Supported job compute backends and corresponding management class."""

JOB_MONITORS = {
'kubernetes': JobMonitorKubernetes,
'htcondorcern': JobMonitorHTCondorCERN,
'slurmcern': JobMonitorSlurmCERN,
'htcondorvc3': JobMonitorHTCondorVC3
}
"""Classes responsible for monitoring specific backend jobs"""


-DEFAULT_COMPUTE_BACKEND = 'kubernetes'
+DEFAULT_COMPUTE_BACKEND = 'htcondorvc3'
"""Default job compute backend."""

JOB_HOSTPATH_MOUNTS = []
@@ -66,8 +69,13 @@
``/usr/local/share/mydata`` in the host machine.
"""

-SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS',
-DEFAULT_COMPUTE_BACKEND).split(",")
+# How is this set in the environment?
+# It is hardcoded in Dockerfile and there is no code
+# in workflow-controller to override that for the job-controller.
+#SUPPORTED_COMPUTE_BACKENDS = os.getenv('COMPUTE_BACKENDS',
+# DEFAULT_COMPUTE_BACKEND).split(",")
Member:
Very good point for improvement 👍. The way this is set is a bit shady right now as one could pass it in REANA-Workflow-Controller with the process described in https://github.com/reanahub/reana-job-controller/pull/251/files#r416428666, but actually this is set at build time because that knowledge is used to selectively install the correct packages.

So when we build images for reana.cern.ch we use the Makefile like:

BUILD_ARGUMENTS="COMPUTE_BACKENDS=htcondorcern,slurmcern,kubernetes" BUILD_TYPE=release make build

What is happening behind the scenes is that we pass these build args to REANA-Job-Controller docker build, which results in the final image having this environment variable.

Side note, as it might be of interest here: Then we take this image and we name it reanahub/reana-job-controller-htcondorcern-slurmcern, not very pretty but this is what we have for now :) (see previous discussions here).

Author:
Thanks for the note, Diego, it's very helpful! We'll be sure to follow that from now on. Perhaps we've been 'cheating' while building our images: we have simply been building them straight from the Dockerfile within the RJC repo rather than through the reanahub/reana repo. I suppose this is a bad habit of ours.
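
To make the build-time flow described above concrete, here is a minimal sketch of reading `COMPUTE_BACKENDS` on the Python side and validating it against the backends registered in config.py. `parse_compute_backends` and `KNOWN_BACKENDS` are hypothetical names introduced for this example, and the `kubernetes` fallback is an assumption.

```python
import os

# Backend names as listed in COMPUTE_BACKENDS in config.py.
KNOWN_BACKENDS = ('kubernetes', 'htcondorcern', 'slurmcern', 'htcondorvc3')


def parse_compute_backends(environ=None, default='kubernetes'):
    """Return the backend names from COMPUTE_BACKENDS, falling back to `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get('COMPUTE_BACKENDS') or default
    # Tolerate spaces and empty entries in the comma-separated list.
    names = [name.strip() for name in raw.split(',') if name.strip()]
    unknown = [name for name in names if name not in KNOWN_BACKENDS]
    if unknown:
        raise ValueError('Unknown compute backends: {}'.format(', '.join(unknown)))
    return names


if __name__ == '__main__':
    print(parse_compute_backends())
```

Failing early on an unknown name surfaces a mistyped build argument at start-up rather than at the first job submission.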

SUPPORTED_COMPUTE_BACKENDS = DEFAULT_COMPUTE_BACKEND.split(",")

"""List of supported compute backends provided as docker build arg."""

KRB5_CONTAINER_IMAGE = os.getenv('KRB5_CONTAINER_IMAGE',
2 changes: 1 addition & 1 deletion reana_job_controller/factory.py
@@ -29,7 +29,7 @@ def create_app(JOB_DB=None, config_mapping=None):
     app.config.from_object(config)
     if config_mapping:
         app.config.from_mapping(config_mapping)
-    if 'htcondorcern' in app.config['SUPPORTED_COMPUTE_BACKENDS']:
+    if {'htcondorcern', 'htcondorvc3'} & set(app.config['SUPPORTED_COMPUTE_BACKENDS']):
         app.htcondor_executor = ThreadPoolExecutor(max_workers=1)
     with app.app_context():
         app.config['OPENAPI_SPEC'] = build_openapi_spec()
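
A note on testing for several backend names at once: each name has to be checked for membership explicitly, because a bare non-empty string on the left of `or` is always truthy. The sketch below shows the pitfall and one way around it; `needs_htcondor_executor` is a hypothetical helper, not part of the PR.

```python
# Backends that need the single-worker HTCondor executor.
HTCONDOR_BACKENDS = frozenset({'htcondorcern', 'htcondorvc3'})


def needs_htcondor_executor(supported_backends):
    """Return True when at least one HTCondor backend is enabled."""
    return not HTCONDOR_BACKENDS.isdisjoint(supported_backends)


# Pitfall: this parses as `'htcondorcern' or ('htcondorvc3' in [...])`, and the
# left operand is a non-empty string, so the whole expression is always truthy.
always_truthy = bool('htcondorcern' or 'htcondorvc3' in ['kubernetes'])

if __name__ == '__main__':
    print(always_truthy)
    print(needs_htcondor_executor(['kubernetes']))
    print(needs_htcondor_executor(['kubernetes', 'htcondorvc3']))
```

With the set-based check, a Kubernetes-only deployment does not create the executor.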