Update conda environment and build. (#4749)
cmnbroad authored Jun 1, 2018
1 parent 1913c3f commit 7628cc9
Showing 5 changed files with 95 additions and 28 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -19,6 +19,9 @@ tmp/
src/main/java/local/
out/
*.pyc
gatkcondaenv.yml
gatkcondaenv.intel.yml
gatkPythonPackageArchive.zip

#Please don't commit me
client_secret.json
7 changes: 4 additions & 3 deletions Dockerfile
@@ -5,7 +5,7 @@ ARG DRELEASE
ADD . /gatk

WORKDIR /gatk
RUN /gatk/gradlew clean compileTestJava sparkJar localJar createPythonPackageArchive -Drelease=$DRELEASE
RUN /gatk/gradlew clean compileTestJava sparkJar localJar condaEnvironmentDefinition -Drelease=$DRELEASE

WORKDIR /root

@@ -40,9 +40,10 @@ RUN mkdir $DOWNLOAD_DIR && \
bash $DOWNLOAD_DIR/miniconda.sh -p $CONDA_PATH -b && \
rm $DOWNLOAD_DIR/miniconda.sh
ENV PATH $CONDA_PATH/envs/gatk/bin:$CONDA_PATH/bin:$PATH
WORKDIR /gatk
RUN conda env create -n gatk -f /gatk/scripts/gatkcondaenv.yml && \
WORKDIR /gatk/build
RUN conda env create -n gatk -f gatkcondaenv.yml && \
echo "source activate gatk" >> /gatk/gatkenv.rc
WORKDIR /gatk

CMD ["bash", "--init-file", "/gatk/gatkenv.rc"]

37 changes: 26 additions & 11 deletions README.md
@@ -54,17 +54,7 @@ releases of the toolkit.
* Java 8
* Python 2.6 or greater (required to run the `gatk` frontend script)
* Python 3.6.2, along with a set of additional Python packages, is required to run some tools and workflows.
GATK uses the [Conda](https://conda.io/docs/index.html) package manager to establish and manage the
environment and dependencies required by these tools. The GATK Docker image comes with this environment
pre-configured. In order to establish an environment suitable to run these tools outside of the Docker image, the
conda [gatkcondaenv.yml](https://github.com/broadinstitute/gatk/blob/master/scripts/gatkcondaenv.yml) file is
provided. To establish the conda environment locally, [Conda](https://conda.io/docs/index.html) must first
be installed. Then, create the gatk environment by running the command ```conda env create -n gatk -f gatkcondaenv.yml```
(developers should run ```./gradlew createPythonPackageArchive```, followed by
```conda env create -n gatk -f scripts/gatkcondaenv.yml``` from within the root of the repository clone).
To activate the environment once it has been created, run the command ```source activate gatk```. See the
[Conda](https://conda.io/docs/user-guide/tasks/manage-environments.html) documentation for
additional information about using and managing Conda environments.
See [Python Dependencies](#python) for more information.
* R 3.2.5 (needed for producing plots in certain tools)
* To build GATK:
* A Java 8 JDK
@@ -82,6 +72,31 @@ releases of the toolkit.
* Pre-packaged Docker images with all needed dependencies installed can be found on
[our dockerhub repository](https://hub.docker.com/r/broadinstitute/gatk/). This requires a recent version of the
docker client, which can be found on the [docker website](https://www.docker.com/get-docker).
* Python Dependencies:<a name="python"></a>
* GATK4 uses the [Conda](https://conda.io/docs/index.html) package manager to establish and manage the
Python environment and dependencies required by GATK tools that have a Python dependency. There are two different
conda environments that can be used:
* The ```gatk``` environment, which has no special hardware requirements. The GATK Docker image comes with the
"gatk" environment pre-configured.
* The ```gatk-intel``` environment, which requires and uses Intel (AVX2 or AVX-512) hardware acceleration to
increase performance.
* To establish the conda environment when not using the Docker image, a conda environment must first be "created", and
then "activated":
* First, make sure [Miniconda or Conda](https://conda.io/docs/index.html) is installed (Miniconda is sufficient).
* To "create" the conda environment:
* If running from a zip or tar distribution, run the command ```conda env create -f gatkcondaenv.yml``` to
create the ```gatk``` environment, or the command ```conda env create -f gatkcondaenv.intel.yml``` to create
the ```gatk-intel``` environment.
* If running from a cloned repository, run ```./gradlew localDevCondaEnv```. This generates the Python
package archive and conda yml dependency file(s) in the build directory, and also creates (or updates)
the local ```gatk``` conda environment. (To create the ```gatk-intel``` conda environment once the files
have been generated, run the command ```conda env create -f gatkcondaenv.intel.yml```).
* To "activate" the conda environment (the conda environment must be activated within the same shell from which
GATK is run):
* Execute the shell command ```source activate gatk``` to activate the ```gatk``` environment, or
```source activate gatk-intel``` to activate the ```gatk-intel``` environment.
* See the [Conda](https://conda.io/docs/user-guide/tasks/manage-environments.html) documentation for
additional information about using and managing Conda environments.
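The create-then-activate steps in the README text above can be summarized as a small lookup. The following is a minimal sketch, not part of the commit: the function name `setup_commands` and its parameters are hypothetical, and only the command strings are taken from the README wording. It prints nothing and runs no commands; it only assembles the commands a user would type.

```python
# Environment name -> generated conda yml file, as named in the README text.
ENV_FILES = {
    "gatk": "gatkcondaenv.yml",
    "gatk-intel": "gatkcondaenv.intel.yml",
}


def setup_commands(env_name="gatk", from_clone=False):
    """Return the shell commands the README describes for one environment.

    Hypothetical helper for illustration only; it does not execute anything.
    """
    if env_name not in ENV_FILES:
        raise ValueError(f"unknown conda environment: {env_name}")

    if from_clone:
        # From a cloned repository, the Gradle task generates the yml files
        # and creates (or updates) the standard "gatk" environment.
        commands = ["./gradlew localDevCondaEnv"]
        if env_name == "gatk-intel":
            # The Intel environment still has to be created by hand. The
            # generated files live in the build directory, so this is
            # presumably run from there (an inference, not stated in the README).
            commands.append(f"conda env create -f {ENV_FILES[env_name]}")
    else:
        # From a zip or tar distribution, the yml files ship with the bundle.
        commands = [f"conda env create -f {ENV_FILES[env_name]}"]

    # Activation must happen in the same shell that later runs GATK.
    commands.append(f"source activate {env_name}")
    return commands
```

For example, `setup_commands("gatk-intel", from_clone=True)` yields the Gradle step, the manual Intel `conda env create` step, and the activation step, in that order.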

## <a name="quickstart">Quick Start Guide</a>

68 changes: 58 additions & 10 deletions build.gradle
@@ -64,6 +64,7 @@ final sparkVersion = System.getProperty('spark.version', '2.2.0')
final hadoopVersion = System.getProperty('hadoop.version', '2.8.2')
final hadoopBamVersion = System.getProperty('hadoopBam.version','7.10.0')
final genomicsdbVersion = System.getProperty('genomicsdb.version','0.9.2-proto-3.0.0-beta-1+uuid-static')
final tensorflowVersion = System.getProperty('tensorflow.version','1.4.0')
final testNGVersion = '6.11'
// Using the shaded version to avoid conflicts between its protobuf dependency
// and that of Hadoop/Spark (either the one we reference explicitly, or the one
@@ -74,6 +75,9 @@ final baseJarName = 'gatk'
final secondaryBaseJarName = 'hellbender'
final docBuildDir = "$buildDir/docs"
final pythonPackageArchiveName = 'gatkPythonPackageArchive.zip'
final gatkCondaTemplate = "gatkcondaenv.yml.template"
final gatkCondaYML = "gatkcondaenv.yml"
final gatkCondaIntelYML = "gatkcondaenv.intel.yml"
final largeResourcesFolder = "src/main/resources/large"
final buildPrerequisitesMessage = "See https://github.com/broadinstitute/gatk#building for information on how to build GATK."

@@ -518,7 +522,7 @@ task sparkJar(type: ShadowJar) {
}

task bundle(type: Zip) {
dependsOn shadowJar, sparkJar, 'gatkTabComplete', 'createPythonPackageArchive', 'gatkDoc'
dependsOn shadowJar, sparkJar, 'condaEnvironmentDefinition', 'gatkTabComplete', 'gatkDoc'

doFirst {
assert file("gatk").exists()
@@ -542,27 +546,58 @@ task bundle(type: Zip) {
rename 'GATKConfig.properties', 'GATKConfig.EXAMPLE.properties'
}

from("${buildDir}/${pythonPackageArchiveName}")
// When including gatkcondaenv.yml file in the release bundle, strip off the
// 'build/' prefix used for the location of the Python package archive.
from("scripts/gatkcondaenv.yml", {
filter { line -> line.replace("build/${pythonPackageArchiveName}", pythonPackageArchiveName) }
})
from("$buildDir/$pythonPackageArchiveName")
from("$buildDir/$gatkCondaYML")
from("$buildDir/$gatkCondaIntelYML")
into(baseName)

doLast {
logger.lifecycle("Created GATK distribution in ${destinationDir}/${archiveName}")
}
}

task createPythonPackageArchive(type: Zip) {
task condaStandardEnvironmentDefinition(type: Copy) {
from "scripts"
into buildDir
include gatkCondaTemplate
rename { file -> gatkCondaYML }
expand(["condaEnvName":"gatk",
"condaEnvDescription" : "Conda environment for GATK Python Tools",
"tensorFlowDependency" : "tensorflow==$tensorflowVersion"])
doLast {
logger.lifecycle("Created standard Conda environment yml file: $gatkCondaYML")
}
}

task condaIntelEnvironmentDefinition(type: Copy) {
from "scripts"
into buildDir
include gatkCondaTemplate
rename { file -> gatkCondaIntelYML }
expand(["condaEnvName":"gatk-intel",
"condaEnvDescription" : "Conda environment for GATK Python Tools running with Intel hardware acceleration",
"tensorFlowDependency" :
"https://anaconda.org/intel/tensorflow/$tensorflowVersion/download/tensorflow-$tensorflowVersion-cp36-cp36m-linux_x86_64.whl"])
doLast {
logger.lifecycle("Created Intel Conda environment yml file: $gatkCondaIntelYML")
}
}

// Create two GATK conda environment yml files from the conda env template
// (one for standard GATK and one for running GATK with Intel hardware).
task condaEnvironmentDefinition() {
dependsOn 'pythonPackageArchive', 'condaStandardEnvironmentDefinition', 'condaIntelEnvironmentDefinition'
}

// Create the Python package archive file
task pythonPackageArchive(type: Zip) {
inputs.dir "src/main/python/org/broadinstitute/hellbender/"
outputs.file pythonPackageArchiveName
doFirst {
logger.lifecycle("Creating GATK Python package archive...")
assert file("src/main/python/org/broadinstitute/hellbender/").exists()
}

destinationDir file("$buildDir")
destinationDir file("${buildDir}")
archiveName pythonPackageArchiveName
from("src/main/python/org/broadinstitute/hellbender/")
into("/")
@@ -572,6 +607,19 @@ task createPythonPackageArchive(type: Zip) {
}
}
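As a rough illustration of what the `pythonPackageArchive` Zip task above does, the sketch below zips a source tree so that entries sit at the archive root. It is a Python analogue written for this note, not the Gradle implementation; the function name `create_package_archive` is hypothetical, and incremental-build behavior (`inputs`/`outputs`) is not modeled.

```python
import zipfile
from pathlib import Path


def create_package_archive(source_dir, archive_path):
    """Zip every file under source_dir, storing paths relative to source_dir."""
    source = Path(source_dir)
    # Mirrors the task's doFirst assertion that the Python sources exist.
    if not source.is_dir():
        raise FileNotFoundError(f"Python source directory not found: {source}")

    archive = Path(archive_path)
    # Analogous to destinationDir: make sure the build directory exists.
    archive.parent.mkdir(parents=True, exist_ok=True)

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(source.rglob("*")):
            if path.is_file():
                # Analogous to into("/"): entry names start at the archive root.
                zf.write(path, path.relative_to(source).as_posix())
    return archive
```

With the paths from the diff, a call would look like `create_package_archive("src/main/python/org/broadinstitute/hellbender", "build/gatkPythonPackageArchive.zip")`.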

// Creates a standard, local, GATK conda environment, for use by developers during iterative
// development. Assumes conda or miniconda is already installed.
//
// NOTE: This CREATES a local conda environment; but does not *activate* it. The environment must
// be activated manually in the shell from which GATK will be run.
//
task localDevCondaEnv(type: Exec) {
dependsOn 'condaEnvironmentDefinition'
inputs.file("$buildDir/$pythonPackageArchiveName")
workingDir "$buildDir"
commandLine "conda", "env", "update", "-f", gatkCondaYML
}

task javadocJar(type: Jar, dependsOn: javadoc) {
classifier = 'javadoc'
from "$docBuildDir/javadoc"
8 changes: 4 additions & 4 deletions scripts/gatkcondaenv.yml.template
@@ -1,6 +1,6 @@
# Conda environment for GATK Python Tools
# $condaEnvDescription
#
name: gatk
name: $condaEnvName
channels:
- defaults
dependencies:
Expand Down Expand Up @@ -43,9 +43,9 @@ dependencies:
- scikit-learn==0.19.1
- scipy==1.0.0
- six==1.11.0
- tensorflow==1.4.0
- $tensorFlowDependency
- tensorflow-tensorboard==0.4.0rc3
- theano==0.9.0
- tqdm==4.19.4
- werkzeug==0.12.2
- build/gatkPythonPackageArchive.zip
- gatkPythonPackageArchive.zip
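
The template placeholders above (`$condaEnvDescription`, `$condaEnvName`, `$tensorFlowDependency`) are filled in by the two Gradle `Copy` tasks via `expand`. The sketch below imitates that substitution with Python's `string.Template`, which happens to use the same `$name` syntax. It is an analogy under stated assumptions, not the Gradle mechanism itself: the template is abbreviated to a few lines, and `render_all` is a hypothetical name. The substitution values are copied from the `build.gradle` hunk.

```python
from string import Template

# Abbreviated stand-in for the conda template; the real file pins many more
# packages than shown here.
TEMPLATE = Template("""\
# $condaEnvDescription
#
name: $condaEnvName
dependencies:
- $tensorFlowDependency
- gatkPythonPackageArchive.zip
""")

TENSORFLOW_VERSION = "1.4.0"  # default of the tensorflow.version property

# Output file name -> substitution values, mirroring the two Copy tasks.
ENVIRONMENTS = {
    "gatkcondaenv.yml": {
        "condaEnvName": "gatk",
        "condaEnvDescription": "Conda environment for GATK Python Tools",
        "tensorFlowDependency": f"tensorflow=={TENSORFLOW_VERSION}",
    },
    "gatkcondaenv.intel.yml": {
        "condaEnvName": "gatk-intel",
        "condaEnvDescription": (
            "Conda environment for GATK Python Tools running with "
            "Intel hardware acceleration"
        ),
        "tensorFlowDependency": (
            f"https://anaconda.org/intel/tensorflow/{TENSORFLOW_VERSION}"
            f"/download/tensorflow-{TENSORFLOW_VERSION}"
            "-cp36-cp36m-linux_x86_64.whl"
        ),
    },
}


def render_all():
    """Return a mapping of output file name to rendered yml text."""
    return {
        filename: TEMPLATE.substitute(values)
        for filename, values in ENVIRONMENTS.items()
    }
```

One template therefore yields two environment definitions that differ only in name, description, and the TensorFlow dependency line.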
