diff --git a/.Rbuildignore b/.Rbuildignore index a9684fa4..9d27b332 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -4,11 +4,11 @@ \.swp$ ^\.ignore$ ^\.editorconfig$ -^.travis.yml$ +^\.travis\.yml$ ^man-roxygen$ ^appveyor\.yml$ ^.*\.Rproj$ ^\.Rproj\.user$ ^docs$ ^paper$ -^_pkgdown.yml$ +^_pkgdown\.yml$ diff --git a/NEWS.md b/NEWS.md index 266f8447..1e48c77b 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,6 +1,7 @@ # batchtools 0.9.4 * Fixed handling of `file.dir` with special chars like whitespace. +* All backslashes will now be converted to forward slashes on Windows. * Fixed order of arguments in `findExperiments()` (argument `ids` is now first). * Removed code to upgrade registries created with versions prior to v0.9.0 (first CRAN release). * `addExperiments()` now warns if a design is passed as `data.frame` with factor columns and `stringsAsFactors` is `TRUE`. @@ -45,4 +46,4 @@ # batchtools 0.9.0 Initial CRAN release. -See this [vignette](https://mllg.github.io/batchtools/articles/v01_Migration) for a brief comparison with [BatchJobs](https://cran.r-project.org/package=BatchJobs)/[BatchExperiments](https://cran.r-project.org/package=BatchExperiments). +See the vignette for a brief comparison with [BatchJobs](https://cran.r-project.org/package=BatchJobs)/[BatchExperiments](https://cran.r-project.org/package=BatchExperiments). diff --git a/README.md b/README.md index 46b25b66..d3f9bb25 100644 --- a/README.md +++ b/README.md @@ -32,15 +32,15 @@ The development of [BatchJobs](https://github.com/tudo-r/BatchJobs/) and [BatchE * Data base issues: Although we invested weeks to mitigate issues with locks of the SQLite data base or file system (staged queries, file system timeouts, ...), `BatchJobs` kept working unreliably on some systems with high latency or specific file systems. This made `BatchJobs` unusable for many users. [BatchJobs](https://github.com/tudo-r/BatchJobs/) and [BatchExperiments](https://github.com/tudo-r/Batchexperiments) will remain on CRAN, but new features are unlikely to be ported back. -See this [vignette](https://mllg.github.io/batchtools/articles/v01_Migration) for a comparison of the packages. +The [vignette](https://mllg.github.io/batchtools/articles) contains a section comparing the packages. ## Resources * [NEWS](https://mllg.github.io/batchtools/news/) * [Function reference](https://mllg.github.io/batchtools/reference) -* [Vignettes](https://mllg.github.io/batchtools/articles) +* [Vignette](https://mllg.github.io/batchtools/articles) * [JOSS Paper](http://dx.doi.org/10.21105/joss.00135): Short paper on batchtools. Please cite this if you use batchtools. -* [Paper on BatchJobs/BatchExperiments](http://www.jstatsoft.org/v64/i11): The described concept still holds for batchtools and most examples work analogously (see this [vignette](https://mllg.github.io/batchtools/articles/v01_Migration) for differences between the packages). +* [Paper on BatchJobs/BatchExperiments](http://www.jstatsoft.org/v64/i11): The described concept still holds for batchtools and most examples work analogously (see the [vignette](https://mllg.github.io/batchtools/articles) for differences between the packages).
## Citation Please cite the [JOSS paper](http://dx.doi.org/10.21105/joss.00135) using the following BibTeX entry: diff --git a/docs/LICENSE b/docs/LICENSE.html similarity index 63% rename from docs/LICENSE rename to docs/LICENSE.html index 65c5ca88..22064175 100644 --- a/docs/LICENSE +++ b/docs/LICENSE.html @@ -1,7 +1,98 @@ - GNU LESSER GENERAL PUBLIC LICENSE + + + + + + + + +License • batchtools + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ + + +
+ +
+
+ + +
                   GNU LESSER GENERAL PUBLIC LICENSE
                        Version 3, 29 June 2007
 
- Copyright (C) 2007 Free Software Foundation, Inc. 
+ Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
  Everyone is permitted to copy and distribute verbatim copies
  of this license document, but changing it is not allowed.
 
@@ -12,29 +103,29 @@
 
   0. Additional Definitions.
 
-  As used herein, "this License" refers to version 3 of the GNU Lesser
-General Public License, and the "GNU GPL" refers to version 3 of the GNU
+  As used herein, "this License" refers to version 3 of the GNU Lesser
+General Public License, and the "GNU GPL" refers to version 3 of the GNU
 General Public License.
 
-  "The Library" refers to a covered work governed by this License,
+  "The Library" refers to a covered work governed by this License,
 other than an Application or a Combined Work as defined below.
 
-  An "Application" is any work that makes use of an interface provided
+  An "Application" is any work that makes use of an interface provided
 by the Library, but which is not otherwise based on the Library.
 Defining a subclass of a class defined by the Library is deemed a mode
 of using an interface provided by the Library.
 
-  A "Combined Work" is a work produced by combining or linking an
+  A "Combined Work" is a work produced by combining or linking an
 Application with the Library.  The particular version of the Library
-with which the Combined Work was made is also called the "Linked
-Version".
+with which the Combined Work was made is also called the "Linked
+Version".
 
-  The "Minimal Corresponding Source" for a Combined Work means the
+  The "Minimal Corresponding Source" for a Combined Work means the
 Corresponding Source for the Combined Work, excluding any source code
 for portions of the Combined Work that, considered in isolation, are
 based on the Application, and not on the Linked Version.
 
-  The "Corresponding Application Code" for a Combined Work means the
+  The "Corresponding Application Code" for a Combined Work means the
 object code and/or source code for the Application, including any data
 and utility programs needed for reproducing the Combined Work from the
 Application, but excluding the System Libraries of the Combined Work.
@@ -150,7 +241,7 @@
 
   Each version is given a distinguishing version number. If the
 Library as you received it specifies that a certain numbered version
-of the GNU Lesser General Public License "or any later version"
+of the GNU Lesser General Public License "or any later version"
 applies to it, you have the option of following the terms and
 conditions either of that published version or of any later version
 published by the Free Software Foundation. If the Library as you
@@ -163,3 +254,24 @@
 apply, that proxy's public statement of acceptance of any version is
 permanent authorization for you to choose that version for the
 Library.
+
+ +
+ +
+ + + +
+ + + diff --git a/docs/articles/batchtools.html b/docs/articles/batchtools.html new file mode 100644 index 00000000..cc3ca276 --- /dev/null +++ b/docs/articles/batchtools.html @@ -0,0 +1,635 @@ + + + + + + + +batchtools • batchtools + + + + + + +
+
+ + + +
+
+ + + + +
+
+

+Setup

+
+

+Cluster Functions

+

The communication with the batch system is managed via so-called cluster functions. They are created with the constructor makeClusterFunctions which defines how jobs are submitted on your system. Furthermore, you may provide functions to list queued/running jobs and to kill jobs.

+

Usually you do not have to start from scratch but can just use one of the cluster functions which ship with the package:

+ +


To use the package with the socket cluster functions, you would call the respective constructor makeClusterFunctionsSocket():

+
reg = makeRegistry(NA)
+
## Sourcing configuration file '/home/lang/.config/batchtools/config.R' ...
+
## Loading required package: methods
+
## Created registry in '/scratch/registry5a017e1e9ddf' using cluster functions 'Interactive'
+
reg$cluster.functions = makeClusterFunctionsSocket(2)
+

To make this selection permanent for this registry, save the Registry with saveRegistry. To make your cluster function selection permanent for a specific system across R sessions for all new Registries, you can set up a configuration file (see below).

+
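As a minimal sketch (assuming reg is the registry created above), persisting the choice looks like this:
reg$cluster.functions = makeClusterFunctionsSocket(2)
saveRegistry(reg = reg)   # writes the updated registry to its file.dir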

If you have trouble debugging your cluster functions, you can enable the debug mode for extra output. To do so, install the debugme package and set the environment variable DEBUGME to batchtools before you load the batchtools package:

+
Sys.setenv(DEBUGME = "batchtools")
+library(batchtools)
+
+
+

+Template files

+

Many cluster functions require a template file as an argument. These templates are used to communicate with the scheduler and contain placeholders to evaluate arbitrary R expressions. Internally, the brew package is used for this purpose. Some exemplary template files can be found here. It would be great if you could help expand this collection to cover more exotic configurations. To do so, please send your template via e-mail or open a new pull request.

+

Note that all variables defined in a JobCollection can be used inside the template. If you need to pass extra variables, you can set them via the argument resources of submitJobs.

+
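For illustration, a sketch of how a template-based constructor and extra template variables fit together; the template file name ("torque.tmpl") and the extra variable (my.extra.flag) are placeholders, not shipped defaults:
# assumes a TORQUE/PBS system and a template file "torque.tmpl" in the working directory
reg$cluster.functions = makeClusterFunctionsTORQUE(template = "torque.tmpl")
# everything passed in 'resources' is available while brewing the template
submitJobs(resources = list(walltime = 3600, memory = 1024, my.extra.flag = TRUE))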

If the flexibility which comes with templating is not sufficient, you can still construct a custom cluster function implementation yourself using the provided constructor.

+
+
+

+Configuration file

+

The configuration file can be used to set system-specific options. Its default location depends on the operating system (see Registry), but for the first-time setup you can put one in the current working directory (as reported by getwd()). In order to set the cluster function implementation, you would generate a file with the following content:

+
cluster.functions = makeClusterFunctionsInteractive()
+

The configuration file is parsed whenever you create or load a Registry. It is sourced inside of your registry, which has the advantage that you can (a) access all of the parameters which are passed to makeRegistry and (b) directly change them. Let's say you always want your working directory to be your home directory and you always want to load the checkmate package on the nodes; then you can just append these lines:

+
work.dir = "~"
+packages = union(packages, "checkmate")
+

See the documentation on Registry for a more complete list of supported configuration options.

+
+
+
+

+Migration

+

The development of BatchJobs and BatchExperiments has been discontinued for the following reasons:

+
    +
  • Maintainability: The packages BatchJobs and BatchExperiments are tightly connected, which makes maintenance difficult. Changes have to be synchronized and tested against the current CRAN versions for compatibility. Furthermore, BatchExperiments violates CRAN policies by calling internal functions of BatchJobs.
  • +
  • Data base issues: Although we invested weeks to mitigate issues with locks of the SQLite data base or file system (staged queries, file system timeouts, …), BatchJobs kept working unreliably on some systems with high latency or specific file systems. This made BatchJobs unusable for many users.
  • +
+

BatchJobs and BatchExperiments will remain on CRAN, but new features are unlikely to be ported back.

+
+

+Internal changes

+
    +
  • batchtools does not use SQLite anymore. Instead, all the information is stored directly in the registry using data.tables acting as an in-memory database. As a side effect, many operations are much faster.
  • +
  • Nodes do not have to access the registry. submitJobs() stores a temporary object of type JobCollection on the file system which holds all the information necessary to execute a chunk of jobs via doJobCollection() on the node. This avoids file system locks because each job accesses only one file exclusively.
  • +
  • +ClusterFunctionsMulticore now uses the parallel package for multicore execution. ClusterFunctionsSSH can still be used to emulate a scheduler-like system which respects the work load on the local machine.
  • +
+
+
+

+Interface changes

+
    +
  • batchtools remembers the last created or loaded Registry and sets it as the default registry. This way, you do not need to pass the registry around anymore. If, on the other hand, you need to work with multiple registries simultaneously, you can still do so by explicitly passing registries to the functions.
  • +
  • Most functions now return a data.table which is keyed with the job.id. This way, return values can be joined together easily and efficiently (see this help page for some examples).
  • +
  • The building blocks of a problem have been renamed from static and dynamic to the more intuitive data and fun. Thus, algorithm functions should have the formal arguments job, data and instance.
  • +
  • The function makeDesign has been removed. Parameters can be defined by just passing a data.frame or data.table to addExperiments. For exhaustive designs, use data.table::CJ().
  • +
+
+
+

+Template changes

+
    +
  • The scheduler should directly execute the command Rscript -e 'batchtools::doJobCollection(<filename>)'. There is no intermediate R source file like in BatchJobs.
  • +
  • All information stored in the object JobCollection can be accessed while brewing the template.
  • +
  • Some variable names have changed and need to be adapted, e.g. job.name is now job.hash.
  • +
  • Extra variables may be passed via the argument resources of submitJobs.
  • +
+
+
+

+New features

+
    +
  • Support for Docker Swarm via ClusterFunctionsDocker.
  • +
  • Jobs can now be tagged and untagged to provide an easy way to group them.
  • +
  • Some resources like the number of CPUs are now optionally passed to parallelMap. This eases nested parallelization, e.g. to use multicore parallelization on the slave by just setting a resource on the master. See submitJobs() for an example.
  • +
  • +ClusterFunctions are now more flexible in general as they can define hook functions which will be called at certain events. ClusterFunctionsDocker is an example use case which implements a housekeeping routine. This routine is called every time before a job is about to get submitted to the scheduler (in this case: the Docker Swarm) via the hook pre.submit, and every time directly after the registry has synchronized jobs stored on the file system via the hook post.sync.
  • +
  • More new features are covered in the NEWS.
  • +
+
+
+

+Porting to batchtools

+

The following table assists in porting to batchtools by mapping BatchJobs/BatchExperiments functions to their counterparts in batchtools. The table does not cover functions which are (a) used only internally in BatchJobs or (b) have not been renamed.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
BatchJobsbatchtools
addRegistryPackagesSet reg$packages or reg$namespaces, call saveRegistry +
addRegistrySourceDirs-
addRegistrySourceFilesSet reg$source, call saveRegistry +
batchExpandGrid +batchMap: batchMap(..., args = CJ(x = 1:3, y = 1:10)) +
batchMapQuickbtmapply
batchReduceResults-
batchUnexportbatchExport
filterResults-
getJobIdsfindJobs
getJobInfogetJobStatus
getJobmakeJob
getJobParamDfgetJobPars
loadResultsreduceResultsList
reduceResultsDataFramereduceResultsDataTable
reduceResultsMatrix +reduceResultsList + do.call(rbind, res) +
reduceResultsVectorreduceResultsDataTable
setJobFunction-
setJobNames-
showStatusgetStatus
+
+
+
+
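For instance, the batchExpandGrid mapping from the table translates as follows (a sketch; f is a placeholder function):
# BatchJobs:   batchExpandGrid(reg, f, x = 1:3, y = 1:10)
# batchtools:  pass an exhaustive design via the 'args' argument
f = function(x, y) x * y   # placeholder function
batchMap(f, args = CJ(x = 1:3, y = 1:10))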

+Example 1: Approximation of \(\pi\) +

+

To get a first insight into the usage of batchtools, we start with an exemplary Monte Carlo simulation to approximate \(\pi\). For background information, see Wikipedia.

+

First, a so-called registry object has to be created, which defines a directory where all relevant information, files and results of the computational jobs will be stored. There are two different types of registry objects: First, a regular Registry which we will use in this example. Second, an ExperimentRegistry which provides an alternative way to define computational jobs and thereby is tailored for a broad range of large scale computer experiments (see, for example, this vignette). Here, we use a temporary registry which is stored in the temp directory of the system and gets automatically deleted if you close the R session.

+
reg = makeRegistry(file.dir = NA, seed = 1)
+

For a permanent registry, set the file.dir to a valid path. It can then be reused later, e.g., when you login to the system again, by calling the function loadRegistry(file.dir).

+
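A short sketch with a hypothetical path:
reg = makeRegistry(file.dir = "~/experiments/pi-sim", seed = 1)
# ... log out, log in again later ...
reg = loadRegistry(file.dir = "~/experiments/pi-sim")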

When a registry object is created or loaded, it is stored for the active R session as the default. Therefore the argument reg will be ignored in function calls of this example, assuming the correct registry is set as the default. To get the current default registry, getDefaultRegistry can be used. To switch to another registry, use setDefaultRegistry().

+
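In code (sketch):
getDefaultRegistry()     # the registry currently used by default
setDefaultRegistry(reg)  # switch the default to another registry object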

First, we create a function which samples \(n\) points \((x_i, y_i)\) where \(x_i\) and \(y_i\) are uniformly distributed, i.e. \(x_i, y_i \sim \mathcal{U}(0,1)\). Next, the distance to the origin \((0, 0)\) is calculated and the fraction of points in the unit circle (\(d \leq 1\)) is returned.

+
piApprox = function(n) {
+  nums = matrix(runif(2 * n), ncol = 2)
+  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
+  4 * mean(d <= 1)
+}
+piApprox(1000)
+
## [1] 3.136
+

We now parallelize piApprox() with batchtools: we create 10 jobs, each doing an MC simulation with \(10^5\) samples. We use batchMap() to define the jobs (note that this does not yet start the calculation):

+
batchMap(fun = piApprox, n = rep(1e5, 10))
+
## Adding 10 jobs ...
+

The length of the vector or list defines how many different jobs are created, while the elements themselves are used as arguments for the function. The function batchMap(fun, ...) works analogously to Map(f, ...) of the base package. An overview of the jobs and their IDs can be retrieved with getJobTable() which returns a data.frame with all relevant information:

+
names(getJobTable())
+
##  [1] "job.id"       "submitted"    "started"      "done"        
+##  [5] "error"        "memory"       "batch.id"     "log.file"    
+##  [9] "job.hash"     "time.queued"  "time.running" "n"           
+## [13] "tags"
+
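As an aside, batchMap() also accepts several argument vectors and constant arguments via more.args, just like Map(); a sketch with a hypothetical function f (in a fresh registry):
f = function(x, y, z) x + y + z
batchMap(f, x = 1:3, y = 3:1, more.args = list(z = 10))  # defines 3 jobs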

Note that a unique job ID is assigned to each job. These IDs can be used to restrict operations to subsets of jobs. To actually start the calculation, call submitJobs(). The registry and the selected job IDs can be passed as arguments, as well as an arbitrary list of resource requirements which are handled by the cluster back end.

+
submitJobs(resources = list(walltime = 3600, memory = 1024))
+
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
+

In this example, a cap for the execution time (so-called walltime) and for the maximum memory requirements are set. The progress of the submitted jobs can be checked with getStatus().

+ +
## Syncing 10 files ...
+
## Status for 10 jobs:
+##   Submitted : 10 (100.0%)
+##   Started   : 10 (100.0%)
+##   Done      : 10 (100.0%)
+##   Error     :  0 (  0.0%)
+##   Queued    :  0 (  0.0%)
+##   Running   :  0 (  0.0%)
+##   Expired   :  0 (  0.0%)
+

The resulting output includes the number of jobs in the registry, how many have been submitted, have started to execute on the batch system, are currently running, have successfully completed, and have terminated due to an R exception. After jobs have successfully terminated, we can load their results on the master. This can be done in a simple fashion by using either loadResult(), which returns a single result exactly in the form it was calculated during mapping, or by using reduceResults(), which is a version of Reduce() from the base package for registry objects.

+ +
## [1] TRUE
+
mean(sapply(1:10, loadResult))
+
## [1] 3.140652
+
reduceResults(function(x, y) x + y) / 10
+
## [1] 3.140652
+

If you are absolutely sure that your function works, you can take a shortcut and use batchtools in an lapply fashion using btlapply(). This function creates a temporary registry (but you may also pass one yourself), calls batchMap(), waits for the jobs to terminate with waitForJobs(), and then uses reduceResultsList() to return the results.

+
res = btlapply(rep(1e5, 10), piApprox)
+
## Sourcing configuration file '/home/lang/.config/batchtools/config.R' ...
+
## Created registry in '/scratch/registry5a0130bc7045' using cluster functions 'Interactive'
+
## Adding 10 jobs ...
+
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
+
## Syncing 10 files ...
+
mean(unlist(res))
+
## [1] 3.143484
+
+
+

+Example 2: Machine Learning

+

We stick to a rather simple, but not unrealistic example to explain some further functionalities: applying two classification learners to the famous iris data set (Anderson 1935), varying a few hyperparameters, and evaluating the effect on the classification performance.

+

First, we create a registry, the central meta-data object which records technical details and the setup of the experiments. We use an ExperimentRegistry where the job definition is split into creating problems and algorithms. See the paper on BatchJobs and BatchExperiments for a detailed explanation. Again, we use a temporary registry and make it the default registry.

+
library(batchtools)
+reg = makeExperimentRegistry(file.dir = NA, seed = 1)
+
+

+Problems and Algorithms

+

By adding a problem to the registry, we can define the data on which certain computational jobs shall work. This can be a matrix, data frame or array that always stays the same for all subsequent experiments. But it can also be of a more dynamic nature, e.g., subsamples of a dataset or random numbers drawn from a probability distribution. Therefore the function addProblem() accepts static parts in its data argument, which is passed to the argument fun which generates a (possibly stochastic) problem instance. For data, any R object can be used. If only data is given, the generated instance is data. The argument fun has to be a function with the arguments data and job (and optionally other arbitrary parameters). The argument job is an object of type Job which holds additional information about the job.

+

We want to split the iris data set into a training set and a test set. In this example we use subsampling, which just randomly takes a fraction of the observations as the training set. We define a problem function which returns the indices of the respective training and test set for a split with 100 * ratio% of the observations being in the training set:

+
subsample = function(data, job, ratio, ...) {
+  n = nrow(data)
+  train = sample(n, floor(n * ratio))
+  test = setdiff(seq_len(n), train)
+  list(test = test, train = train)
+}
+

addProblem() writes the problem to the file system and records it in the registry.

+
data("iris", package = "datasets")
+addProblem(name = "iris", data = iris, fun = subsample, seed = 42)
+
## Adding problem 'iris'
+

The function call will be evaluated at a later stage on the workers. In this process, the data part will be loaded and passed to the function. Note that we set a problem seed to synchronize the experiments in the sense that the same resampled training and test sets are used for the algorithm comparison in each distinct replication.

+

The algorithms for the jobs are added to the registry in a similar manner. When using addAlgorithm(), an identifier as well as the algorithm to apply are required arguments. The algorithm must be given as a function with arguments job, data and instance. Further arbitrary arguments (e.g., hyperparameters or strategy parameters) may be defined analogously as for the function in addProblem. The objects passed to the function via job and data are here the same as above, while via instance the return value of the evaluated problem function is passed. The algorithm can return any R object which will automatically be stored on the file system for later retrieval. Firstly, we create an algorithm which applies a support vector machine:

+
svm.wrapper = function(data, job, instance, ...) {
+  library("e1071")
+  mod = svm(Species ~ ., data = data[instance$train, ], ...)
+  pred = predict(mod, newdata = data[instance$test, ], type = "class")
+  table(data$Species[instance$test], pred)
+}
+addAlgorithm(name = "svm", fun = svm.wrapper)
+
## Adding algorithm 'svm'
+

Secondly, a random forest of classification trees:

+
forest.wrapper = function(data, job, instance, ...) {
+  library("ranger")
+  mod = ranger(Species ~ ., data = data[instance$train, ], write.forest = TRUE)
+  pred = predict(mod, data = data[instance$test, ])
+  table(data$Species[instance$test], pred$predictions)
+}
+addAlgorithm(name = "forest", fun = forest.wrapper)
+
## Adding algorithm 'forest'
+

Both algorithms return a confusion matrix for the predictions on the test set, which will later be used to calculate the misclassification rate.

+
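The rate can be computed from such a confusion matrix with a small helper like the following (not part of the package; the same expression is used when collecting results below):
mce = function(cm) (sum(cm) - sum(diag(cm))) / sum(cm)   # misclassified / total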

Note that using the ... argument in the wrapper definitions allows us to circumvent naming specific design parameters for now. This is an advantage if we later want to extend the set of algorithm parameters in the experiment. The algorithms get recorded in the registry and the corresponding functions are stored on the file system.

+

Defined problems and algorithms can be queried:

+ +
## [1] "iris"
+ +
## [1] "svm"    "forest"
+

The flow to define experiments is summarized in the following figure:

+

+
+
+

+Creating jobs

+

addExperiments() is used to parametrize the problems and algorithms and thereby define computational jobs. To do so, you have to pass named lists of parameters to addExperiments(). The elements of the respective list (one for problems and one for algorithms) must be named after the problem or algorithm they refer to. The data frames contain parameter constellations for the problem or algorithm function where columns must have the same names as the target arguments. When the problem design and the algorithm design are combined in addExperiments(), each combination of the parameter sets of the two designs defines a distinct job. How often each of these jobs should be computed can be determined with the argument repls.

+
# problem design: try two values for the ratio parameter
+pdes = list(iris = data.table(ratio = c(0.67, 0.9)))
+
+# algorithm design: try combinations of kernel and epsilon exhaustively,
+# try different number of trees for the forest
+ades = list(
+  svm = CJ(kernel = c("linear", "polynomial", "radial"), epsilon = c(0.01, 0.1)),
+  forest = data.table(ntree = c(100, 500, 1000))
+)
+
+addExperiments(pdes, ades, repls = 5)
+
## Adding 60 experiments ('iris'[2] x 'svm'[6] x repls[5]) ...
+
## Adding 30 experiments ('iris'[2] x 'forest'[3] x repls[5]) ...
+

The jobs are now available in the registry with an individual job ID for each. The function summarizeExperiments() returns a table which gives a quick overview of all defined experiments.

+ +
##    problem algorithm .count
+## 1:    iris       svm     60
+## 2:    iris    forest     30
+
summarizeExperiments(by = c("problem", "algorithm", "ratio"))
+
##    problem algorithm ratio .count
+## 1:    iris       svm  0.67     30
+## 2:    iris       svm  0.90     30
+## 3:    iris    forest  0.67     15
+## 4:    iris    forest  0.90     15
+
+
+

+Before you submit

+

Before submitting all jobs to the batch system, we encourage you to test each algorithm individually. Sometimes you may also want to submit only a subset of experiments because the jobs vastly differ in runtime. Another recurring task is the collection of results for only a subset of experiments. For all these use cases, findExperiments() can be employed to conveniently select a particular subset of jobs. It returns the IDs of all experiments that match the given criteria. Your selection can depend on substring matches of problem or algorithm IDs using prob.name or algo.name, respectively. You can also pass R expressions, which will be evaluated in your problem parameter setting (prob.pars) or algorithm parameter setting (algo.pars). The expression is then expected to evaluate to a Boolean value. Furthermore, you can restrict the experiments to specific replication numbers.

+
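Other criteria mentioned above can be combined in the same call, for example (a sketch):
findExperiments(prob.name = "iris", prob.pars = (ratio == 0.67), repls = 1)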

To illustrate findExperiments(), we will select two experiments, one with a support vector machine and the other with a random forest and the parameter ntree = 1000. The selected experiment IDs are then passed to testJob.

+
id1 = head(findExperiments(algo.name = "svm"), 1)
+print(id1)
+
##    job.id
+## 1:      1
+
id2 = head(findExperiments(algo.name = "forest", algo.pars = (ntree == 1000)), 1)
+print(id2)
+
##    job.id
+## 1:     71
+
testJob(id = id1)
+
## Generating problem instance for problem 'iris' ...
+## Applying algorithm 'svm' on problem 'iris' ...
+
##             pred
+##              setosa versicolor virginica
+##   setosa         17          0         0
+##   versicolor      0         16         2
+##   virginica       0          0        15
+
testJob(id = id2)
+
## Generating problem instance for problem 'iris' ...
+## Applying algorithm 'forest' on problem 'iris' ...
+
##             
+##              setosa versicolor virginica
+##   setosa         17          0         0
+##   versicolor      0         16         2
+##   virginica       0          1        14
+

If something goes wrong, batchtools comes with a bunch of useful debugging utilities (see separate vignette on error handling). If everything turns out fine, we can proceed with the calculation.

+
+
+

+Submitting and collecting results

+

To submit the jobs, we call submitJobs() and wait for all jobs to terminate using waitForJobs().

+ +
## Submitting 90 jobs in 90 chunks using cluster functions 'Interactive' ...
+ +
## Syncing 90 files ...
+
## [1] TRUE
+

After jobs are finished, the results can be collected with reduceResultsDataTable() where we directly extract the mean misclassification error:

+
results = reduceResultsDataTable(fun = function(res) (list(mce = (sum(res) - sum(diag(res))) / sum(res))))
+head(results)
+
##    job.id  mce
+## 1:      1 0.04
+## 2:      2 0.00
+## 3:      3 0.06
+## 4:      4 0.04
+## 5:      5 0.02
+## 6:      6 0.04
+

Next, we merge the results table with the table of job parameters using one of the join helpers provided by batchtools (here, we use an inner join):

+
tab = ijoin(getJobPars(), results)
+head(tab)
+
##    job.id problem algorithm ratio kernel epsilon ntree  mce
+## 1:      1    iris       svm  0.67 linear    0.01    NA 0.04
+## 2:      2    iris       svm  0.67 linear    0.01    NA 0.00
+## 3:      3    iris       svm  0.67 linear    0.01    NA 0.06
+## 4:      4    iris       svm  0.67 linear    0.01    NA 0.04
+## 5:      5    iris       svm  0.67 linear    0.01    NA 0.02
+## 6:      6    iris       svm  0.67 linear    0.10    NA 0.04
+

We now aggregate the results group-wise. You can use data.table, base::aggregate(), or the dplyr package for this purpose. Here, we use data.table to subset the table to jobs where the ratio is 0.67 and group by the algorithm and its hyperparameters:

+
tab[ratio == 0.67, list(mmce = mean(mce)), by = c("algorithm", "kernel", "epsilon", "ntree")]
+
##    algorithm     kernel epsilon ntree  mmce
+## 1:       svm     linear    0.01    NA 0.032
+## 2:       svm     linear    0.10    NA 0.032
+## 3:       svm polynomial    0.01    NA 0.088
+## 4:       svm polynomial    0.10    NA 0.088
+## 5:       svm     radial    0.01    NA 0.048
+## 6:       svm     radial    0.10    NA 0.048
+## 7:    forest         NA      NA   100 0.048
+## 8:    forest         NA      NA   500 0.052
+## 9:    forest         NA      NA  1000 0.044
+
+
+
+

+Example: Error Handling

+

In any large scale experiment many things can and will go wrong. The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered or any other error condition might arise. In these situations it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.

+

Therefore, before you submit anything you should use testJob() to catch errors that are easy to spot because they are raised in many or all jobs. If external is set, this function runs the job without side effects in an independent R process on your local machine via Rscript, similar to how it is executed on the slave; it redirects the output of the process to your R console, loads the job result and returns it. If you do not set external, the job is executed in the currently running R session, with the drawback that you might be unable to catch missing variable declarations or missing package dependencies.

+
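A sketch of both variants (job ID 1 is just an example):
testJob(id = 1)                   # runs in the current R session
testJob(id = 1, external = TRUE)  # runs in a separate R process via Rscript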

By way of illustration here is a small example. First, we create a temporary registry.

+
library(batchtools)
+reg = makeRegistry(file.dir = NA, seed = 1)
+

Ten jobs are created, two of which will throw an exception.

+
flakeyFunction <- function(value) {
+  if (value %in% c(2, 9)) stop("Ooops.")
+  value^2
+}
+batchMap(flakeyFunction, 1:10)
+
## Adding 10 jobs ...
+

Now that the jobs are defined, we can test jobs independently:

+
testJob(id = 1)
+
## [1] 1
+

In this case, testing the job with ID = 1 provides the appropriate result but testing the job with ID = 2 leads to an error:

+
as.character(try(testJob(id = 2)))
+
## [1] "Error in (function (value)  : Ooops.\n"
+

When you have already submitted the jobs and suspect that something is going wrong, the first thing to do is to run getStatus() to display a summary of the current state of the system.

+ +
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
+ +
## Syncing 10 files ...
+
## [1] FALSE
+ +
## Status for 10 jobs:
+##   Submitted : 10 (100.0%)
+##   Started   : 10 (100.0%)
+##   Done      :  8 ( 80.0%)
+##   Error     :  2 ( 20.0%)
+##   Queued    :  0 (  0.0%)
+##   Running   :  0 (  0.0%)
+##   Expired   :  0 (  0.0%)
+

The status message shows that two of the jobs could not be executed successfully. To get the IDs of all jobs that failed due to an error we can use findErrors() and to retrieve the actual error message, we can use getErrorMessages().

+ +
##    job.id
+## 1:      2
+## 2:      9
+ +
##    job.id terminated error                              message
+## 1:      2       TRUE  TRUE Error in (function (value)  : Ooops.
+## 2:      9       TRUE  TRUE Error in (function (value)  : Ooops.
+

If we want to peek into the R log file of a job to see more context for the error, we can use showLog(), which opens a pager, or getLog() to get the log as a character vector:

+
writeLines(getLog(id = 9))
+
## ### [bt 2017-07-28 16:03:31]: This is batchtools v0.9.3.9000
+## ### [bt 2017-07-28 16:03:31]: Starting calculation of 1 jobs
+## ### [bt 2017-07-28 16:03:31]: Setting working directory to '/tmp'
+## ### [bt 2017-07-28 16:03:31]: Memory measurement disabled
+## ### [bt 2017-07-28 16:03:31]: Starting job [batchtools job.id=9]
+## Error in (function (value)  : Ooops.
+## 
+## ### [bt 2017-07-28 16:03:31]: Job terminated with an exception [batchtools job.id=9]
+## ### [bt 2017-07-28 16:03:31]: Calculation finished!
+

You can also grep for error or warning messages:

+
ids = grepLogs(pattern = "ooops", ignore.case = TRUE)
+print(ids)
+
##    job.id                              matches
+## 1:      2 Error in (function (value)  : Ooops.
+## 2:      9 Error in (function (value)  : Ooops.
+
+
+

+Workflow

+
    +
  1. Prototype: You start off in your preferred working environment and start prototyping your experiments by writing a script which defines jobs/experiments with batchMap() or addProblem(), addAlgorithm() and addExperiments(), respectively.
  2. Test:
    1. Test some jobs with testJob() to identify first problems.
    2. Test some jobs with testJob(..., external = TRUE) to identify missing definitions not discovered while working in the interactive session.
  3. Deploy: First, you need to set up the file directory remotely:
    • Copy the created file.dir to the system (e.g., using scp), or
    • re-run the script you’ve used to populate the registry with jobs locally.
  4. Monitor: Check the progress of the computation, e.g. with getStatus() (see the sketch after this list).
  5. Collect: Once all jobs have terminated, collect the results, e.g. with reduceResultsList() or reduceResultsDataTable().
+
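A sketch of the monitor and collect steps using functions introduced earlier:
getStatus()                    # summary: submitted / started / done / error
findErrors()                   # ids of jobs which terminated with an exception
waitForJobs()                  # block until all jobs have terminated
results = reduceResultsList()  # collect all results in a list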
+
+
+ + + +
+ + + +
+ + + diff --git a/docs/articles/index.html b/docs/articles/index.html index dd023680..44db2a0c 100644 --- a/docs/articles/index.html +++ b/docs/articles/index.html @@ -23,7 +23,8 @@ - + + @@ -51,10 +52,16 @@ diff --git a/docs/articles/tikz_prob_algo_simple.pdf b/docs/articles/tikz_prob_algo_simple.pdf new file mode 100644 index 00000000..1ddc9954 Binary files /dev/null and b/docs/articles/tikz_prob_algo_simple.pdf differ diff --git a/docs/articles/v00_Setup.html b/docs/articles/v00_Setup.html deleted file mode 100644 index c5828ea1..00000000 --- a/docs/articles/v00_Setup.html +++ /dev/null @@ -1,150 +0,0 @@ - - - - - - - -Setup for batchtools • batchtools - - - - - - -
-
- - - -
-
- - - - -
-
-

-Cluster Functions

-

The communication with the batch system is managed via so-called cluster functions. They are created with the constructor makeClusterFunctions which defines how jobs are submitted on your system. Furthermore, you may provide functions to list queued/running jobs and to kill jobs.

-

Usually you do not have to start from scratch but can just use one of the cluster functions which ship with the package:

- -

To use the package with the socket cluster functions, you would call the respective constructor makeClusterFunctionsSocket():

-
reg = makeRegistry(NA)
-
## Sourcing configuration file '/home/lang/.config/batchtools/config.R' ...
-
## Loading required package: methods
-
reg$cluster.functions = makeClusterFunctionsSocket(2)
-

To make this selection permanent for this registry, save the Registry with saveRegistry. To make your cluster function selection permanent for a specific system across R sessions for all new Registries, you can set up a configuration file (see below).

-

If you have trouble debugging your cluster functions, you can enable the debug mode for extra output. To do so, install the debugme package and set the environment variable DEBUGME to batchtools before you load the batchtools package:

-
Sys.setenv(DEBUGME = "batchtools")
-library(batchtools)
-
-
-

-Template files

-

Many cluster functions require a template file as argument. These templates are used to communicate with the scheduler and contain placeholders to evaluate arbitrary R expressions. Internally, the brew package is used for this purpose. Some exemplary template files can be found here. It would be great if you would help expand this collection to cover more exotic configurations. To do so, please send your template via mail or open a new pull request.

-

Note that all variables defined in a JobCollection can be used inside the template. If you need to pass extra variables, you can set them via the argument resources of submitJobs.

-

If the flexibility which comes with templating is not sufficient, you can still construct a custom cluster function implementation yourself using the provided constructor.

-
-
-

-Configuration file

-

The configuration file can be used to set system specific options. Its default location depends on the operating system (see Registry), but for the first time setup you can put one in the current working directory (as reported by getwd()). In order to set the cluster function implementation, you would generate a file with the following content:

-
cluster.functions = makeClusterFunctionsInteractive()
-

The configuration file is parsed whenever you create or load a Registry. It is sourced inside of your registry which has the advantage that you can (a) access all of the parameters which are passed to makeRegistry and (b) you can also directly change them. Lets say you always want your working directory in your home directory and you always want to load the checkmate package on the nodes, you can just append these lines:

-
work.dir = "~"
-packages = union(packages, "checkmate")
-

See the documentation on Registry for a more complete list of supported configuration options.

-
-
-
- - - -
- - - -
- - - diff --git a/docs/articles/v01_Migration.html b/docs/articles/v01_Migration.html deleted file mode 100644 index a21757e7..00000000 --- a/docs/articles/v01_Migration.html +++ /dev/null @@ -1,249 +0,0 @@ - - - - - - - -Migrating from BatchJobs/BatchExperiments • batchtools - - - - - - -
-
- - - -
-
- - - - -
-

The development of BatchJobs and BatchExperiments is discontinued because of the following reasons:

-
    -
  • Maintainability: The packages BatchJobs and BatchExperiments are tightly connected which makes maintaining difficult. Changes have to be synchronized and tested against the current CRAN versions for compatibility. Furthermore, BatchExperiments violates CRAN policies by calling internal functions of BatchJobs.
  • -
  • Data base issues: Although we invested weeks to mitigate issues with locks of the SQLite data base or file system (staged queries, file system timeouts, …), BatchJobs kept working unreliably on some systems with high latency or specific file systems. This made BatchJobs unusable for many users.
  • -
-

BatchJobs and BatchExperiments will remain on CRAN, but new features are unlikely to be ported back.

-
-

-Comparison with BatchJobs/BatchExperiments

-
-

-Internal changes

-
    -
  • batchtools does not use SQLite anymore. Instead, all the information is stored directly in the registry using data.tables acting as an in-memory database. As a side effect, many operations are much faster.
  • -
  • Nodes do not have to access the registry. submitJobs() stores a temporary object of type JobCollection on the file system which holds all the information necessary to execute a chunk of jobs via doJobCollection() on the node. This avoids file system locks because each job accesses only one file exclusively.
  • -
  • -ClusterFunctionsMulticore now uses the parallel package for multicore execution. ClusterFunctionsSSH can still be used to emulate a scheduler-like system which respects the work load on the local machine.
  • -
-
-
-

-Interface changes

-
    -
  • batchtools remembers the last created or loaded Registry and sets it as default registry. This way, you do not need to pass the registry around anymore. If you need to work with multiple registries simultaneously on the other hand, you can still do so by explicitly passing registries to the functions.
  • -
  • Most functions now return a data.table which is keyed with the job.id. This way, return values can be joined together easily and efficiently (see this help page for some examples).
  • -
  • The building blocks of a problem have been renamed from static and dynamic to the more intuitive data and fun. Thus, algorithm functions should have the formal arguments job, data and instance.
  • -
  • The function makeDesign has been removed. Parameters can be defined by just passing a data.frame or data.table to addExperiments. For exhaustive designs, use expand.grid() or data.table::CJ().
  • -
-
-
-

-Template changes

-
    -
  • The scheduler should directly execute the command Rscript -e 'batchtools::doJobCollection(<filename>)'. There is no intermediate R source file like in BatchJobs.
  • -
  • All information stored in the object JobCollection can be accessed while brewing the template.
  • -
  • Some variable names have changed and need to be adapted, e.g. job.name is now job.hash.
  • -
  • Extra variables may be passed via the argument resources of submitJobs.
  • -
-
-
-

-New features

-
    -
  • Support for Docker Swarm via ClusterFunctionsDocker.
  • -
  • Jobs can now be tagged and untagged to provide an easy way to group them.
  • -
  • Some resources like the number of CPUs are now optionally passed to parallelMap. This eases nested parallelization, e.g. to use multicore parallelization on the slave by just setting a resource on the master. See submitJobs() for an example.
  • -
  • -ClusterFunctions are now more flexible in general as they can define hook functions which will be called at certain events. ClusterFunctionsDocker is an example use case which implements a housekeeping routine. This routine is called every time before a job is about to get submitted to the scheduler (in the case: the Docker Swarm) via the hook pre.submit and every time directly after the registry synchronized jobs stored on the file system via the hook post.sync.
  • -
  • More new features are covered in the NEWS.
  • -
-
-
-
-

-Porting to batchtools

-

The following table assists in porting to batchtools by mapping BatchJobs/BatchExperiments functions to their counterparts in batchtools. The table does not cover functions which are (a) used only internally in BatchJobs and (b) functions which have not been renamed.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
BatchJobsbatchtools
addRegistryPackagesSet reg$packages or reg$namespaces, call saveRegistry -
addRegistrySourceDirs-
addRegistrySourceFilesSet reg$source, call saveRegistry -
batchExpandGrid -batchMap: batchMap(..., args = CJ(x = 1:3, y = 1:10)) -
batchMapQuickbtmapply
batchReduceResults-
batchUnexportbatchExport
filterResults-
getJobIdsfindJobs
getJobInfogetJobStatus
getJobmakeJob
getJobParamDfgetJobPars
loadResultsreduceResultsList
reduceResultsDataFramereduceResultsDataTable
reduceResultsMatrix -reduceResultsList + do.call(rbind, res) -
reduceResultsVectorreduceResultsDataTable
setJobFunction-
setJobNames-
showStatusgetStatus
-
-
-
- - - -
- - - -
- - - diff --git a/docs/articles/v10_ExamplePiSim.html b/docs/articles/v10_ExamplePiSim.html deleted file mode 100644 index 4851924b..00000000 --- a/docs/articles/v10_ExamplePiSim.html +++ /dev/null @@ -1,142 +0,0 @@ - - - - - - - -Example 1: Approximation of Pi • batchtools - - - - - - -
-
- - - -
-
- - - - -
-

To get a first insight into the usage of batchtools, we start with an exemplary Monte Carlo simulation to approximate \(\pi\). For background information, see Wikipedia.

-

First, a so-called registry object has to be created, which defines a directory where all relevant information, files and results of the computational jobs will be stored. There are two different types of registry objects: First, a regular Registry which we will use in this example. Second, an ExperimentRegistry which provides an alternative way to define computational jobs and thereby is tailored for a broad range of large scale computer experiments (see, for example, this vignette). Here, we use a temporary registry which is stored in the temp directory of the system and gets automatically deleted if you close the R session.

-
library(batchtools)
-reg = makeRegistry(file.dir = NA, seed = 1)
-

For a permanent registry, set the file.dir to a valid path. It can then be reused later, e.g., when you login to the system again, by calling the function loadRegistry(file.dir).

-

When a registry object is created or loaded, it is stored for the active R session as the default. Therefore the argument reg will be ignored in functions calls of this example, assuming the correct registry is set as default. To get the current default registry, getDefaultRegistry can be used. To switch to another registry, use setDefaultRegistry().

-

First, we create a function which samples \(n\) points \((x_i, y_i)\) whereas \(x_i\) and \(y_i\) are distributed uniformly, i.e. \(x_i, y_i \sim \mathcal{U}(0,1)\). Next, the distance to the origin \((0, 0)\) is calculated and the fraction of points in the unit circle (\(d \leq 1\)) is returned.

-
piApprox = function(n) {
-  nums = matrix(runif(2 * n), ncol = 2)
-  d = sqrt(nums[, 1]^2 + nums[, 2]^2)
-  4 * mean(d <= 1)
-}
-piApprox(1000)
-
## [1] 3.108
-

We now parallelize piApprox() with batchtools: We create 10 jobs, each doing a MC simulation with \(10^5\) jobs. We use batchMap() to define the jobs (note that this does not yet start the calculation):

-
batchMap(fun = piApprox, n = rep(1e5, 10))
-
## Adding 10 jobs ...
-

The length of the vector or list defines how many different jobs are created, while the elements themselves are used as arguments for the function. The function batchMap(fun, ...) works analogously to Map(f, ...) of the base package. An overview of the jobs and their IDs can be retrieved with getJobTable() which returns a data frame with all relevant information:

-
names(getJobTable())
-
##  [1] "job.id"       "submitted"    "started"      "done"        
-##  [5] "error"        "memory"       "batch.id"     "log.file"    
-##  [9] "job.hash"     "time.queued"  "time.running" "n"           
-## [13] "tags"
-

Note that a unique job ID is assigned to each job. These IDs can be used to restrict operations to subsets of jobs. To actually start the calculation, call submitJobs(). The registry and the selected job IDs can be taken as arguments as well as an arbitrary list of resource requirements, which are to be handled by the cluster back end.

-
submitJobs(resources = list(walltime = 3600, memory = 1024))
-
## Ignoring resource 'chunks.as.arrayjobs', not supported by cluster functions 'Interactive'
-
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
-

In this example, a cap for the execution time (so-called walltime) and for the maximum memory requirements are set. The progress of the submitted jobs can be checked with getStatus().

- -
## Syncing 10 files ...
-
## Status for 10 jobs:
-##   Submitted : 10 (100.0%)
-##   Queued    :  0 (  0.0%)
-##   Started   : 10 (100.0%)
-##   Running   :  0 (  0.0%)
-##   Done      : 10 (100.0%)
-##   Error     :  0 (  0.0%)
-##   Expired   :  0 (  0.0%)
-

The resulting output includes the number of jobs in the registry, how many have been submitted, have started to execute on the batch system, are currently running, have successfully completed, and have terminated due to an R exception. After jobs have successfully terminated, we can load their results on the master. This can be done in a simple fashion by using either loadResult(), which returns a single result exactly in the form it was calculated during mapping, or by using reduceResults(), which is a version of Reduce() from the base package for registry objects.

- -
## [1] TRUE
-
mean(sapply(1:10, loadResult))
-
## [1] 3.140652
-
reduceResults(function(x, y) x + y) / 10
-
## [1] 3.140652
-

If you are absolutely sure that your function works, you can take a shortcut and use batchtools in an lapply fashion using btlapply(). This function creates a temporary registry (but you may also pass one yourself), calls batchMap(), waits for the jobs to terminate with waitForJobs(), and then uses reduceResultsList() to return the results.

-
res = btlapply(rep(1e5, 10), piApprox)
-
## Sourcing configuration file '/home/lang/.config/batchtools/config.R' ...
-
## Adding 10 jobs ...
-
## Ignoring resource 'chunks.as.arrayjobs', not supported by cluster functions 'Interactive'
-
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
-
## Syncing 10 files ...
-
mean(unlist(res))
-
## [1] 3.139272
-
-
- - - -
- - - -
- - - diff --git a/docs/articles/v11_ExampleExperiment.html b/docs/articles/v11_ExampleExperiment.html deleted file mode 100644 index 5184fd7e..00000000 --- a/docs/articles/v11_ExampleExperiment.html +++ /dev/null @@ -1,249 +0,0 @@ - - - - - - - -Example 2: Problems and Algorithms • batchtools - - - - - - -
-
- - - -
-
- - - - -
-
-

-Intro

-

We stick to a rather simple, but not unrealistic example to explain some further functionalities: Applying two classification learners to the famous iris data set (Anderson 1935), vary a few hyperparameters and evaluate the effect on the classification performance.

-

First, we create a registry, the central meta-data object which records technical details and the setup of the experiments. We use an ExperimentRegistry where the job definition is split into creating problems and algorithms. See the paper on BatchJobs and BatchExperiments for a detailed explanation. Again, we use a temporary registry and make it the default registry.

-
library(batchtools)
-reg = makeExperimentRegistry(file.dir = NA, seed = 1)
-
-
-

-Problems and algorithms

-

By adding a problem to the registry, we can define the data on which certain computational jobs shall work. This can be a matrix, data frame or array that always stays the same for all subsequent experiments. But it can also be of a more dynamic nature, e.g., subsamples of a dataset or random numbers drawn from a probability distribution . Therefore the function addProblem() accepts static parts in its data argument, which is passed to the argument fun which generates a (possibly stochastic) problem instance. For data, any R object can be used. If only data is given, the generated instance is data. The argument fun has to be a function with the arguments data and job (and optionally other arbitrary parameters). The argument job is an object of type Job which holds additional information about the job.

-

We want to split the iris data set into a training set and a test set. In this example we use subsampling, which just randomly takes a fraction of the observations as the training set. We define a problem function which returns the indices of the respective training and test set for a split with 100 * ratio% of the observations being in the training set:

-
subsample = function(data, job, ratio, ...) {
-  n = nrow(data)
-  train = sample(n, floor(n * ratio))
-  test = setdiff(seq_len(n), train)
-  list(test = test, train = train)
-}
-

addProblem() files the problem to the file system and the problem gets recorded in the registry.

-
data("iris", package = "datasets")
-addProblem(name = "iris", data = iris, fun = subsample, seed = 42)
-

The function call will be evaluated at a later stage on the workers. In this process, the data part will be loaded and passed to the function. Note that we set a problem seed to synchronize the experiments in the sense that the same resampled training and test sets are used for the algorithm comparison in each distinct replication.

-

The algorithms for the jobs are added to the registry in a similar manner. When using addAlgorithm(), an identifier as well as the algorithm to apply to are required arguments. The algorithm must be given as a function with arguments job, data and instance. Further arbitrary arguments (e.g., hyperparameters or strategy parameters) may be defined analogously as for the function in addProblem. The objects passed to the function via job and data are here the same as above, while via instance the return value of the evaluated problem function is passed. The algorithm can return any R object which will automatically be stored on the file system for later retrieval. Firstly, we create an algorithm which applies a support vector machine:

-
svm.wrapper = function(data, job, instance, ...) {
-  library("e1071")
-  mod = svm(Species ~ ., data = data[instance$train, ], ...)
-  pred = predict(mod, newdata = data[instance$test, ], type = "class")
-  table(data$Species[instance$test], pred)
-}
-addAlgorithm(name = "svm", fun = svm.wrapper)
-

Secondly, a random forest of classification trees:

-
forest.wrapper = function(data, job, instance, ...) {
-  library("ranger")
-  mod = ranger(Species ~ ., data = data[instance$train, ], write.forest = TRUE)
-  pred = predict(mod, data = data[instance$test, ])
-  table(data$Species[instance$test], pred$predictions)
-}
-addAlgorithm(name = "forest", fun = forest.wrapper)
-

Both algorithms return a confusion matrix for the predictions on the test set, which will later be used to calculate the misclassification rate.

-

Note that using the ... argument in the wrapper definitions allows us to circumvent naming specific design parameters for now. This is an advantage if we later want to extend the set of algorithm parameters in the experiment. The algorithms get recorded in the registry and the corresponding functions are stored on the file system.

-

Defined problems and algorithms can be queried:

- -
## [1] "iris"
- -
## [1] "svm"    "forest"
-

The flow to define experiments is summarized in the following figure:

-
-
-

-Creating jobs

-

addExperiments() is used to parametrize the jobs and thereby define computational jobs. To do so, you have to pass named lists of parameters to addExperiments(). The elements of the respective list (one for problems and one for algorithms) must be named after the problem or algorithm they refer to. The data frames contain parameter constellations for the problem or algorithm function where columns must have the same names as the target arguments. When the problem design and the algorithm design are combined in addExperiments(), each combination of the parameter sets of the two designs defines a distinct job. How often each of these jobs should be computed can be determined with the argument repls.

-
# problem design: try two values for the ratio parameter
-pdes = list(iris = data.frame(ratio = c(0.67, 0.9)))
-
-# algorithm design: try combinations of kernel and epsilon exhaustively,
-# try different number of trees for the forest
-ades = list(
-  svm = expand.grid(kernel = c("linear", "polynomial", "radial"), epsilon = c(0.01, 0.1)),
-  forest = data.frame(ntree = c(100, 500, 1000))
-)
-
-addExperiments(pdes, ades, repls = 5)
-
## Adding 60 experiments ('iris'[2] x 'svm'[6] x repls[5]) ...
-
## Adding 30 experiments ('iris'[2] x 'forest'[3] x repls[5]) ...
-

The jobs are now available in the registry, each with an individual job ID. The function summarizeExperiments() returns a table which gives a quick overview of all defined experiments.
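The call producing the first table below is not shown here; by default, summarizeExperiments() groups by problem and algorithm (a sketch):
summarizeExperiments()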

- -
##    problem algorithm .count
-## 1:    iris       svm     60
-## 2:    iris    forest     30
-
summarizeExperiments(by = c("problem", "algorithm", "ratio"))
-
##    problem algorithm ratio .count
-## 1:    iris       svm  0.67     30
-## 2:    iris       svm  0.90     30
-## 3:    iris    forest  0.67     15
-## 4:    iris    forest  0.90     15
-
-
-

-Before you submit

-

Before submitting all jobs to the batch system, we encourage you to test each algorithm individually. Sometimes you may also want to submit only a subset of experiments because the jobs vastly differ in runtime, and another recurring task is collecting results for only a subset of experiments. For all these use cases, findExperiments() can be employed to conveniently select a particular subset of jobs. It returns the IDs of all experiments that match the given criteria. Your selection can depend on substring matches of problem or algorithm IDs using prob.name or algo.name, respectively. You can also pass R expressions, which will be evaluated in your problem parameter setting (prob.pars) or algorithm parameter setting (algo.pars); the expression is then expected to evaluate to a Boolean value. Furthermore, you can restrict the experiments to specific replication numbers.

-

To illustrate findExperiments(), we will select two experiments, one with a support vector machine and the other with a random forest and the parameter ntree = 1000. The selected experiment IDs are then passed to testJob.

-
id1 = head(findExperiments(algo.name = "svm"), 1)
-print(id1)
-
##    job.id
-## 1:      1
-
id2 = head(findExperiments(algo.name = "forest", algo.pars = (ntree == 1000)), 1)
-print(id2)
-
##    job.id
-## 1:     71
-
testJob(id = id1)
-
## Generating problem instance for problem 'iris' ...
-## Applying algorithm 'svm' on problem 'iris' ...
-
##             pred
-##              setosa versicolor virginica
-##   setosa         17          0         0
-##   versicolor      0         16         2
-##   virginica       0          0        15
-
testJob(id = id2)
-
## Generating problem instance for problem 'iris' ...
-## Applying algorithm 'forest' on problem 'iris' ...
-
##             
-##              setosa versicolor virginica
-##   setosa         17          0         0
-##   versicolor      0         16         2
-##   virginica       0          1        14
-

If something goes wrong, batchtools comes with a bunch of useful debugging utilities (see separate vignette on error handling). If everything turns out fine, we can proceed with the calculation.

-
-
-

-Submitting and collecting results

-

To submit the jobs, we call submitJobs() and wait for all jobs to terminate using waitForJobs().
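The corresponding calls are not rendered on this page; a minimal sketch (any resources passed to submitJobs(), such as the chunks.as.arrayjobs flag the log mentions, are an assumption and omitted here):
submitJobs()   # resources omitted in this sketch
waitForJobs()  # block until all jobs have terminated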

- -
## Ignoring resource 'chunks.as.arrayjobs', not supported by cluster functions 'Interactive'
-
## Submitting 90 jobs in 90 chunks using cluster functions 'Interactive' ...
- -
## Syncing 90 files ...
-
## [1] TRUE
-

After the jobs have finished, the results can be collected with reduceResultsDataTable(), where we directly extract the misclassification error:

-
results = reduceResultsDataTable(fun = function(res) (list(mce = (sum(res) - sum(diag(res))) / sum(res))))
-head(results)
-
##    job.id  mce
-## 1:      1 0.04
-## 2:      2 0.00
-## 3:      3 0.06
-## 4:      4 0.04
-## 5:      5 0.02
-## 6:      6 0.06
-

Next, we merge the results table with the table of job parameters using one of the join helpers provided by batchtools (here, we use an inner join):

-
tab = ijoin(getJobPars(), results)
-head(tab)
-
##    job.id problem algorithm ratio     kernel epsilon ntree  mce
-## 1:      1    iris       svm  0.67     linear    0.01    NA 0.04
-## 2:      2    iris       svm  0.67     linear    0.01    NA 0.00
-## 3:      3    iris       svm  0.67     linear    0.01    NA 0.06
-## 4:      4    iris       svm  0.67     linear    0.01    NA 0.04
-## 5:      5    iris       svm  0.67     linear    0.01    NA 0.02
-## 6:      6    iris       svm  0.67 polynomial    0.01    NA 0.06
-

We now aggregate the results group-wise. You can use data.table, base::aggregate(), or the dplyr package for this purpose (a dplyr equivalent is sketched after the output below). Here, we use data.table to subset the table to jobs where the ratio is 0.67 and group by the algorithm and its hyperparameters:

-
tab[ratio == 0.67, list(mmce = mean(mce)), by = c("algorithm", "kernel", "epsilon", "ntree")]
-
##    algorithm     kernel epsilon ntree  mmce
-## 1:       svm     linear    0.01    NA 0.032
-## 2:       svm polynomial    0.01    NA 0.088
-## 3:       svm     radial    0.01    NA 0.048
-## 4:       svm     linear    0.10    NA 0.032
-## 5:       svm polynomial    0.10    NA 0.088
-## 6:       svm     radial    0.10    NA 0.048
-## 7:    forest         NA      NA   100 0.048
-## 8:    forest         NA      NA   500 0.052
-## 9:    forest         NA      NA  1000 0.044
-
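For comparison, the same aggregation with dplyr might look like this (a sketch, not part of the original vignette):
library(dplyr)
tab %>%
  filter(ratio == 0.67) %>%
  group_by(algorithm, kernel, epsilon, ntree) %>%
  summarise(mmce = mean(mce))
Note that group_by() keeps the NA levels of the unused hyperparameters, so the grouping matches the data.table output above.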
-
-
diff --git a/docs/articles/v20_ErrorHandling.html b/docs/articles/v20_ErrorHandling.html
deleted file mode 100644
index 97ec951e..00000000
--- a/docs/articles/v20_ErrorHandling.html
+++ /dev/null
@@ -1,148 +0,0 @@
-Error Handling • batchtools
-

In any large-scale experiment, many things can and will go wrong. The cluster might have an outage, jobs may run into resource limits or crash, subtle bugs in your code could be triggered, or any other error condition might arise. In these situations it is important to quickly determine what went wrong and to recompute only the minimal number of required jobs.

-

Therefore, before you submit anything you should use testJob() to catch errors that are easy to spot because they are raised in many or all jobs. If external is set, this function runs the job without side effects in an independent R process on your local machine via Rscript, similar to how it would be executed on a slave; it redirects the output of the process to your R console, loads the job result and returns it. If you do not set external, the job is executed in the currently running R session, with the drawback that you might be unable to catch missing variable declarations or missing package dependencies.

-

By way of illustration, here is a small example. First, we create a temporary registry.

-
library(batchtools)
-reg = makeRegistry(file.dir = NA, seed = 1)
-

Ten jobs are created; two of them will throw an exception.

-
flakeyFunction <- function(value) {
-  if (value %in% c(2, 9)) stop("Ooops.")
-  value^2
-}
-batchMap(flakeyFunction, 1:10)
-
## Adding 10 jobs ...
-

Now that the jobs are defined, we can test jobs independently:

-
testJob(id = 1)
-
## [1] 1
-
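The same test can also be run in a separate R process by setting external; a sketch (output not shown here):
testJob(id = 1, external = TRUE)  # runs the job via Rscript, outside the current session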

In this case, testing the job with ID = 1 provides the appropriate result but testing the job with ID = 2 leads to an error:

-
as.character(try(testJob(id = 2)))
-
## [1] "Error in (function (value)  : Ooops.\n"
-

When you have already submitted the jobs and suspect that something is going wrong, the first thing to do is to run getStatus() to display a summary of the current state of the system.
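The calls behind the output below are hidden on this page; a minimal sketch (submission resources, if any, are an assumption and omitted here):
submitJobs()   # resources omitted in this sketch
waitForJobs()  # returns FALSE here because two jobs terminate with an error
getStatus()    # summary of the current state of the registry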

- -
## Ignoring resource 'chunks.as.arrayjobs', not supported by cluster functions 'Interactive'
-
## Submitting 10 jobs in 10 chunks using cluster functions 'Interactive' ...
- -
## Syncing 10 files ...
-
## [1] FALSE
- -
## Status for 10 jobs:
-##   Submitted : 10 (100.0%)
-##   Queued    :  0 (  0.0%)
-##   Started   : 10 (100.0%)
-##   Running   :  0 (  0.0%)
-##   Done      :  8 ( 80.0%)
-##   Error     :  2 ( 20.0%)
-##   Expired   :  0 (  0.0%)
-

The status message shows that two of the jobs could not be executed successfully. To get the IDs of all jobs that failed due to an error, we can use findErrors(), and to retrieve the actual error messages, we can use getErrorMessages().
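A sketch of the corresponding calls (not rendered on this page):
findErrors()         # job ids of all jobs that threw an error
getErrorMessages()   # table with the stored error messages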

- -
##    job.id
-## 1:      2
-## 2:      9
- -
##    job.id terminated error                              message
-## 1:      2       TRUE  TRUE Error in (function (value)  : Ooops.
-## 2:      9       TRUE  TRUE Error in (function (value)  : Ooops.
-

If we want to peek into the R log file of a job to see more context for the error, we can use showLog(), which opens a pager, or getLog(), which returns the log as a character vector:

-
writeLines(getLog(id = 9))
-
## ### [bt 2017-04-20 12:09:20]: This is batchtools v0.9.2.9000
-## ### [bt 2017-04-20 12:09:20]: Starting calculation of 1 jobs
-## ### [bt 2017-04-20 12:09:20]: Setting working directory to '/tmp'
-## ### [bt 2017-04-20 12:09:20]: Memory measurement disabled
-## ### [bt 2017-04-20 12:09:20]: Starting job [batchtools job.id=9]
-## Error in (function (value)  : Ooops.
-## 
-## ### [bt 2017-04-20 12:09:20]: Job terminated with an exception [batchtools job.id=9]
-## ### [bt 2017-04-20 12:09:20]: Calculation finished!
-

You can also grep for error or warning messages:

-
ids = grepLogs(pattern = "ooops", ignore.case = TRUE)
-print(ids)
-
##    job.id                              matches
-## 1:      2 Error in (function (value)  : Ooops.
-## 2:      9 Error in (function (value)  : Ooops.
-
-
diff --git a/docs/authors.html b/docs/authors.html
index 8a4ad372..89c27643 100644
--- a/docs/authors.html
+++ b/docs/authors.html
@@ -6,7 +6,7 @@
-Authors • batchtools
+Citation and Authors • batchtools

Lang M, Bischl B and Surmann D (2017). +“batchtools: Tools for R to work on batch systems.” +The Journal of Open Source Software, 2(10). +doi: 10.21105/joss.00135, https://doi.org/10.21105/joss.00135. +

+
@Article{,
+  title = {batchtools: Tools for R to work on batch systems},
+  author = {Michel Lang and Bernd Bischl and Dirk Surmann},
+  journal = {The Journal of Open Source Software},
+  year = {2017},
+  month = {feb},
+  volume = {2},
+  number = {10},
+  doi = {10.21105/joss.00135},
+  url = {https://doi.org/10.21105/joss.00135},
+}
+

Bischl B, Lang M, Mersmann O, Rahnenführer J and Weihs C (2015). +“BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments.” +Journal of Statistical Software, 64(11), pp. 1–25. +http://www.jstatsoft.org/v64/i11/. +

+
@Article{,
+  title = {{BatchJobs} and {BatchExperiments}: Abstraction Mechanisms for Using {R} in Batch Environments},
+  author = {Bernd Bischl and Michel Lang and Olaf Mersmann and J{\"o}rg Rahnenf{\"u}hrer and Claus Weihs},
+  journal = {Journal of Statistical Software},
+  year = {2015},
+  volume = {64},
+  number = {11},
+  pages = {1--25},
+  url = {http://www.jstatsoft.org/v64/i11/},
+}
diff --git a/docs/index.html b/docs/index.html index f399bab6..6453c54c 100644 --- a/docs/index.html +++ b/docs/index.html @@ -29,10 +29,16 @@

@@ -84,7 +90,7 @@

  • Maintainability: The packages BatchJobs and BatchExperiments are tightly connected which makes maintenance difficult. Changes have to be synchronized and tested against the current CRAN versions for compatibility. Furthermore, BatchExperiments violates CRAN policies by calling internal functions of BatchJobs.
  • Data base issues: Although we invested weeks to mitigate issues with locks of the SQLite data base or file system (staged queries, file system timeouts, …), BatchJobs kept working unreliably on some systems with high latency or specific file systems. This made BatchJobs unusable for many users.
  • -

    BatchJobs and BatchExperiments will remain on CRAN, but new features are unlikely to be ported back. See this vignette for a comparison of the packages.

    +

    BatchJobs and BatchExperiments will remain on CRAN, but new features are unlikely to be ported back. The vignette contains a section comparing the packages.

    @@ -92,11 +98,11 @@

    @@ -147,6 +153,10 @@

    Links

    License

    LGPL-3

    +

    Citation

    +

    Developers

    • Michel Lang
      Maintainer, author
    • diff --git a/docs/news/index.html b/docs/news/index.html index 78d87923..26821c4b 100644 --- a/docs/news/index.html +++ b/docs/news/index.html @@ -23,7 +23,8 @@ - + + @@ -51,10 +52,16 @@
      +
      +

      +batchtools 0.9.4

      +
        +
      • Fixed handling of file.dir with special chars like whitespace.
      • +
      • All backward slashes will now be converted to forward slashes on windows.
      • +
      • Fixed order of arguments in findExperiments() (argument ids is now first).
      • +
      • Removed code to upgrade registries created with versions prior to v0.9.0 (first CRAN release).
      • +
      • +addExperiments() now warns if a design is passed as data.frame with factor columns and stringsAsFactors is TRUE.
      • +
      +

      batchtools 0.9.3

      @@ -98,6 +117,8 @@

    • Introduced flatten to control if the result should be represented as a column of lists or flattened as separate columns. Defaults to a backward-compatible heuristic, similar to getJobPars.
    +
  • Improved heuristic to look up template files. Templates shipped with the package can now be used by providing just the file name (w/o extension).
  • +
  • Updated CITATION
  • @@ -130,7 +151,7 @@

    batchtools 0.9.0

    -

    Initial CRAN release. See this vignette for a brief comparison with BatchJobs/BatchExperiments.

    +

    Initial CRAN release. See the vignette for a brief comparison with BatchJobs/BatchExperiments.

    @@ -139,6 +160,7 @@

    Contents