This repository contains scripts and notebooks to reproduce the experiments and analyses of the paper
Adrian Englhardt, Klemens Böhm, “Exploring the Unknown - Query Synthesis in One-Class Active Learning”. In: Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), DOI: 10.1137/1.9781611976236.17, May 7-9, 2020, Cincinnati, Ohio, USA.
For more information about this research project, see also the project website. For a general overview and a benchmark on one-class active learning see the OCAL project website.
The analysis and main results of the experiments can be found under `notebooks`:

- `domain_expansion_strategy.ipynb`: Figure 3
- `experiment_evaluation.ipynb`: Figure 4 and Table 1
- `svdd_neg_eps.ipynb`: Example for `SVDDnegEps`

To execute the notebooks, make sure you follow the setup and download the raw results into `data/output/`.
The experiments are implemented in Julia, and some of the evaluation notebooks are written in Python. This repository contains code to set up, execute, and analyze the experiments. The one-class classifiers (SVDDneg) and active learning methods (all query synthesis strategies) are implemented in two separate Julia packages: SVDD.jl and OneClassActiveLearning.jl.
Just clone the repo:

```
$ git clone https://github.com/englhardt/des-evaluation.git
```

- Experiments require Julia 1.1.0; the dependencies are defined in `Manifest.toml`. To instantiate, start julia in the `des-evaluation` directory with `julia --project` and run `julia> ]instantiate`. See the Julia documentation for general information on how to set up this project.
- Notebooks require:
  - Julia 1.1.0 (dependencies are already installed in the previous step)
  - Python 3.7 and `pipenv`. Run `pipenv install` to install all dependencies.
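The Julia part of the setup can also be done from the REPL. A minimal sketch, assuming the repository was cloned into the current directory:

```julia
# Equivalent to starting `julia --project` in des-evaluation and running `]instantiate`.
using Pkg
Pkg.activate("des-evaluation")  # activate the project environment
Pkg.instantiate()               # install the pinned dependencies from Manifest.toml
```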
The repository is organized as follows:

- `data`
  - `input`
    - `raw`: unprocessed data files
      - `dami`: contains data set collections `literature` and `semantic` from the DAMI repository
    - `processed`: output directory of `preprocessing_dami.py`
  - `output`: output directory of experiments; `generate_biased_sample_experiments.jl` and `generate_gmm_holdout_experiments.jl` create the folder structure and experiments; `run_experiments.jl` writes results and log files
- `notebooks`: jupyter notebooks to analyze experimental results
  - `domain_expansion_strategy.ipynb`: Figure 3
  - `experiment_evaluation.ipynb`: Figure 4 and Table 1
  - `svdd_neg_eps.ipynb`: Example for `SVDDnegEps`
- `scripts`
  - `config`: configuration files for experiments
    - `config.jl`: high-level configuration, e.g., for the number of workers
    - `config_eval_part_1.jl`: experiment config for synthetic data sets
    - `config_eval_part_2_qss.jl`: experiment config for real-world data sets
  - `biased_sample_utils.jl`: utilities to generate biased samples in existing data sets
  - `generate_biased_sample_experiments.jl`: generates experiments on real-world data
  - `generate_gmm_holdout_experiments.jl`: generates experiments on synthetic data
  - `gmm_utils.jl`: utilities to generate synthetic domain expansion problems
  - `preprocessing_dami.py`: preprocesses DAMI data sets
  - `reduce_results.jl`: combines result files into a single CSV
  - `run_experiments.jl`: executes experiments
Each step of the experiments can be reproduced, from the raw data files to the final plots that are presented in the paper. The experiment is a pipeline of several dependent processing steps. Each step can be executed standalone; it takes a well-defined input and produces a specified output. The section Experiment Pipeline describes each of the processing steps.
Running the benchmark is compute-intensive and takes many CPU hours. Therefore, we also provide the results to download (866 MB). This allows analyzing the results in the notebooks without having to run the whole pipeline.
The code is licensed under an MIT License and the result data under a Creative Commons Attribution 4.0 International License. If you use this code or data set in your scientific work, please reference the companion paper.
The experiment pipeline uses config files to set paths and experiment parameters. There are two types of config files:

- `scripts/config/config.jl`: defines high-level information on the experiment, such as the number of workers, where the data files are located, and log levels.
- `scripts/config/<config_eval_part_1|config_eval_part_2_qss>.jl`: these config files define the experimental grid, including the data sets, classifiers, and active-learning strategies.
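The exact parameter names live in the config files themselves; as a purely hypothetical illustration of the kind of settings involved:

```julia
# Hypothetical illustration only: the real parameter names and values are
# defined in scripts/config/config.jl and the part-specific config files.
num_workers = 4                                # local worker processes
data_input_root = "data/input/processed/dami"  # where the preprocessed data lives
data_output_root = "data/output"               # where experiments and results are written
log_level = "INFO"                             # logging verbosity
```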
1. Data Preprocessing: The preprocessing step transforms publicly available benchmark data sets into a common csv format and performs feature selection.
   - Input: Download `semantic.tar.gz` and `literature.tar.gz` containing the `.arff` files from the DAMI benchmark repository and extract them into `data/input/raw/dami/<data set>` (e.g., `data/input/raw/dami/literature/ALOI/` or `data/input/raw/dami/semantic/Annthyroid`).
   - Execution: `$ pipenv run preprocessing`
   - Output: `.csv` files in `data/input/processed/dami/`

   We also provide our preprocessed data to download (3.7 MB).
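A quick sanity check for this step; a sketch using only Julia's standard library, assuming the default output location:

```julia
# Sketch: verify that preprocessing produced csv files for each data set.
dir = "data/input/processed/dami"
csvs = [joinpath(root, f) for (root, _, fs) in walkdir(dir) for f in fs if endswith(f, ".csv")]
println(length(csvs), " processed csv files")
```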
2. Generate Experiments: This step creates a set of experiments. For the synthetic evaluation, the scripts generate the data as well.
   - Input: Full path to config file `<config_eval_part_1.jl|config_eval_part_2_qss.jl>` (e.g., `config/config_eval_part_1.jl`), preprocessed data files
   - Execution:

     ```
     $ julia --project scripts/generate_experiments.jl $(DIR)/scripts/config/config_eval_part_1.jl
     $ julia --project scripts/generate_experiments.jl $(DIR)/scripts/config/config_eval_part_2_qss.jl
     ```

   - Output: Creates an experiment directory named `<exp_name>` that contains several items:
     - `log` directory: skeleton for experiment logs (one file per experiment) and worker logs (one file per worker)
     - `results` directory: skeleton for result files
     - `experiments.jser`: contains a serialized Julia Array with experiments. Each experiment is a Dict that contains the specific parameter combination and can be identified by a unique hash value.
     - `experiment_hashes`: file that contains the hash values of the experiments stored in `experiments.jser`
     - `generate_<gmm_holdout|biased_sample>_experiments.jl`: a copy of the file that generated the experiments
     - `<config_eval_part_1.jl|config_eval_part_2_qss.jl>`: a copy of the config file used to generate the experiments
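To inspect what was generated, the serialized experiment collection can be loaded back into Julia. A minimal sketch, assuming an experiment directory named `eval_part_1` and that the file was written with Julia's standard `Serialization` module (the path is illustrative):

```julia
# Sketch: load and inspect the serialized experiment collection.
using Serialization
experiments = deserialize("data/output/eval_part_1/experiments.jser")
println(length(experiments))       # number of experiment configurations
println(keys(first(experiments)))  # parameter keys of one experiment Dict
```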
3. Run Experiments: This step executes the experiments created in Step 2. Each experiment is executed on a worker. In the default configuration, a worker is one process on the localhost. For distributed workers, see the section Infrastructure and Parallelization. A worker takes one specific configuration, runs the active learning experiment, and writes result and log files.
   - Input: Generated experiments from Step 2, full path to the high-level config `scripts/config/config.jl`
   - Execution: `$ julia --project scripts/run_experiments.jl $(DIR)/scripts/config/config.jl`
   - Output: The output files are named by the experiment hash and are `.json` files (e.g., `data/output/eval_part_1/results/data/gmm_holdout_1_seed_2_dim_3_gaussians_1_num_gaussians_train_DecisionBoundaryQss_SVDDneg_16283024028153567650.json`).
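Individual raw results can be inspected directly. A minimal sketch, assuming the JSON.jl package is installed; the hash-named file is a placeholder for any file in your `results` directory:

```julia
# Sketch: load one raw result file (replace the placeholder with a real file name).
using JSON
result = JSON.parsefile("data/output/eval_part_1/results/data/<experiment_hash>.json")
println(keys(result))  # top-level sections of the result
```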
4. Reduce Results: Merges an experiment directory into one `.csv` by using summary statistics.
   - Input: Full path to finished experiments
   - Execution: `$ julia --project scripts/reduce_results.jl </full/path/to/data/output>`
   - Output: A result csv file, `data/output/output.csv`.
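The reduced file can then be loaded for ad-hoc analysis. A minimal sketch, assuming CSV.jl and DataFrames.jl are available:

```julia
# Sketch: load the reduced results into a data frame.
using CSV, DataFrames
df = CSV.read("data/output/output.csv", DataFrame)
first(df, 5)  # peek at the first rows
```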
5. Analyze Results: jupyter notebooks in the `notebooks` directory analyze the reduced `.csv`. Run the following to produce the figures and tables in the experiment section of the paper:

   ```
   $ pipenv run evaluation
   ```
Step 3 (Run Experiments) can be parallelized over several workers. In general, one can use any ClusterManager. In this case, the node that executes `run_experiments.jl` is the driver node. The driver node loads the `experiments.jser` and initiates a function call for each experiment on one of the workers via `pmap`.
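The following is a minimal, self-contained sketch of that driver pattern with local workers; `run_single_experiment` is a hypothetical stand-in for the repository's actual worker function:

```julia
# Sketch of the driver pattern: distribute experiments over workers via pmap.
using Distributed
addprocs(2)  # start two local worker processes

# Hypothetical placeholder for the per-experiment workload.
@everywhere run_single_experiment(experiment) = experiment[:id]^2

experiments = [Dict(:id => i) for i in 1:10]        # stand-in for the deserialized experiments
results = pmap(run_single_experiment, experiments)  # one call per experiment, on some worker
```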
This package is developed and maintained by Adrian Englhardt.