Python code (tested on Python 2.7) for Bayesian approaches to distribution regression; the details are described in the following paper:
H. Law\*, D. Sutherland\*, D. Sejdinovic, S. Flaxman. Bayesian Approaches to Distribution Regression. In Artificial Intelligence and Statistics (AISTATS), 2018. (arXiv)
* denotes equal contribution
To set up as a package, clone the repository and run
python setup.py develop
This package also requires TensorFlow (tested on v1.4.1) to be installed.
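As a quick sanity check that both the package and TensorFlow are importable, something like the following should work (a minimal sketch; `bdr` is assumed here to be the package name installed by `setup.py`):

```python
# Minimal post-install sanity check; assumes `python setup.py develop`
# installed the package under the name `bdr`.
import tensorflow as tf
import bdr

print(tf.__version__)  # tested on 1.4.1
```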
The directory is organised as follows:
- bdr: contains the main code, including data scripts
- experiment_code: contains the API code for all the networks
- experiment_config: contains the experimental setup/configurations described in the paper
- results: contains the results of the experiments described in the paper, and the corresponding plotting functions for them
There are two main datasets used in this paper:
- gamma synthetic data: simulated by `bdr/data/toy.py`, with the seed provided in `experiment_code` (see the sketch after this list)
- IMDb-WIKI features: features of celebrity images taken from the output layer of a VGG-16 CNN architecture
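For intuition, here is a minimal, hypothetical sketch of what generating gamma synthetic bags might look like. The bag-level parameter range, bag size, and choice of label below are illustrative assumptions only; the actual generator is `bdr/data/toy.py`, and the seed used in the paper is set in `experiment_code`:

```python
import numpy as np

# Illustrative sketch only: the parameter range, bag size, and label choice
# are assumptions, not the repo's actual settings (see bdr/data/toy.py).
rng = np.random.RandomState(0)  # placeholder seed; the real one is in experiment_code

def make_gamma_bags(n_bags=100, bag_size=50):
    shapes = rng.uniform(1.0, 8.0, size=n_bags)  # latent shape parameter per bag
    bags = [rng.gamma(shape=a, scale=1.0, size=bag_size) for a in shapes]
    labels = shapes  # mean of Gamma(a, 1) is a; regress each bag onto its mean
    return bags, labels

bags, labels = make_gamma_bags()
print(labels[:3])
```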
The IMDb-WIKI features can be downloaded by running
bdr/data/imdb_faces/all-feats/get.sh
bdr/data/imdb_faces/grouped-4/get.sh
The Python notebook `/bdr/data/imdb_faces/group_data.ipynb` documents the data cleaning, analysis, and grouping process.
The main API can be found in `/experiment_code`, where:
- `train_test.py`: contains API code for the RBF network, shrinkage, shrinkageC, and also the Fourier features network
- `blr.py`: contains API code for Bayesian linear regression
- `chi2_make_optimal.py`: contains API code for the Bayes-optimal solution for the varying bag size experiment
- `bdr-landmarks.R`: contains API code for the BDR MCMC algorithm; note that this is in R
See "BDR: Reproducing the experiments" below for a discussion of `bdr-landmarks.R` and how to use it. The rest of the API uses the `argparse` package, i.e. parameters can be specified on the command line. To bring up the manual (and see the default options), run:
python train_test.py --help
An example of constructing a shrinkage network and training it on simulated data using `argparse` is shown here:
python train_test.py chi2 -n shrinkage --learning-rate 0.001 --n-landmarks 50 /folder/to/save/to
This would train a shrinkage network on simulated data with a learning rate of 0.001 and 50 landmarks, and save the results to `/folder/to/save/to`. Any parameters not specified fall back to their default options. Likewise, `blr.py` and `chi2_make_optimal.py` can be used in a similar fashion. Note also that, by default, it will parallelise unless specified otherwise.
There are 3 main experiments in the paper, namely "Varying bag size: Uncertainty in the inputs", "Fixed bag size: Uncertainty in the regression model", and "IMDb-WIKI: Age Estimation". The experimental setup and the exact grid for each of these experiments can be found in `/experiment_config`. Note that each network on the IMDb-WIKI dataset takes roughly 4 hours to train (depending on the parameters) on four E5-2690 v4 @ 2.60GHz CPUs.
The `/results` folder contains the results of the models that performed best on the validation set for our experiments; each experiment's results folder also contains corresponding notebooks for baseline or plotting purposes.
For the BDR algorithm, since it is a full MCMC algorithm, we make use of RStan instead of TensorFlow in Python.
To reproduce the experiments on the gamma synthetic data, we first need to export the data manually so that it can then be accessed from R. This can be done using `/utilities/export_toy_stan.py` and turning on the necessary options. The data will be saved in a directory called `stan_data`, where each subdirectory contains CSV files for the train, validation, and test sets, as well as the corresponding landmarks.
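For orientation, here is a rough sketch of reading one exported directory back into Python. The file names below are assumptions; the exact layout is determined by `/utilities/export_toy_stan.py`:

```python
import os
import numpy as np

def load_stan_export(directory):
    # Hypothetical loader: assumes one CSV per split plus a landmarks file;
    # check the output of export_toy_stan.py for the actual file names.
    out = {}
    for name in ("train", "val", "test", "landmarks"):
        path = os.path.join(directory, name + ".csv")
        if os.path.exists(path):
            out[name] = np.loadtxt(path, delimiter=",")
    return out

data = load_stan_export("stan_data/chi2")  # the "chi2" subdirectory is illustrative
```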