Classifier Conditional Independence Test: A CI test that uses a binary classifier (XGBoost) for CI testing
This is an implementation of the paper: https://arxiv.org/abs/1709.06138
Please cite the above paper if this package is used in any publication.
Usage for pip install
1. Install the package:
pip install CCIT==0.4
or
sudo -H pip install CCIT==0.4
2(a). Now in your python script:
from CCIT import CCIT
from CCIT import DataGen
pvalue = CCIT.CCIT(X,Y,Z) #without bootstrap
pvalue = CCIT.CCIT(X,Y,Z,num_iter = 30, bootstrap = True, nthread = 20) #with 30 bootstrap iterations and 20 threads in parallel.
2(b). If you want to test using the included DataGen module:
from CCIT import CCIT
from CCIT import DataGen
data = DataGen.generate_samples_cos(dx=1,dy=1,dz=20,sType='NI') #non-CI dataset, pvalue should be low
X = data[:,0:1]
Y = data[:,1:2]
Z = data[:,2::]
pvalue = CCIT.CCIT(X,Y,Z) #without bootstrap
pvalue = CCIT.CCIT(X,Y,Z,num_iter = 30, bootstrap = True, nthread = 20) #with 30 bootstrap iterations and 20 threads in parallel.
For the best performance, we suggest normalizing each column of the data, either by standardizing it or by rescaling its values to the range [0,1].
Note that when Z is None, it produces a pvalue for an independence test between X and Y.
It is recommended to rescale all columns of the data by their standard deviation.
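The two normalization options mentioned above could be sketched as follows. This assumes the data are NumPy arrays with one sample per row; the helper names `standardize_columns` and `minmax_columns` are illustrative and are not part of CCIT.

```python
import numpy as np

def standardize_columns(A):
    """Subtract each column's mean and divide by its standard deviation."""
    A = np.asarray(A, dtype=float)
    std = A.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled to avoid dividing by zero
    return (A - A.mean(axis=0)) / std

def minmax_columns(A):
    """Rescale each column linearly so that its values lie in [0, 1]."""
    A = np.asarray(A, dtype=float)
    lo = A.min(axis=0)
    span = A.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns are mapped to 0
    return (A - lo) / span

# Toy data standing in for a Z table (hypothetical values)
rng = np.random.default_rng(0)
Z = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
Z_std = standardize_columns(Z)
Z_01 = minmax_columns(Z)
```

The same helper would be applied separately to X, Y, and Z before passing them to the test.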
Usage for pip install from github repo
- Clone the repo.
- cd CCIT
- pip install .
- (Optional) From the root directory of the package, run the command
nosetests
This is a comprehensive test and may take some time to run.
- Now in your python script:
from CCIT import CCIT
pvalue = CCIT.CCIT(X,Y,Z)
Installing the xgboost dependency can sometimes fail. In that case, it is recommended to install xgboost first by following the steps in https://github.com/dmlc/xgboost/blob/master/python-package/build_trouble_shooting.md, and then to install CCIT from pip.
CI Tester
Functions:
- CCIT()
Main function to generate the pval of the CI test. If pval is low, CI is rejected; if it is high, we fail to reject CI.
X: Input X table
Y: Input Y table
Z: Input Z table. If None then it reverts back to Independence test between X and Y.
Optional Arguments:
max_depths: e.g. [6,10,13]. List of tree depths to try when tuning xgboost
n_estimators: e.g. [100,200,300]. List of numbers of estimators to try when tuning xgboost
colsample_bytrees: e.g. [0.8] (recommended). List of colsample_bytree values to try when tuning xgboost
nfold: n-fold cross validation
feature_selection : default 0 recommended
train_samp: -1 recommended. Number of examples out of total to be used for training.
threshold: default recommended
num_iter: Number of Bootstrap Iterations. Default 20. Recommended 30.
nthread: Number of parallel threads for running XGB. Recommended: the number of cores in the CPU. Default 8.
bootstrap: True or False. If False, then num_iter is set to 1 and one deterministic pval is output without averaging. If True, results are averaged over num_iter bootstraps and can have randomness. num_iter in this case has to be >= 20.
Output:
pvalue of the test.
tl;dr version
If the dimensions of X, Y, and Z are 1,1,2 respectively and if the first three i.i.d samples are as follows:
| X | Y | Z |
| 1.0 | 1.0 | 1.5 2.5 |
| 0.5 | 1.2 | 0.5 0.6 |
| 0.1 | 4.5 | 1.2 3.6 |
then the input is:
X = np.array([[1.0],[0.5],[0.1]])
Y = np.array([[1.0],[1.2],[4.5]])
Z = np.array([[1.5,2.5],[0.5,0.6],[1.2,3.6]])
pval = CCIT.CCIT(X,Y,Z)
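The layout above (one row per sample, one column per dimension) can be checked before running the test. The following is a small sketch that uses only NumPy; `check_ci_inputs` is a hypothetical helper written for this example and is not a CCIT function.

```python
import numpy as np

def check_ci_inputs(X, Y, Z=None):
    """Check that the tables are 2-D with matching row counts.

    Returns (n_samples, dx, dy, dz); dz is 0 when Z is None.
    """
    X, Y = np.asarray(X), np.asarray(Y)
    Z = None if Z is None else np.asarray(Z)
    for name, table in (("X", X), ("Y", Y), ("Z", Z)):
        if table is not None and table.ndim != 2:
            raise ValueError(f"{name} must be 2-D (samples x dimensions)")
    n = X.shape[0]
    if Y.shape[0] != n or (Z is not None and Z.shape[0] != n):
        raise ValueError("X, Y and Z must have the same number of rows")
    dz = 0 if Z is None else Z.shape[1]
    return n, X.shape[1], Y.shape[1], dz

# The three samples from the table above
X = np.array([[1.0], [0.5], [0.1]])
Y = np.array([[1.0], [1.2], [4.5]])
Z = np.array([[1.5, 2.5], [0.5, 0.6], [1.2, 3.6]])
shape_info = check_ci_inputs(X, Y, Z)  # (n_samples, dx, dy, dz)
```

A common mistake this catches is passing a 1-D array such as `np.array([1.0, 0.5, 0.1])` for a one-dimensional variable instead of a column of shape (3, 1).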
- CI_sampler_conditional_kNN()
Generate Test and Train set for converting CI testing into Binary Classification
Arguments:
X_in: Samples of r.v. X (np.array)
Y_in: Samples of r.v. Y (np.array)
Z_in: Samples of r.v. Z (np.array)
train_len: length of training set, must be less than number of samples
k: number of nearest neighbors to be used. Always set k = 1.
Output:
Xtrain: Features for training the classifier
Ytrain: Train Labels
Xtest: Features for test set
Ytest: Test Labels
CI_data: Developer Use only
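To illustrate how CI testing can be turned into binary classification, here is a rough sketch of a 1-nearest-neighbor bootstrap in the spirit of the paper linked at the top. It is not the package's CI_sampler_conditional_kNN: the function name `nn_bootstrap_sketch`, the three-way split of the data, and the label convention (1 for original samples, 0 for resampled ones) are assumptions made for illustration.

```python
import numpy as np

def nn_bootstrap_sketch(X, Y, Z, seed=0):
    """Build a labeled data set from original and nearest-neighbor resampled rows."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    third = n // 3
    idx_orig = perm[:third]             # kept as-is, labeled 1
    idx_u1 = perm[third:2 * third]      # rows whose Y will be swapped
    idx_u2 = perm[2 * third:3 * third]  # pool of neighbors

    # For each row in U1, find its nearest neighbor in U2, measured in Z only
    diff = Z[idx_u1][:, None, :] - Z[idx_u2][None, :, :]
    sq_dist = (diff ** 2).sum(axis=2)
    neighbor = idx_u2[sq_dist.argmin(axis=1)]

    # Keep (x, z) from U1 but take y from the neighbor with a similar z
    resampled = np.hstack([X[idx_u1], Y[neighbor], Z[idx_u1]])
    original = np.hstack([X[idx_orig], Y[idx_orig], Z[idx_orig]])

    features = np.vstack([original, resampled])
    labels = np.concatenate([np.ones(third), np.zeros(third)])
    return features, labels

# Toy data (hypothetical) in which X and Y depend on each other only through Z
rng = np.random.default_rng(1)
Z = rng.normal(size=(300, 2))
X = np.cos(Z.sum(axis=1, keepdims=True)) + 0.1 * rng.normal(size=(300, 1))
Y = np.sin(Z.sum(axis=1, keepdims=True)) + 0.1 * rng.normal(size=(300, 1))
features, labels = nn_bootstrap_sketch(X, Y, Z)
```

The idea is that swapping Y between rows with similar Z approximately preserves the relation of Y to Z while breaking any remaining link between X and Y. If a classifier trained on such data cannot separate the two labels much better than chance, that is consistent with CI.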
DataGen Module
Functions:
- generate_samples_cos()
Generate CI,I or NI post-nonlinear samples:
1. Z is independent Gaussian
2. X = cos(<a,Z> + b + noise) and Y = cos(<c,Z> + d + noise) in case of CI
Arguments:
size : number of samples
sType: CI,I, or NI
dx: Dimension of X
dy: Dimension of Y
dz: Dimension of Z
nstd: noise standard deviation
freq: Freq of cosine function
Output:
allsamples --> complete data-set
Note that:
[X = first dx coordinates of allsamples; each row is an i.i.d. sample]
[Y = [dx:dx + dy] coordinates of allsamples]
[Z = [dx+dy:dx+dy+dz] coordinates of allsamples]
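The CI case of the model described above, together with the column layout of allsamples, can be sketched with NumPy alone. This is not the code behind generate_samples_cos: the function name `cos_ci_samples_sketch`, the distributions used for a, b, c, d, the restriction to dx = dy = 1, and the omission of the freq argument are all assumptions made for this example.

```python
import numpy as np

def cos_ci_samples_sketch(size=1000, dz=20, nstd=0.5, seed=0):
    """Draw CI samples with dx = dy = 1 and return them as [X | Y | Z]."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(size, dz))                   # Z is independent Gaussian
    a, c = rng.normal(size=dz), rng.normal(size=dz)   # assumed: Gaussian weights
    b, d = rng.uniform(0.0, 2.0 * np.pi, size=2)      # assumed: uniform offsets
    # X and Y depend on Z and on independent noise only, so X is CI of Y given Z
    X = np.cos(Z @ a + b + nstd * rng.normal(size=size))
    Y = np.cos(Z @ c + d + nstd * rng.normal(size=size))
    return np.column_stack([X, Y, Z])

dx, dy, dz = 1, 1, 20
allsamples = cos_ci_samples_sketch(size=500, dz=dz)

# Split the columns as described in the note above
X = allsamples[:, :dx]
Y = allsamples[:, dx:dx + dy]
Z = allsamples[:, dx + dy:dx + dy + dz]
```

Slicing with ranges such as `[:, :dx]` rather than a single index keeps X and Y as 2-D tables even when they have one column.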
- parallel_cos_sample_gen()
Function to create several data-sets of the post-nonlinear cos transform, half of which are CI and half of which are NI, along with the correct labels. The data-sets are stored under a given folder path.
Note: the path must already exist.
For example, create a folder ../data/dim20 first.
Arguments:
nsamples: Number of i.i.d samples in each data-set
dx, dy, dz : Dimension of X, Y, Z
nstd: Noise Standard Deviation
freq: Freq. of cos function
filetype: Path to filenames. If filetype = '../data/dim20/datafile', then the files are stored in '.npy' format in folder './dim20'
and the files are named datafile0_20.npy ... datafile50_20.npy
num_data: number of data files
num_proc: number of processes to run in parallel
Output:
num_data number of datafiles stored in the given folder.
datafile.npy file that contains an array with the correct labels. If the first label is '1' then 'datafile20_0.npy' contains a 'CI' dataset.
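Because the target folder has to exist before the data-sets are written, it can help to create it from the script itself. A minimal sketch using only the standard library follows; `ensure_parent_folder` is a hypothetical helper, and a temporary directory stands in for a real location such as ../data/dim20 so that the example does not touch other folders.

```python
import os
import tempfile

def ensure_parent_folder(filetype):
    """Create the folder part of a path prefix if it is missing and return it."""
    folder = os.path.dirname(filetype)
    if folder:
        os.makedirs(folder, exist_ok=True)  # no error if it already exists
    return folder

# In practice filetype might be '../data/dim20/datafile'
base = tempfile.mkdtemp()
filetype = os.path.join(base, "data", "dim20", "datafile")
folder = ensure_parent_folder(filetype)
```

The resulting `filetype` string would then be passed as the filetype argument described above.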