UCI_classification

Loading utilities for UCI classification datasets (CSV files of dataset metadata included).

Setup/Installation

Method 1: download then install

Navigate to the root directory of this repository, then run the following in a shell.

pip install .

Method 2: pip install from github

pip install git+https://github.com/YutongWangUMich/UCI_classification.git

Functionalities/Usage examples

Load a dataset

import UCI_classification as uci
Xtrn, ytrn = uci.datasets.load_trn('iris', return_X_y=True)

Get metadata of datasets

import UCI_classification as uci
all_datasets = uci.datasets.get_data_list('all')
uci.datasets.get_metadata(all_datasets)

Output:

           dataset_name  n_samples  n_train  n_test  n_features  n_classes  L2_dist_est
0               abalone       4177     3133    1044           8          3     8.669142
1    acute-inflammation        120       90      30           6          2    12.625558
2       acute-nephritis        120       90      30           6          2    12.537478
3                 adult      48842    32561   16281          14          2    23.300632
4             annealing        898      798     100          31          5    49.967534
..                  ...        ...      ...     ...         ...        ...          ...
116                wine        178      134      44          13          3    25.647811
117    wine-quality-red       1599     1199     400          11          6    16.271732
118  wine-quality-white       4898     3674    1224          11          7    18.127150
119               yeast       1484     1113     371           8         10     9.578626
120                 zoo        101       76      25          16          7    34.207623

[121 rows x 7 columns]

Note: see the section "On L2_dist_est" below regarding the column L2_dist_est.
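The printed output above looks like a pandas DataFrame, so the metadata can presumably be filtered with ordinary boolean indexing. The sketch below assumes that return type (it is not confirmed here) and, to stay self-contained, builds a small frame from five rows copied from the output instead of calling get_metadata. The names meta and small_multiclass are illustrative only.

```python
import pandas as pd

# Rows copied from the printed output above. In practice this frame would come
# from uci.datasets.get_metadata(all_datasets), ASSUMED to return a DataFrame.
meta = pd.DataFrame(
    [
        ("abalone", 4177, 3133, 1044, 8, 3),
        ("adult", 48842, 32561, 16281, 14, 2),
        ("wine", 178, 134, 44, 13, 3),
        ("yeast", 1484, 1113, 371, 8, 10),
        ("zoo", 101, 76, 25, 16, 7),
    ],
    columns=["dataset_name", "n_samples", "n_train", "n_test",
             "n_features", "n_classes"],
)

# Keep small multiclass datasets: fewer than 1000 samples, at least 3 classes.
small_multiclass = meta[(meta["n_samples"] < 1000) & (meta["n_classes"] >= 3)]
selected_names = small_multiclass["dataset_name"].tolist()
print(selected_names)
```

A list of names selected this way could then be passed to the loading functions one dataset at a time.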

Files and directories

Overview of the content in the UCI_classification module subdirectory:

  • UCI_classification/data_py.zip: archive of the 121 UCI classification datasets originally downloaded from http://www.bioinf.jku.at/people/klambauer/data_py.zip hosted by Dr. Klambauer. These 121 UCI datasets were originally analyzed in [Fernández-Delgado et al., 2014].
  • UCI_classification/data_lists/*.txt: curated sublists of datasets studied in the literature.
  • UCI_classification/metadata/*.csv: metadata about each of the datasets.

Raw data files

The datasets are stored in the directory UCI_classification/data/, which is unzipped from UCI_classification/data_py.zip after installation.

The format is the same as in the original compilation in https://github.com/bioinf-jku/SNNs and is shown below:

data/
├── DATASETNAME/
│   ├── DATASETNAME_py.dat           # feature vectors
│   ├── folds_py.dat                 # the train and test split
│   ├── labels_py.dat                # labels (0, 1, ..., num_classes - 1)
│   └── validation_folds_py.dat      # 4-fold splits of the training set
└── ...

where DATASETNAME stands for abalone, iris, etc.
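For readers who want to inspect the raw files directly instead of going through the loading utilities, the layout above can be read with a few lines of standard-library Python. This is only a sketch under stated assumptions: the .dat files are taken to be comma-separated numeric text with one sample per line, and labels_py.dat is taken to hold one label per line. Neither is confirmed here, so the delimiter is left as a parameter. The helpers read_matrix and load_raw are hypothetical and not part of the package.

```python
import csv
from pathlib import Path


def read_matrix(path, delimiter=","):
    """Read a delimited numeric text file into a list of rows of floats."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return [[float(value) for value in row] for row in reader if row]


def load_raw(data_dir, name, delimiter=","):
    """Return (features, labels, folds) for the dataset directory data_dir/name.

    ASSUMPTIONS: comma-separated files, one sample per line, and one label
    per line in labels_py.dat. The fold file is returned uninterpreted.
    """
    folder = Path(data_dir) / name
    features = read_matrix(folder / f"{name}_py.dat", delimiter)
    label_rows = read_matrix(folder / "labels_py.dat", delimiter)
    labels = [int(row[0]) for row in label_rows]
    folds = read_matrix(folder / "folds_py.dat", delimiter)
    if len(features) != len(labels):
        raise ValueError(f"feature and label counts differ for {name!r}")
    return features, labels, folds
```

A call such as load_raw("UCI_classification/data", "iris") would then return the three pieces for one dataset, provided the assumptions about the file format hold.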

Curated lists of datasets

Lists of datasets

data_list/
├── all.txt                          # the list of all datasets
├── arora2019harnessing.txt          # 90 datasets used in [Arora et al, 2019]
├── fathony2016adversarial.txt       # 12 datasets used in [Fathony et al, 2016]
└── testing.txt                      # three datasets: [iris, ionosphere, wine]. Useful for testing.

The datasets in fathony2016adversarial.txt, with the names used in [Fathony et al, 2016] alongside the names used in this repository:

    name in paper  name in this repository   # class  # samp  # trn  # tst  # feat
 1  iris           iris                            3     150    105     45       4
 2  glass          glass                           6     214    149     65       9
 3  redwine        wine-quality-red               10    1599   1119    480      11
 4  ecoli          ecoli                           8     336    235    101       7
 5  vehicle        statlog-vehicle                 4     846    592    254      18
 6  segment        image-segmentation              7    2310   1617    693      19
 7  sat            statlog-landsat                 7    6435   4435   2000      36
 8  optdigits      optical                        10    5620   3823   1797      64
 9  pageblocks     page-blocks                     5    5473   3831   1642      10
10  libras         libras                         15     360    252    108      90
11  vertebral      vertebral-column-3clases        3     310    217     93       6
12  breasttissue   breast-tissue                   6     106     74     32       9

The numbers in columns # trn and # tst are taken directly from Table 1 of Fathony et al. 2016 and do not match the train-test split in this repository. See Fathony et al. 2016 Section 4 for details on their experimental setup.
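To make the mismatch concrete, the short calculation below compares the training fraction for wine-quality-red under the two splits. Both pairs of counts are copied from this README: the redwine row of the Fathony et al. table and the wine-quality-red row of the metadata output. The helper train_fraction is introduced here for illustration only.

```python
def train_fraction(n_train, n_test):
    """Fraction of all samples that fall in the training set."""
    return n_train / (n_train + n_test)


# (n_train, n_test) for wine-quality-red, copied from the tables in this README.
fathony_split = (1119, 480)     # "redwine" row, Table 1 of Fathony et al. 2016
repository_split = (1199, 400)  # wine-quality-red row of the metadata output

print(f"Fathony et al.:  {train_fraction(*fathony_split):.2f}")     # 0.70
print(f"this repository: {train_fraction(*repository_split):.2f}")  # 0.75
```

For this dataset the paper's split trains on roughly 70% of the samples, while the split shipped here trains on roughly 75%.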

Metadata about the datasets metadata/

metadata/
├── datasets_summary.csv              # a summary table of essential information about each dataset
└── L2_dist_est.csv                   # precomputed values of L2_dist_est, see code chunk below and [Shankar et al 2020].

On L2_dist_est

This is a dataset-level numerical value originally computed in Shankar et al 2020. The computation is performed as follows:

dist_est = kernel.est_dist(x_train, 1000)
# See https://github.com/modestyachts/neural_kernels_code/blob/0202718ce8da87f7c1682a6fd87f0caeeaba0859/UCI/UCI.py#L80
# The function est_dist is from 
# https://github.com/modestyachts/neural_kernels_code/blob/0202718ce8da87f7c1682a6fd87f0caeeaba0859/UCI/kernel.py
