This repository contains the Python code for applying different k-anonymisation algorithms, i.e., Optimal Lattice Anonymization (OLA), Mondrian, Top-Down Greedy Anonymisation (TDG), and k-NN Clustering-Based (CB) Anonymisation, to datasets and for measuring their effects on Machine Learning (ML) classifiers, as presented in the paper "k-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers".
@article{slijepvcevic2021k,
title={k-Anonymity in practice: How generalisation and suppression affect machine learning classifiers},
author={Slijep{\v{c}}evi{\'c}, Djordje and Henzl, Maximilian and Klausner, Lukas Daniel and Dam, Tobias and Kieseberg, Peter and Zeppelzauer, Matthias},
journal={Computers \& Security},
volume={111},
pages={102488},
year={2021},
publisher={Elsevier}
}
To install the necessary requirements, use either pipenv install or pip3 install -r requirements.txt.
Then activate the virtual environment, e.g. with pipenv shell.
The code is written in Python 3 and performs the following steps for each experiment:
- read specified dataset
- measure specified ML algorithm performance using original dataset
- anonymise dataset with specified algorithm and current value k for k-anonymity
- measure specified ML algorithm performance using anonymised dataset
- repeat previous steps for other configured values of k
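The steps above can be sketched as a small driver loop. This is only an illustration and not the repository's code: the function name run_experiment and its evaluate and anonymise parameters are hypothetical stand-ins for the configured ML classifier and k-anonymisation algorithm.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def run_experiment(
    rows: Sequence[Row],
    evaluate: Callable[[Sequence[Row]], float],
    anonymise: Callable[[Sequence[Row], int], List[Row]],
    k_values: Sequence[int],
) -> Dict[str, float]:
    """Hypothetical driver that mirrors the listed steps."""
    # Measure performance on the original dataset first.
    results = {"original": evaluate(rows)}
    # Repeat anonymisation and measurement for every configured k.
    for k in k_values:
        anonymised = anonymise(rows, k)
        results[f"k={k}"] = evaluate(anonymised)
    return results


def coarsen_age(rows: Sequence[Row], k: int) -> List[Row]:
    # Stand-in only: rounds ages down to a multiple of k.
    # It does NOT guarantee k-anonymity.
    return [{"age": int(r["age"]) // k * k} for r in rows]


def distinct_ages(rows: Sequence[Row]) -> float:
    # Stand-in "score": the number of distinct age values left in the data.
    return float(len({r["age"] for r in rows}))


if __name__ == "__main__":
    data = [{"age": a} for a in (23, 27, 31, 38, 45)]
    print(run_experiment(data, distinct_ages, coarsen_age, k_values=[5, 10]))
```

Passing the classifier and the anonymiser as callables keeps the loop independent of which of the four algorithms or classifiers is selected.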
The parameters, i.e., dataset, ML algorithm, k-anonymisation algorithm, and k, are set via command-line arguments as follows:
usage: baseline_with_repetitions.py [-h] [--start-k START_K] [--stop-k STOP_K] [--step-k STEP_K] [--debug] [--verbose] [{cmc,mgm,adult,cahousing}] [{rf,knn,svm,xgb}] {mondrian,ola,tdg,cb} ...
Anonymize data utilising different algorithms and analyse the effects of the anonymization on the data
positional arguments:
  {cmc,mgm,adult,cahousing}
                        the dataset used for anonymization
  {rf,knn,svm,xgb}      machine learning classifier
  {mondrian,ola,tdg,cb}
    mondrian            mondrian anonyization algorithm
    ola                 ola anonyization algorithm
    tdg                 tdg anonyization algorithm
    cb                  cb anonyization algorithm

optional arguments:
  -h, --help            show this help message and exit
  --start-k START_K     initial value for k of k-anonymity
  --stop-k STOP_K       last value for k of k-anonymity
  --step-k STEP_K       step for increasing k of k-anonymity
  --debug, -d           enable debugging
  --verbose, -v
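How the three k-related options might translate into the sequence of k values can be sketched as follows. This is a reconstruction from the help text only: the default values are placeholders, the helper names build_k_parser and k_values are hypothetical, and treating --stop-k as inclusive is an assumption based on its description as the "last value". The positional arguments and sub-commands are left out on purpose.

```python
import argparse
from typing import List, Optional, Sequence


def build_k_parser() -> argparse.ArgumentParser:
    # Models only the k-related options shown in the help output.
    parser = argparse.ArgumentParser()
    # The defaults below are placeholders, not the script's real defaults.
    parser.add_argument("--start-k", type=int, default=2,
                        help="initial value for k of k-anonymity")
    parser.add_argument("--stop-k", type=int, default=10,
                        help="last value for k of k-anonymity")
    parser.add_argument("--step-k", type=int, default=1,
                        help="step for increasing k of k-anonymity")
    return parser


def k_values(argv: Optional[Sequence[str]] = None) -> List[int]:
    args = build_k_parser().parse_args(argv)
    # Assumption: the stop value is included in the sweep.
    return list(range(args.start_k, args.stop_k + 1, args.step_k))


if __name__ == "__main__":
    print(k_values(["--start-k", "5", "--stop-k", "20", "--step-k", "5"]))
```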
The k-anonymisation algorithms "k-NN Clustering-Based Anonymisation", "Mondrian" and "Top-Down Greedy Anonymisation", located in the folders clustering_based, basic_mondrian and top_down_greedy, are based on the open-source implementation of Qiyuan Gong.
The original repositories can be found on github.com.
Our changes include the migration from Python 2 to Python 3, the option to leave non-QID attributes and the target variable non-anonymised, the ability to handle float numbers in datasets, and the removal and cleanup of files and code that were irrelevant to our project.
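Leaving non-QID attributes and the target variable untouched means that k-anonymity only has to hold over the quasi-identifier (QID) columns. A minimal sketch of such a check, assuming rows are plain dictionaries, is given below; the function name is_k_anonymous and the column names are hypothetical and not taken from the repository.

```python
from collections import Counter
from typing import Dict, Sequence


def is_k_anonymous(
    rows: Sequence[Dict[str, object]],
    qid_columns: Sequence[str],
    k: int,
) -> bool:
    """Return True if every QID value combination occurs in at least k rows."""
    # Only the QID columns form the equivalence classes; all other columns
    # (non-QID attributes, target variable) are ignored by the check.
    counts = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(n >= k for n in counts.values())


if __name__ == "__main__":
    # Illustrative data: "age" and "zip" are generalised QIDs,
    # "income" plays the role of the non-anonymised target variable.
    rows = [
        {"age": "20-29", "zip": "10**", "income": "<=50K"},
        {"age": "20-29", "zip": "10**", "income": ">50K"},
        {"age": "30-39", "zip": "11**", "income": "<=50K"},
        {"age": "30-39", "zip": "11**", "income": "<=50K"},
    ]
    print(is_k_anonymous(rows, ["age", "zip"], k=2))
    print(is_k_anonymous(rows, ["age", "zip"], k=3))
```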
The repository contains the following locations for data:
datasets
- contains all available datasets in separate folders
generalization/hierarchies
- contains our defined generalization hierarchies per attribute and dataset
results
- contains all computed results (anonymised datasets, ML performance, etc.), stored in a separate folder structure for each experiment
paper_results
- contains the results we used for analyses and plots in our paper
figures
- contains the figures used in our paper