Addressing Algorithmic Bias in Recidivism Score Predictions
This project aims to mitigate bias in recidivism score predictions generated by the National Institute of Justice. The existing algorithms tend to exhibit bias with respect to gender and race/ethnicity, which can have profound implications for the lives of the individuals these predictions affect. We use explainable machine learning techniques to identify and reduce these biases.
Recidivism scores play a crucial role in the criminal justice system, but they can perpetuate societal biases. This project aims to develop a fair and unbiased algorithmic decision-making model, focusing on gender and racial/ethnic groups.
- Bias Mitigation Techniques: Utilizing cutting-edge machine learning techniques to identify and mitigate biases in recidivism score predictions.
- Fairness Evaluation Metrics: Implementing rigorous fairness evaluation metrics to assess the model's performance across different demographic groups.
- Transparency and Explainability: Prioritizing transparency in the model's decision-making process and providing explanations for predictions to enhance accountability.
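As a rough illustration of what evaluating fairness across demographic groups can look like, the sketch below computes per-group selection rate, true positive rate, and false positive rate, then reports the largest gap between groups. It is a minimal example under stated assumptions, not the evaluation code used in this project: the function names `group_rates` and `max_gap`, the group labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR, and FPR for binary labels (0/1)."""
    counts = defaultdict(
        lambda: {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0}
    )
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["pred_pos"] += yp
        if yt == 1:
            c["pos"] += 1
            c["tp"] += yp  # predicted positive and actually positive
        else:
            c["neg"] += 1
            c["fp"] += yp  # predicted positive but actually negative
    rates = {}
    for g, c in counts.items():
        rates[g] = {
            "selection_rate": c["pred_pos"] / c["n"],
            "tpr": c["tp"] / c["pos"] if c["pos"] else float("nan"),
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
        }
    return rates


def max_gap(rates, metric):
    """Largest between-group difference for one metric."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)


# Toy data (hypothetical): 1 = predicted/observed recidivism, 0 = not.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
parity_gap = max_gap(rates, "selection_rate")  # demographic parity difference
tpr_gap = max_gap(rates, "tpr")                # equal opportunity difference
fpr_gap = max_gap(rates, "fpr")
```

A gap near zero on a metric suggests the groups are treated similarly by that measure. Which metric matters most depends on the application, and different fairness criteria generally cannot all be satisfied at once.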
- Data: All datasets used in this study are in the `data` folder. The primary datasets carry the prefix `NIJ_s_Recidivism_Challenge` and comprise three test datasets and one training dataset. `Recidivism_Full_Dataset.csv` consolidates these datasets and adds a column specifying whether each entry belongs to the training or test set. Two cleaned versions of the dataset are also available, `Recidivism_Data_Cleaned.csv` and `Recidivism_Full_Dataset_cleaned_shreshth.csv`; these are the output of the preprocessing steps.
- Analysis: All models, developed in Jupyter notebooks, are available in the `Modeling` folder:
  - Data Preparation: Refer to `data_prep.ipynb` for the steps used to clean and prepare the dataset.
  - Model Development: Explore `Models.ipynb` for a detailed walkthrough of developing and evaluating machine learning models such as Logistic Regression, KNN, SVM, XGBoost, and CatBoost.
  - Neural Network Models: Refer to `nn_model_all.ipynb` for the MLP neural network models.
  - Logistic Regression Table: Execute `Logistic Regression Table.do` in Stata for detailed logistic regression results in tabular format, used to evaluate the impact of different variables on the odds of recidivism.
  - CatBoost Model: CatBoost gave the best performance. Use `/Modeling/trained_model/best_classifier_CatBoost.joblib` for the best trained CatBoost model.
  - CatBoost Information: Additional information related to the CatBoost algorithm is in the `/Modeling/catboost_info` folder.
  - Trained Models: Stored trained models are in `/Modeling/trained_models/`.
  - Raw model runs: Training steps and predictions are in the `/Modeling/tacc/` folder.
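The Data entry above notes that the consolidated dataset has a column marking each row as training or test. The sketch below shows one way to split rows on such a column using only the standard library. The column name `dataset_split`, its values `train` and `test`, the helper `split_by_set`, and the sample rows are all assumptions made for illustration; check the actual header of `Recidivism_Full_Dataset.csv` before adapting it.

```python
import csv
import io


def split_by_set(rows, set_column="dataset_split", train_label="train"):
    """Split dict-like rows into (train, test) lists using a marker column.

    The column name and label are hypothetical defaults, not taken from the
    real dataset.
    """
    train, test = [], []
    for row in rows:
        is_train = row[set_column].strip().lower() == train_label
        (train if is_train else test).append(row)
    return train, test


# In-memory stand-in for the real CSV file (made-up columns and values).
sample_csv = io.StringIO(
    "ID,Gender,dataset_split\n"
    "1,M,train\n"
    "2,F,test\n"
    "3,M,train\n"
)

train_rows, test_rows = split_by_set(csv.DictReader(sample_csv))
```

To run this against the real file, replace the `io.StringIO` object with an open file handle for the CSV in the `data` folder and pass the actual column name. In a notebook workflow, the same split is usually a one-line boolean filter in pandas.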
Feel free to contribute, report issues, or suggest improvements. We welcome collaboration to enhance the robustness of recidivism prediction models.
This project was undertaken as a requirement for the Applied Machine Learning course offered by the Department of Electrical and Computer Engineering at the University of Texas at Austin, under the guidance of Professor Ghosh. The knowledge and skills gained in this course have been instrumental in the successful execution of this study.