Anomaly Detection in Credit Card Fraud Transactions

Project Overview

Credit card fraud detection is a critical task for financial institutions. This project builds a system that identifies fraudulent transactions in a dataset where such cases are rare (about 0.17% of all transactions). It fits a multivariate normal distribution to legitimate transactions and flags as fraudulent any transaction whose probability density falls below a tuned threshold.

Dataset

The dataset used in this project contains credit card transactions made by European cardholders in September 2013. It is highly imbalanced, with only 0.172% of transactions being fraudulent.

Download the dataset from Kaggle

Installation

Clone this repository:

git clone https://github.com/your-username/credit-card-fraud-detection.git
cd credit-card-fraud-detection

Install the required libraries:

pip install numpy pandas matplotlib seaborn scikit-learn tqdm psutil

Download the dataset from Kaggle and place it in the project directory.

Project Steps

1. Data Preprocessing

Dataset Loading: Load the dataset and split it into two main categories: legitimate transactions (Class = 0) and fraudulent transactions (Class = 1).
Splitting Data: The legitimate transactions are further divided into training, validation, and testing sets. The fraudulent data is split into validation and testing sets only.
Merging Validation and Testing Sets: Combine the legitimate and fraudulent data for validation and testing.
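The split described above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name split_transactions, the 80/10/10 split of legitimate transactions, and the even split of fraudulent ones are assumptions. Only the Class column comes from the dataset.

```python
import pandas as pd


def split_transactions(df, train_frac=0.8, seed=0):
    """Return (train, val, test); train holds legitimate rows only.

    The fractions are illustrative assumptions, not the project's values.
    """
    # Shuffle each class separately so the splits are random but reproducible.
    legit = df[df["Class"] == 0].sample(frac=1.0, random_state=seed)
    fraud = df[df["Class"] == 1].sample(frac=1.0, random_state=seed)

    # Legitimate rows: training, then the remainder halved into val/test.
    n_train = int(len(legit) * train_frac)
    n_val = (len(legit) - n_train) // 2
    train = legit.iloc[:n_train]
    legit_val = legit.iloc[n_train:n_train + n_val]
    legit_test = legit.iloc[n_train + n_val:]

    # Fraudulent rows: validation and testing only.
    half = len(fraud) // 2
    val = pd.concat([legit_val, fraud.iloc[:half]])
    test = pd.concat([legit_test, fraud.iloc[half:]])
    return train, val, test
```

Keeping fraud out of the training set means the density model only ever sees legitimate behavior, which is what makes low density a usable fraud signal.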

2. Feature Engineering

Time Feature Transformation: Decompose the Time feature into Day, Hour, Minute, and Second to extract more meaningful patterns.
Amount Transformation: Apply a log transformation to the Amount feature to reduce skewness and stabilize variance.
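A possible implementation of these two transformations is sketched below. It assumes that Time holds seconds elapsed since the first transaction in the dataset, and it uses log(1 + Amount) so that zero-amount transactions stay finite. The choice of log1p and the function name add_time_and_amount_features are assumptions, not confirmed details of the notebook.

```python
import numpy as np
import pandas as pd


def add_time_and_amount_features(df):
    """Add Day, Hour, Minute, Second and Amount_transformed columns."""
    out = df.copy()
    seconds = out["Time"].astype(int)  # seconds since the first transaction

    out["Day"] = seconds // 86400            # 86400 seconds per day
    out["Hour"] = (seconds % 86400) // 3600  # hour within the day
    out["Minute"] = (seconds % 3600) // 60   # minute within the hour
    out["Second"] = seconds % 60             # second within the minute

    # Assumption: log1p instead of a plain log, because Amount can be 0.
    out["Amount_transformed"] = np.log1p(out["Amount"])
    return out
```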

3. Data Visualization

Histogram and KDE Plots: Visualize the distribution of key features like Time, Hour, and Amount_transformed to understand the underlying data patterns.

[Figures: Time and Hour distributions; Amount and Transformed_amount distributions; feature selection]

4. Model Building

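Multivariate Normal Distribution: Fit a multivariate normal distribution to the training data by calculating the mean and standard deviation of each selected feature. Estimating only per-feature parameters treats the features as independent, which corresponds to a diagonal covariance matrix.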
Probability Density Calculation: Calculate the joint probability density function for the features, and classify transactions as fraudulent if their density falls below a certain threshold.
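As a rough sketch of this step, the code below fits per-feature means and standard deviations and evaluates the joint density as a product of univariate normal densities. It works in log space, which is an assumption made here for numerical stability: densities on the order of 10^-19 are easier to handle as logs, and comparing log p(x) against log(threshold) gives the same classification. The function names are hypothetical.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate the per-feature mean and standard deviation from training data."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)  # population standard deviation (ddof=0)
    return mu, sigma


def log_density(X, mu, sigma):
    """Log of the joint density, assuming independent normal features."""
    z = (X - mu) / sigma
    # Sum of per-feature log densities equals the log of their product.
    per_feature = -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    return per_feature.sum(axis=1)


def predict_fraud(X, mu, sigma, log_threshold):
    """Flag rows whose log density falls below the log threshold (1 = fraud)."""
    return (log_density(X, mu, sigma) < log_threshold).astype(int)
```

A feature with zero variance in the training data would make sigma zero and break the division, so such features would need to be dropped or regularized first.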

5. Threshold Tuning

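Tuning the Threshold: Iterate over a range of candidate threshold (alpha) values and select the one that maximizes the F2-score on the validation set. The F2-score weights recall more heavily than precision, so it prioritizes reducing false negatives (missed fraud), which matter most in such an imbalanced dataset.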
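The search can be sketched as a simple scan over candidate thresholds. The version below assumes the model outputs log densities and that candidates are evenly spaced between the smallest and largest validation value; the grid, the function names, and the hand-written F-beta formula are illustrative choices rather than the notebook's exact procedure.

```python
import numpy as np


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score for the positive (fraud = 1) class; beta=2 favors recall."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return float((1 + beta ** 2) * tp / denom) if denom > 0 else 0.0


def tune_threshold(log_p, y_true, n_candidates=200):
    """Return (best_alpha, best_f2) over an evenly spaced grid of thresholds."""
    candidates = np.linspace(log_p.min(), log_p.max(), n_candidates)
    best_alpha, best_score = candidates[0], -1.0
    for alpha in candidates:
        # A transaction is flagged as fraud when its log density is below alpha.
        y_pred = (log_p < alpha).astype(int)
        score = f_beta(y_true, y_pred)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score
```

Tuning on the validation set and reporting on the untouched test set keeps the reported F2-score from being optimistically biased by the threshold search.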

6. Evaluation

Confusion Matrix: Generate the confusion matrix to visualize the model's performance in classifying transactions.
Performance Metrics: Calculate accuracy, precision, recall, F1-score, F2-score, and MCC to evaluate the model.
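One way to compute these metrics is with scikit-learn, which the installation step already includes. The helper below is a sketch; the name evaluate and the dictionary layout are assumptions, while the metric functions themselves are standard sklearn.metrics calls.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    fbeta_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
)


def evaluate(y_true, y_pred):
    """Collect the confusion matrix and summary metrics (fraud = 1 is positive)."""
    return {
        # Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "f2": fbeta_score(y_true, y_pred, beta=2),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```

With about 0.17% fraud, accuracy alone is close to uninformative (predicting "legitimate" for everything already scores above 99.8%), so recall, F2, and MCC are the more telling numbers here.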

Results

The optimal threshold value was approximately 3.87 × 10^-19, resulting in an F2-score of 0.836 on the validation set and 0.815 on the test set. The confusion matrix illustrates the classification performance.

Conclusion

This anomaly detection system effectively identifies fraudulent transactions by modeling legitimate transactions using a multivariate normal distribution. Key takeaways include:

Feature Selection: The choice of meaningful features significantly impacts model performance.
Threshold Optimization: Proper tuning of the decision threshold is essential for handling imbalanced datasets.
Real-World Application: The model achieves a high F2-score (0.815 on the test set), making it a reasonable candidate for real-world fraud detection systems.

Credits

This project uses the credit card fraud dataset from Kaggle.

Feel free to contribute to this project by submitting pull requests or suggesting improvements.
