
This project implements an anomaly detection system to identify fraudulent transactions in a highly imbalanced dataset. The system is designed to flag suspicious transactions based on their deviation from the expected behavior of legitimate transactions.


amMistic/Anomaly-detection-in-credit-card



Anomaly Detection in Credit Card Fraud Transactions




Project Overview

Credit card fraud detection is a critical task for financial institutions. The goal of this project is to build a system capable of identifying fraudulent transactions in a dataset where such cases are rare (about 0.17% of all transactions). We use a multivariate normal distribution to model the probability density of legitimate transactions and classify transactions whose density falls below a tuned threshold as fraudulent.


Dataset

The dataset used in this project contains credit card transactions made by European cardholders in September 2013. It is highly imbalanced, with only 0.172% of transactions being fraudulent.


Installation

  1. Clone this repository:

    git clone https://github.com/amMistic/Anomaly-detection-in-credit-card.git
    cd Anomaly-detection-in-credit-card
  2. Install the required libraries:

    pip install numpy pandas matplotlib seaborn scikit-learn tqdm psutil
  3. Download the dataset from Kaggle and place it in the project directory.


Project Steps

1. Data Preprocessing

  • Dataset Loading: Load the dataset and split it into two main categories: legitimate transactions (Class = 0) and fraudulent transactions (Class = 1).

  • Splitting Data: The legitimate transactions are further divided into training, validation, and testing sets. The fraudulent data is split into validation and testing sets only.

  • Merging Validation and Testing Sets: Combine the legitimate and fraudulent data for validation and testing.
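The split described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `split_for_anomaly_detection`, the 60/20/20 split of legitimate rows, the 50/50 split of fraudulent rows, and the synthetic demo frame are all assumptions. Only the `Class` column and the rule that training data contains no fraud come from the text.

```python
import numpy as np
import pandas as pd


def split_for_anomaly_detection(df, label_col="Class", train_frac=0.6,
                                val_frac=0.2, seed=0):
    """Train on legitimate rows only; fraud appears in validation/test only."""
    # Shuffle each class separately so the slices below are random.
    legit = df[df[label_col] == 0].sample(frac=1.0, random_state=seed)
    fraud = df[df[label_col] == 1].sample(frac=1.0, random_state=seed)

    # Legitimate rows: train / validation / test (fractions are assumed).
    n_train = int(len(legit) * train_frac)
    n_val = int(len(legit) * val_frac)
    train = legit.iloc[:n_train]
    legit_val = legit.iloc[n_train:n_train + n_val]
    legit_test = legit.iloc[n_train + n_val:]

    # Fraudulent rows: validation / test only (even split is assumed).
    half = len(fraud) // 2
    val = pd.concat([legit_val, fraud.iloc[:half]]).sample(frac=1.0, random_state=seed)
    test = pd.concat([legit_test, fraud.iloc[half:]]).sample(frac=1.0, random_state=seed)
    return train, val, test


# Synthetic stand-in for the real dataset: 990 legitimate rows, 10 fraudulent.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"V1": rng.normal(size=1000), "Class": [0] * 990 + [1] * 10})
train, val, test = split_for_anomaly_detection(demo)
```

Keeping fraud out of the training set matters here because the density model is meant to describe legitimate behavior only.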



2. Feature Engineering

  • Time Feature Transformation: Decompose the Time feature into Day, Hour, Minute, and Second to extract more meaningful patterns.

  • Amount Transformation: Apply a log transformation to the Amount feature to reduce skewness and stabilize variance.
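A minimal sketch of both transformations is shown below. It assumes that `Time` holds seconds elapsed since the first transaction in the dataset, and it uses `log1p` (log of 1 + x) rather than a plain logarithm so that zero amounts stay finite; the text above only says "log transformation", so that choice is an assumption. The function name `add_time_and_amount_features` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_time_and_amount_features(df):
    """Return a copy of df with Day/Hour/Minute/Second and a log-scaled amount."""
    out = df.copy()
    seconds = out["Time"].astype("int64")  # assumed: elapsed seconds

    out["Day"] = seconds // 86_400
    out["Hour"] = (seconds % 86_400) // 3_600
    out["Minute"] = (seconds % 3_600) // 60
    out["Second"] = seconds % 60

    # log1p(x) = log(1 + x): compresses the long right tail, defined at x = 0.
    out["Amount_transformed"] = np.log1p(out["Amount"])
    return out


demo = pd.DataFrame({"Time": [0.0, 3_661.0, 90_061.0],
                     "Amount": [0.0, 9.0, 149.62]})
features = add_time_and_amount_features(demo)
```

In the demo frame, 3,661 seconds is 1 hour, 1 minute, and 1 second into day 0, and 90,061 seconds is the same time of day on day 1.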



3. Data Visualization

  • Histogram and KDE Plots: Visualize the distribution of key features like Time, Hour, and Amount_transformed to understand the underlying data patterns.

Figure: Time - Hour

Figure: Amount - Transformed_amount

Figure: Features-Selection


4. Model Building

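  • Multivariate Normal Distribution: Fit a multivariate normal distribution to the training data by calculating the mean and standard deviation of each selected feature. Because only per-feature statistics are estimated, this corresponds to a multivariate normal with a diagonal covariance matrix, i.e. the features are treated as independent.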

  • Probability Density Calculation: Calculate the joint probability density function for the features, and classify transactions as fraudulent if their density falls below a certain threshold.
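The two steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the repository's implementation: the helper names `fit_gaussian` and `log_density` are hypothetical, and the sketch works with the log of the joint density instead of the density itself. With many features the raw density underflows toward zero, and comparing log-densities against a log-threshold gives the same classification.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate per-feature mean and standard deviation from legitimate rows."""
    return X.mean(axis=0), X.std(axis=0)


def log_density(X, mu, sigma):
    """Log of the joint density under independent per-feature Gaussians."""
    z = (X - mu) / sigma
    per_feature = -0.5 * z ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
    return per_feature.sum(axis=1)  # log of a product = sum of logs


# Synthetic "legitimate" training data with two features.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 5.0], scale=[1.0, 2.0], size=(5_000, 2))
mu, sigma = fit_gaussian(X_train)

typical = log_density(np.array([[0.0, 5.0]]), mu, sigma)
outlier = log_density(np.array([[8.0, -10.0]]), mu, sigma)
# A point far from the training mean receives a much lower log-density,
# so it would be flagged once the threshold is set between the two values.
```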



5. Threshold Tuning

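  • Tuning the Threshold: Iterate over a range of candidate threshold values (alpha) and select the one that maximizes the F2-score. The F2-score weights recall more heavily than precision, so this choice prioritizes reducing false negatives (missed fraud), which matters most in such an imbalanced dataset.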
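A threshold search of this kind might look like the sketch below. The helper names `f_beta` and `tune_threshold`, the evenly spaced candidate grid, and the toy scores are assumptions for illustration. The scores are treated as log-densities, and a transaction is flagged as fraud when its score falls below alpha.

```python
import numpy as np


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta from confusion counts; beta=2 weights recall above precision."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return 0.0 if denom == 0 else float((1 + beta ** 2) * tp / denom)


def tune_threshold(log_dens, y_true, n_candidates=200):
    """Try evenly spaced thresholds and keep the first one with the best F2."""
    candidates = np.linspace(log_dens.min(), log_dens.max(), n_candidates)
    best_alpha, best_score = candidates[0], -1.0
    for alpha in candidates:
        y_pred = (log_dens < alpha).astype(int)  # low density => fraud
        score = f_beta(y_true, y_pred)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return float(best_alpha), best_score


# Toy validation scores: four legitimate rows, two clearly anomalous ones.
log_dens = np.array([-2.0, -2.5, -3.0, -2.2, -40.0, -35.0])
y_true = np.array([0, 0, 0, 0, 1, 1])
alpha, score = tune_threshold(log_dens, y_true)
```

On real data the classes overlap, so the best F2 is below 1 and the chosen alpha reflects a trade-off between missed fraud and false alarms.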

6. Evaluation

  • Confusion Matrix: Generate the confusion matrix to visualize the model's performance in classifying transactions.

  • Performance Metrics: Calculate accuracy, precision, recall, F1-score, F2-score, and the Matthews correlation coefficient (MCC) to evaluate the model.
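For reference, all of these metrics can be derived from the four confusion-matrix counts. The sketch below is a self-contained illustration with a hypothetical `evaluate` helper and made-up labels; scikit-learn offers equivalent functions, but plain arithmetic makes the definitions explicit.

```python
import math

import numpy as np


def evaluate(y_true, y_pred):
    """Compute confusion counts and the metrics listed above (1 = fraud)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    def f_beta(beta):
        b2 = beta ** 2
        denom = b2 * precision + recall
        return (1 + b2) * precision * recall / denom if denom else 0.0

    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    return {
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
        "mcc": mcc,
    }


# Made-up labels: 3 frauds caught, 1 missed, 1 false alarm, 5 correct passes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
report = evaluate(y_true, y_pred)
```

Accuracy is included because the project reports it, but on a dataset with roughly 0.17% fraud it is dominated by the legitimate class, so recall, F2, and MCC are the more informative figures.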



Results

The optimal threshold value was approximately 3.87 x 10^-19, resulting in an F2-score of 0.836 on the validation set and 0.815 on the test set. The confusion matrix illustrates the classification performance.



Conclusion

This anomaly detection system effectively identifies fraudulent transactions by modeling legitimate transactions using a multivariate normal distribution. Key takeaways include:

  • Feature Selection: The choice of meaningful features significantly impacts model performance.
  • Threshold Optimization: Proper tuning of the decision threshold is essential for handling imbalanced datasets.
  • Real-World Application: The model achieves an F2-score of 0.815 on the test set, which suggests the approach is a reasonable starting point for real-world fraud detection systems.

Credits

This project uses the credit card fraud dataset from Kaggle.

Feel free to contribute to this project by submitting pull requests or suggesting improvements.

