Fraudulent Transaction Detection on Credit Card Case

Fitrie Ratnasari
8 min read · Aug 25, 2021

The objective of this project is to detect anomalous (fraudulent) credit card transactions by applying various machine/deep learning models and examining which performs best.

Dataset Description & Initial Planning

Credit Card Illustration

The dataset comes from Kaggle in CSV format. It contains transactions made with credit cards by European cardholders in September 2013. The transactions occurred over two days, with 492 detected frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions.

It contains only numeric input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided.

Features V1 through V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature ‘Amount’ is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature ‘Class’ is the response variable and takes the value 1 in case of fraud and 0 otherwise, as can be seen in more detail below:
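A minimal sketch of this inspection step with pandas (the file name creditcard.csv is assumed from the Kaggle download; the original notebook output is not reproduced here):

```python
import pandas as pd

# Load the Kaggle credit card dataset (file name assumed to be creditcard.csv)
df = pd.read_csv("creditcard.csv")

# Data types and non-null counts for the 31 columns: Time, V1..V28, Amount, Class
df.info()

# Class distribution: 0 = normal, 1 = fraud (492 of 284,807 rows, about 0.172%)
print(df["Class"].value_counts())
print(df["Class"].value_counts(normalize=True) * 100)
```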

As stated above, the objective of this project is to detect anomalous credit card transactions. Since only 0.172% of all transactions are fraudulent, we must choose evaluation metrics that are not biased by the class imbalance; in this case, Recall and AUC (area under the ROC curve) will determine which of the examined models performs best.

The initial plan, before any modelling, is to examine the data types and the dataset as a whole to check whether they are appropriate, determine what kind of data cleansing is needed, and perform Exploratory Data Analysis before moving on to any machine learning or neural network model.

Data Preparation

After acquiring the dataset, we see that its quality is very good (there are no missing values or mistyped entries), so it does not require any data cleansing.

The only pre-processing used is feature scaling, one of the most important steps in many machine learning pipelines: it normalises the dispersion of the data points, reduces the impact of outliers, and makes it easier for classification algorithms to draw decision boundaries. In this project we use scikit-learn's Standard Scaler to normalise.
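A sketch of this step, assuming a stratified train/test split (the split ratio and random seed are assumptions, as the original notebook is not shown here):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Separate features and target
X = df.drop(columns=["Class"])
y = df["Class"]

# Stratified split keeps the 0.172% fraud ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```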

Exploratory Data Analysis

The bar chart below shows that the dataset is highly imbalanced.

Imbalanced Dataset
At what transaction amounts is fraud more likely?

Taking a closer look at the amounts at which fraud tends to occur, we can see that fraudulent transactions cluster around $300 to $400 and around $750.
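A sketch of how these two plots can be produced with pandas and matplotlib (the exact plotting code of the original post is not shown):

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Bar chart of the two classes: the fraud bar is barely visible next to the normal one
df["Class"].value_counts().plot(kind="bar", ax=ax1)
ax1.set_title("Class distribution (0 = normal, 1 = fraud)")

# Distribution of transaction amounts for fraudulent transactions only
df.loc[df["Class"] == 1, "Amount"].plot(kind="hist", bins=50, ax=ax2)
ax2.set_title("Amount of fraudulent transactions")

plt.tight_layout()
plt.show()
```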

Machine Learning

Here we examine six different model combinations to detect anomalies in credit card transactions:

Dummy Classifier
Dummy Classifier is a classifier that makes predictions using simple rules.
This classifier is useful as a simple baseline against which to compare other (real) classifiers; it is not meant to be used on real problems.
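A minimal sketch using scikit-learn's DummyClassifier and the scaled split from the pre-processing step (the strategy is an assumption; a most-frequent baseline reproduces the scores reported below):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score, roc_auc_score

# Baseline that always predicts the majority class (strategy is an assumption)
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

y_pred = dummy.predict(X_test)
y_score = dummy.predict_proba(X_test)[:, 1]

print("Recall:", recall_score(y_test, y_pred))   # 0.0 — it never predicts fraud
print("ROC AUC:", roc_auc_score(y_test, y_score))  # 0.5 — no better than chance
```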

The result of this method is as follows: the Recall Score is 0 and the AUC is 0.5.

ROC AUC for Dummy Classifier

Vanilla Logistic Regression

Vanilla Logistic Regression means simple Logistic Regression: a variation of Linear Regression that is useful when the observed dependent variable ‘y’ is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables. Briefly, Logistic Regression passes the input through the logistic/sigmoid function and treats the result as a probability. This case uses the liblinear solver.
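A sketch of this model with scikit-learn, again using the scaled train/test split from the pre-processing step (apart from the liblinear solver mentioned in the text, the hyperparameters are left at their defaults as an assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score

# Plain logistic regression with the liblinear solver
logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
y_score = logreg.predict_proba(X_test)[:, 1]

print("Recall:", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```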

The result of this method is as follows: the Recall Score is 0.53 and the AUC is 0.96.

ROC AUC Curve for Vanilla Logistic Regression

Logistic Regression with Ridge Penalty

Another model we use is Logistic Regression with an L2 (ridge) penalty; the regularisation is used here to avoid overfitting, again with the liblinear solver. The result is quite impressive: Recall 0.95 and AUC 0.99.
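A sketch with the L2 penalty set explicitly (the regularisation strength C is an assumption, since the original hyperparameters are not shown):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score

# Logistic regression with an explicit L2 (ridge) penalty; C=1.0 is an assumed value
ridge_logreg = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
ridge_logreg.fit(X_train, y_train)

y_pred = ridge_logreg.predict(X_test)
y_score = ridge_logreg.predict_proba(X_test)[:, 1]

print("Recall:", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```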

ROC AUC Curve for Logistic Regression with Ridge L2 Penalty

SMOTE + Logistic Regression

In the fourth model we try to tackle the imbalanced dataset by oversampling the minority class, fraud. The SMOTE technique was described by Nitesh Chawla et al. in their 2002 paper named for the technique, “SMOTE: Synthetic Minority Over-sampling Technique.”

SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.

Specifically, a random example from the minority class is first chosen. Then k of the nearest neighbors for that example are found (typically k=5). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.

This procedure can be used to create as many synthetic examples for the minority class as are required. The paper suggests first using random undersampling to trim the number of examples in the majority class, then using SMOTE to oversample the minority class to balance the class distribution.

As can be seen in the code below, the original fraud class has 492 transactions, and after SMOTE oversampling the fraud class has 284,315 transactions, matching the majority class.
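A sketch of this step with imbalanced-learn's SMOTE; note that the sketch resamples only the training split (the post's counts suggest the full dataset was resampled), and k=4 is the neighbour count stated below:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score

print("Before SMOTE:", Counter(y_train))

# Oversample the minority (fraud) class with k=4 nearest neighbours
smote = SMOTE(k_neighbors=4, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_res))  # both classes now have the same number of rows

# Train logistic regression on the resampled data, evaluate on the untouched test set
logreg_smote = LogisticRegression(solver="liblinear")
logreg_smote.fit(X_res, y_res)

y_pred = logreg_smote.predict(X_test)
y_score = logreg_smote.predict_proba(X_test)[:, 1]
print("Recall:", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```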

This work uses k=4 neighbors, and the result is as follows: Recall 0.94 and AUC 0.96.

ROC AUC Curve for SMOTE + Logistic Regression

Undersampling + Multi Layer Perceptron Classifier

In contrast to oversampling, in undersampling we cap the majority class at the size of the minority class, discarding the remaining majority-class transactions so that both classes end up with the same number of examples. After undersampling the majority class (normal transactions), we train a Multi Layer Perceptron Classifier.

Class MLPClassifier implements a multi-layer perceptron (MLP) algorithm that trains using backpropagation. It is a supervised learning algorithm that learns a function f(·): R^m → R^o by training on a dataset, where m is the number of dimensions for input and o is the number of dimensions for output. Given a set of features X = x1, x2, …, xm and a target y, it can learn a non-linear function approximator for either classification or regression. It is different from logistic regression in that, between the input and the output layer, there can be one or more non-linear layers, called hidden layers.
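A sketch of this combination using imbalanced-learn's RandomUnderSampler and scikit-learn's MLPClassifier (the hidden-layer architecture and training settings are assumptions, as the original notebook is not shown here):

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import recall_score, roc_auc_score

# Keep only as many normal transactions as there are frauds
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X_train, y_train)
print("After undersampling:", Counter(y_res))

# MLP with one hidden layer (assumed architecture), trained with backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=42)
mlp.fit(X_res, y_res)

y_pred = mlp.predict(X_test)
y_score = mlp.predict_proba(X_test)[:, 1]
print("Recall:", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```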

Here is the undersampling + Multi Layer Perceptron Classifier result: Recall 0.89 and AUC 0.93.

ROC AUC Curve for Undersampling + MLPC

Autoencoder

AutoEncoder Architecture

An autoencoder is a special type of neural network that is trained to copy its input to its output. For example, given an image of a handwritten digit, an autoencoder first encodes the image into a lower dimensional latent representation, then decodes the latent representation back to an image. An autoencoder learns to compress the data while minimizing the reconstruction error. Autoencoders can also be used for non-image data such as audio or text.

The autoencoder is trained to minimize reconstruction error. We will train it on the normal transactions only, then use it to reconstruct all the data, and classify a transaction as an anomaly if its reconstruction error surpasses a fixed threshold.

In this case, we use 10 epochs, batch_size = 64, input_dim = 30 (the number of features), an encoding dimension of 14, a hidden dimension of 7 and a learning rate of 1e-7. The summary of our autoencoder is as follows:

Autoencoder model summary
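A minimal Keras sketch with the hyperparameters listed above; the layer activations and the anomaly threshold are assumptions, since the original model is shown only as a summary image:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hyperparameters from the text; activations are assumed
input_dim, encoding_dim, hidden_dim = 30, 14, 7

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="tanh")(inputs)
encoded = layers.Dense(hidden_dim, activation="relu")(encoded)
decoded = layers.Dense(encoding_dim, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-7), loss="mse")
autoencoder.summary()

# Train on normal transactions only, so the autoencoder never sees fraud during training
X_train_normal = X_train[y_train.values == 0]
history = autoencoder.fit(
    X_train_normal, X_train_normal,
    epochs=10, batch_size=64,
    validation_data=(X_test, X_test),
    shuffle=True,
)

# Flag a transaction as fraud when its reconstruction error exceeds a fixed threshold
reconstructions = autoencoder.predict(X_test)
mse = np.mean(np.square(X_test - reconstructions), axis=1)
threshold = 2.5  # hypothetical value, normally chosen from the reconstruction-error distribution
y_pred = (mse > threshold).astype(int)
```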

The model loss is shown below:

Autoencoder Model Loss

From the autoencoder model loss we can say the model does well on the training set from the first epoch, but overfits on the test set even from the first epoch. Recall that the more complex the model, the more prone it is to overfitting.

As we might expect, the result is rather poor, as shown by the evaluation metrics below: Recall 0.86 and AUC 0.5, an AUC score exactly the same as a random or dummy classifier.

Comparison Result

Across the six machine learning methods executed (Dummy Classifier, Vanilla Logistic Regression, Logistic Regression with Ridge Penalty, SMOTE + Log Reg, Undersampling + MLPC, and Autoencoder), here is a recap of the evaluation metrics obtained:

Evaluation Metrics Comparison of Various Models

From the results above, the best model, with the highest Recall and AUC scores, is Logistic Regression with Ridge, followed by SMOTE + Log Reg and then Undersampling + Multi Layer Perceptron Classifier. Logistic Regression performs well here because it predicts the probability of the class label as a function of the independent variables by passing the input through the logistic/sigmoid function and treating the result as a probability. The Multi Layer Perceptron Classifier is also a good fit for this dataset, since it is a simple neural network trained with straightforward backpropagation.

The worst model is the Autoencoder, since it has to classify by reconstructing the data and thresholding the reconstruction error, and the fact that a more complex model is more prone to overfitting should also be a consideration when choosing the right model.

Summary

The dataset covers credit card transactions made by European cardholders in September 2013. It presents transactions that occurred over two days, with 492 detected frauds out of 284,807 transactions.

The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. Even though the dataset is highly imbalanced, the models used in this case can still predict fraudulent transactions well. The model with the highest Recall and AUC scores is Logistic Regression with Ridge (Recall 0.95 and AUC 0.99), followed by SMOTE + Log Reg and then Undersampling + Multi Layer Perceptron Classifier. This is a case where a deep model is not necessarily the best solution. Considering this, the author suggests also trying other non-deep-learning models such as SVM, KNN, Boosting and Stacking, or XGBoost.

Source Code Archive

See this work on my GitHub!

Note: if you have any issues with the GitHub link, you can clone the repo from your terminal or view it here

References

IBM Machine Learning Foundation, Deep Learning & Reinforcement Learning, 2020

IBM Machine Learning Foundation, Supervised Learning: Classification, 2020

IBM Data Science For Professional: Machine Learning, 2020

https://www.tensorflow.org/tutorials/generative/autoencoder

https://scikit-learn.org/

https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

Kaggle.com
