Customer Segmentation & the Importance of Dimensionality Reduction: A Credit Card/P2P Lending Case

Fitrie Ratnasari
Feb 10, 2021 · 11 min read

One of the essential tasks in any business is targeting the right product at the right customers, which boosts the success of promotions, keeps up retention, and subsequently increases profit. This task relies heavily on how well a data scientist can distinguish customers and segment the whole customer base.

With the same idea in mind, the objective of this project is to perform customer segmentation on a credit card dataset using several clustering models: KMeans, Hierarchical Agglomerative Clustering, DBSCAN, and Mean Shift. The aim is to enable appropriate promotions targeted at particular segments based on this research.

The dataset comes from IBM Cloud Object Storage. It describes customers' credit card usage and contains basic background attributes such as age, years employed, education, and address, as well as credit information such as card debt, other debt, and debt-to-income ratio. This work successfully separates the customers into distinct clusters that can be analysed further and utilised by the business.

The dataset is in CSV format and consists of 850 entries (rows) with 9 attributes of various data types (integer, object, and float), as shown in the picture below:

Data Cleansing & Feature Engineering

After acquiring the dataset, we found that the unrelated variables needed to be dropped, so we removed the ‘address’ and ‘defaulted’ columns.
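As a minimal sketch of this step (the URL is the one from the reference list; the exact raw column names 'Address' and 'Defaulted' are assumptions):

```python
import pandas as pd

# Load the credit card dataset (URL taken from the reference list).
url = ("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/"
       "IBMDeveloperSkillsNetwork-ML0101EN-Coursera/labs/Data_files/Cust_Segmentation.csv")
df = pd.read_csv(url)

# Drop the variables that are unrelated to the segmentation task.
# 'Address' and 'Defaulted' are assumed to be the raw column names.
df = df.drop(columns=["Address", "Defaulted"])
df.info()
```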

The feature engineering steps executed in this work are:

1. Log transformation for highly skewed data points. As the picture below shows, Card Debt, Income, Other Debt, Education, Debt Income Ratio, and Years Employed are right-skewed, so we apply a log transform to bring them closer to a normal distribution.

2. Feature scaling on all columns to improve clustering performance, since distance metrics are central to any clustering model. In this work we use StandardScaler for the feature scaling.

3. Dimensionality reduction. Even though there are not many features, in this work we would like to see how much dimensionality reduction helps clustering performance. The techniques used are PCA (Principal Component Analysis) and Kernel PCA, so the scaled dataset is transformed into its PCA or Kernel PCA representation before segmentation or clustering.

The idea of PCA is to retain as much of the variance of the original features as possible while decomposing the variables; Kernel PCA does the same, but first maps the data into a higher-dimensional space through a kernel.

On this dataset, PCA and Kernel PCA yield similar per-component variance, as shown below: with only 4 components we can retain about 90% of the variance of all variables. So whenever PCA or Kernel PCA is used later, we consistently use 4 components.

The picture below shows the feature importance across the various components; a code sketch of these preprocessing steps follows.
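A minimal sketch of these preprocessing steps, assuming df is the cleaned DataFrame from above and that the skewed columns carry the names listed below (the RBF kernel for Kernel PCA is also an assumption, since the article does not state which kernel was used):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA

# Log-transform the right-skewed columns (column names are assumptions).
skewed_cols = ["Card Debt", "Income", "Other Debt", "Edu",
               "Years Employed", "DebtIncomeRatio"]
df[skewed_cols] = np.log1p(df[skewed_cols])  # log(1 + x) also handles zeros

# Standardise every feature so distance-based clustering is not
# dominated by large-valued columns.
X_scaled = StandardScaler().fit_transform(df)

# Project onto 4 components, which retains roughly 90% of the variance here.
pca = PCA(n_components=4)
X_pca = pca.fit_transform(X_scaled)
print("Cumulative explained variance:", pca.explained_variance_ratio_.cumsum())

kpca = KernelPCA(n_components=4, kernel="rbf")
X_kpca = kpca.fit_transform(X_scaled)
```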

Exploratory Data Analysis

Distribution of Each Variable, Correlation between Variables, and Skewness

Using a pairplot, we can easily visualise the distribution of each variable in the dataset (as histograms on the diagonal) and how correlated each variable is with the others.
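A one-line sketch with seaborn, assuming df is the cleaned DataFrame:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots with per-variable histograms on the diagonal.
sns.pairplot(df)
plt.show()
```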

Clustering Models

1. KMeans

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This algorithm requires the number of clusters to be specified. It scales well to large number of samples and has been used across a large range of application areas in many different fields [4].

Before running the KMeans clustering algorithm, we need to find the best number of clusters, K, i.e. the one that generates the highest Silhouette Score. The Silhouette Score is a quantitative evaluation metric for clustering (well suited when the ground-truth labels are not known), where a higher Silhouette Coefficient relates to a model with better-defined clusters.

The Silhouette Score is higher when clusters are dense and well separated, which relates to the standard concept of a cluster.

a. KMeans without Dimensionality Reduction

As can be seen in the result below, k=3 is the best hyperparameter to go with (we avoid 2 clusters from a business standpoint), since it has the highest silhouette score.

After running KMeans clustering with K=3, the results are:

The Silhouette Score is 0.20, indicating that some clusters still overlap and that each cluster has low density.
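A sketch of this selection and the final fit, assuming X_scaled is the standardised feature matrix from the preprocessing step (the fixed random_state is only for reproducibility and not from the original work):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep candidate cluster counts and score each with the silhouette metric.
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)
    print(f"k={k}: silhouette={silhouette_score(X_scaled, labels):.3f}")

# Fit the final model with the chosen k=3.
labels_plain = KMeans(n_clusters=3, random_state=42).fit_predict(X_scaled)
print("KMeans (no reduction):", silhouette_score(X_scaled, labels_plain))
```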

b. KMeans with PCA as Dimensionality Reduction

By applying PCA as dimensionality reduction, we found the interesting fact that dimensionality reduction improves clustering performance and subsequently enhances interpretability, as can be seen in the picture below:

The result above is obtained with the optimum k, found by taking the maximum Silhouette Score as in the picture below; we chose 3 as the optimum number of clusters.

Finally, the quantitative performance is calculated, giving a Silhouette Score of 0.23, which is noticeably better than KMeans without dimensionality reduction.

c. KMeans with Kernel PCA as Dimensionality Reduction

It is also interesting that Kernel PCA works much better than either conventional KMeans or KMeans with PCA as its dimensionality reduction, as can be seen in the result below:

In the result above, the clusters obtained with Kernel PCA are much denser and overlap far less. This is consistent with the Silhouette Score of 0.27, higher than both conventional KMeans and KMeans using PCA as dimensionality reduction.
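A short sketch of this best-performing combination, reusing X_kpca (the 4-component Kernel PCA transform) from the preprocessing step:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# KMeans on the Kernel PCA projection with the chosen k=3.
labels_kpca = KMeans(n_clusters=3, random_state=42).fit_predict(X_kpca)
print("KMeans + Kernel PCA:", silhouette_score(X_kpca, labels_kpca))
```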

2. Hierarchical Agglomerative Clustering

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criteria determines the metric used for the merge strategy:

  • Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
  • Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
  • Average linkage minimizes the average of the distances between all observations of pairs of clusters.
  • Single linkage minimizes the distance between the closest observations of pairs of clusters.

Similar to KMeans, the hyperparameter of Agglomerative Clustering can be found by choosing the maximum Silhouette Score. In this case we use Kernel PCA as dimensionality reduction, since we already know it outperforms the alternatives.

The result below shows that n=5 is the optimum number of clusters (again, we avoid 2 clusters from a business standpoint).

After fitting this model, 5 clusters are generated, as the picture on the right above shows, and the dendrogram can be seen in the picture below:

Below is a 3D visualization of the Agglomerative Clustering result. It looks very similar to KMeans, since Ward is used as the linkage criterion, and the silhouette score of 0.23 is also not far from KMeans.
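A sketch of this model and its dendrogram on the Kernel PCA components:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Bottom-up clustering with Ward linkage and the chosen n_clusters=5.
agg = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels_agg = agg.fit_predict(X_kpca)
print("Agglomerative + Kernel PCA:", silhouette_score(X_kpca, labels_agg))

# Dendrogram of the same Ward hierarchy, truncated to keep it readable.
Z = linkage(X_kpca, method="ward")
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Ward linkage dendrogram")
plt.show()
```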

3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means which assumes that clusters are convex shaped. The central component to the DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure) and a set of non-core samples that are close to a core sample (but are not themselves core samples). There are two parameters to the algorithm, min_samples and eps, which define formally what we mean when we say dense. Higher min_samples or lower eps indicate higher density necessary to form a cluster.

More formally, we define a core sample as being a sample in the dataset such that there exist min_samples other samples within a distance of eps, which are defined as neighbors of the core sample. This tells us that the core sample is in a dense area of the vector space. A cluster is a set of core samples that can be built by recursively taking a core sample, finding all of its neighbors that are core samples, finding all of their neighbors that are core samples, and so on. A cluster also has a set of non-core samples, which are samples that are neighbors of a core sample in the cluster but are not themselves core samples. Intuitively, these samples are on the fringes of a cluster.

Any core sample is part of a cluster, by definition. Any sample that is not a core sample, and is at least eps in distance from any core sample, is considered an outlier by the algorithm.

With the same procedure as before, we need to choose the optimum hyperparameters for the model from the highest Silhouette Score, as in the result below (here k denotes the minimum number of samples, min_samples, in DBSCAN):

We therefore choose min_samples=3 and eps=1.3 to generate a number of clusters that makes sense from a business standpoint. After running DBSCAN on the dataset, the results are shown in the 3D visualization below, and the table on the right shows that 5 clusters have been generated from the chosen hyperparameters.

The silhouette score obtained is 0.10; as can be seen in the visualization above, there are a lot of overlapping clusters compared to the previous two models.
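A sketch of this model with the chosen hyperparameters; running it on the Kernel PCA components, as with the previous models, is an assumption since the article does not state the exact input:

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# DBSCAN with the hyperparameters chosen above; label -1 marks noise points.
labels_db = DBSCAN(eps=1.3, min_samples=3).fit_predict(X_kpca)

n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print("Clusters found:", n_clusters)
print("DBSCAN silhouette:", silhouette_score(X_kpca, labels_db))
```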

4. Mean Shift

MeanShift clustering aims to discover blobs in a smooth density of samples. It is a centroid based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.

Mean Shift builds upon the concept of kernel density estimation (KDE). KDE is a method to estimate the underlying distribution, also called the probability density function, of a set of data.

It works by placing a kernel on each point in the data set. A kernel is a fancy mathematical word for a weighting function generally used in convolution. There are many different types of kernels, but the most popular one is the Gaussian kernel. Adding up all of the individual kernels generates a probability surface (a density function). Depending on the kernel bandwidth parameter used, the resultant density function will vary.

The algorithm automatically sets the number of clusters, instead of relying on a parameter bandwidth, which dictates the size of the region to search through. This parameter can be set manually, but can be estimated using the provided estimate_bandwidth function, which is called if the bandwidth is not set.

The algorithm is not highly scalable, as it requires multiple nearest neighbor searches during the execution of the algorithm. The algorithm is guaranteed to converge, however the algorithm will stop iterating when the change in centroids is small.

With the same procedure as before, we need to choose the optimum hyperparameter for the model. We choose bandwidth = 2, as having 3 clusters still makes sense from a business standpoint and makes it easier to compare the various models later on.
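A sketch of this model; running it on the scaled features is an assumption, and estimate_bandwidth is the scikit-learn helper mentioned above:

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics import silhouette_score

# bandwidth = estimate_bandwidth(X_scaled, quantile=0.2)  # automatic alternative
ms = MeanShift(bandwidth=2)
labels_ms = ms.fit_predict(X_scaled)

print("Clusters found:", len(ms.cluster_centers_))
print("Mean Shift silhouette:", silhouette_score(X_scaled, labels_ms))
```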

The Silhouette Score of the model is 0.271, meaning there are only a few overlapping clusters, which is as good as KMeans with Kernel PCA.

Comparison on Various Clustering Performance

After running the four different models, it is interesting to compare their quantitative performance, as can be seen in the table below:

Clustering Performance Result

*The number of clusters was generated by trying different combinations of hyperparameters to obtain the highest Silhouette Score while keeping the number of clusters appropriate from a business standpoint.
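A sketch of how such a comparison table can be assembled from the label arrays kept in the earlier sketches (all variable names refer to those sketches, not to the original code):

```python
import pandas as pd
from sklearn.metrics import silhouette_score

# Collect the silhouette scores of the fitted models into one table.
comparison = pd.DataFrame({
    "model": ["KMeans (no reduction)", "KMeans + Kernel PCA",
              "Agglomerative + Kernel PCA", "DBSCAN", "Mean Shift"],
    "silhouette": [silhouette_score(X_scaled, labels_plain),
                   silhouette_score(X_kpca, labels_kpca),
                   silhouette_score(X_kpca, labels_agg),
                   silhouette_score(X_kpca, labels_db),
                   silhouette_score(X_scaled, labels_ms)],
})
print(comparison.sort_values("silhouette", ascending=False))
```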

Key Findings and Insight

From the results above, here are the key takeaways and insights obtained:

1. Dimensionality reduction works really well to enhance clustering performance on this dataset, with Kernel PCA giving the highest performance, followed by PCA.

2. Compared to conventional models that do not use dimensionality reduction, models with dimensionality reduction produce denser clusters that are more clearly separated from each other, with less overlap, so the results also become easier to interpret intuitively.

3. The Mean Shift algorithm performs as well as KMeans with Kernel PCA, since Mean Shift relies on kernel density estimation to distinguish between clusters.

4. DBSCAN turned out to be the worst clustering model for this dataset. The reason behind its low score is that one of DBSCAN's weaknesses is handling clusters of varying density, which is exactly what this dataset contains.

5. To better understand how well each model works on the dataset, here are the 3D animations of the various clustering results:

KMeans algorithm without dimensionality reduction. Silhouette score obtained: 0.20
KMeans algorithm using PCA as dimensionality reduction. Silhouette score obtained: 0.23
KMeans algorithm using Kernel PCA as dimensionality reduction. Silhouette score obtained: 0.27
Hierarchical Agglomerative Clustering algorithm using Kernel PCA as dimensionality reduction. Silhouette score obtained: 0.20
DBSCAN algorithm. Silhouette score obtained: 0.10
Mean Shift algorithm. Silhouette score obtained: 0.27

Suggestion

There are several suggestions to take the analysis further:

  1. Use a dataset that also includes Recency, Frequency, and Monetary value for each customer, so that the performance of these four models can be compared with the well-known RFM clustering analysis.
  2. Use dimensionality reduction techniques such as Isomap and t-SNE and compare the resulting performance with PCA and Kernel PCA.

Source Code Archive

See here for the source code on my GitHub.

References

  1. IBM Machine Learning Foundation, Unsupervised Learning, 2020
  2. IBM Data Science For Professional: Machine Learning, 2020
  3. Dataset: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-Coursera/labs/Data_files/Cust_Segmentation.csv
  4. KMeans Clustering:
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
  5. Hierarchical Agglomerative Clustering:
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
  6. DBSCAN:
    https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
  7. Mean Shift: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html
  8. Silhouette Score: https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
  9. Skewness in Statistics: https://www.statisticshowto.com/probability-and-statistics/skewed-distribution/#:~:text=A%20left%2Dskewed%20distribution%20has,has%20a%20long%20right%20tail.
  10. Animation 3D Rotation:
    http://blog.mahler83.net/2019/10/rotating-3d-t-sne-animated-gif-scatterplot-with-matplotlb
  11. Mean Shift Clustering:
    https://www.geeksforgeeks.org/ml-mean-shift-clustering/
