Customer Segmentation Report
Customer segmentation is a way to identify groups of similar customers. Customers can be segmented on a wide variety of characteristics, such as demographic information, purchase behaviour, and attitudes. This template provides an end-to-end report for processing and segmenting customer purchase data using a K-means clustering algorithm. It also includes a snake plot and heatmap to visualize the resulting clusters and feature importance.
To use this template, your dataset should contain:
- Multiple numerical variables that you can use for clustering.
- No NaN/NA values. If your data contains missing values, impute or drop them first (see the sketch below).
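If your data does contain missing values, the minimal sketch below shows one way to handle them before running the rest of the report. It assumes the same placeholder file path used later in this template, and median imputation is an illustrative choice, not a recommendation for every dataset.

# Minimal sketch: fill missing values in numeric columns with the column median
# (median imputation is an illustrative choice; adjust to suit your data)
import pandas as pd

df = pd.read_csv("data/customer_data.csv")  # Same placeholder path as below
df = df.fillna(df.median(numeric_only=True))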
The dataset consists of customer data, including purchase recency, frequency, and monetary value. Each row represents a different customer with a distinct customer ID.
1. Loading Packages and Inspecting the Data
The code below imports the packages necessary for data manipulation, visualization, pre-processing, and clustering. It also sets up the visualization style and loads in the data. Finally, it inspects the data types and missing values with the .info() method from pandas.
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# Set visualization style
sns.set_style("darkgrid")
# Load the data and replace with your CSV file path
df = pd.read_csv("data/customer_data.csv")
# Preview the data
df
| | CustomerID | Recency | Frequency | MonetaryValue |
|---|---|---|---|---|
| 0 | 12747 | 3 | 25 | 948.70 |
| 1 | 12748 | 1 | 888 | 7046.16 |
| 2 | 12749 | 4 | 37 | 813.45 |
| 3 | 12820 | 4 | 17 | 268.02 |
| 4 | 12822 | 71 | 9 | 146.15 |
| ... | ... | ... | ... | ... |
| 3638 | 18280 | 278 | 2 | 38.70 |
| 3639 | 18281 | 181 | 2 | 31.80 |
| 3640 | 18282 | 8 | 2 | 30.70 |
| 3641 | 18283 | 4 | 152 | 432.93 |
| 3642 | 18287 | 43 | 15 | 395.76 |

3643 rows × 4 columns
# Check columns for data types and missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3643 entries, 0 to 3642
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 3643 non-null int64
1 Recency 3643 non-null int64
2 Frequency 3643 non-null int64
3 MonetaryValue 3643 non-null float64
dtypes: float64(1), int64(3)
memory usage: 114.0 KB
2. Exploring the Data
Based on the evaluation above, you can select the columns you wish to inspect further. In this template, three of the four columns are selected: CustomerID is omitted because it is an identifier and not useful for clustering.
The code below reduces the DataFrame to the columns you wish to cluster on and then prints descriptive statistics using the describe() method from pandas.
Printing descriptive statistics is helpful because K-means clustering makes several key assumptions that this exploration can check:
- There is no skewness to the data.
- The variables have the same average values.
- The variables have the same variance.
If you’d like to learn more about pre-processing data for K-means clustering, you can refer to this video from the course Customer Segmentation in Python.
# Select columns for clustering
columns_for_clustering = ["Recency", "Frequency", "MonetaryValue"]
# Create new DataFrame with clustering variables
df_features = df[columns_for_clustering]
# Print a summary of descriptive statistics
df_features.describe()
| | Recency | Frequency | MonetaryValue |
|---|---|---|---|
| count | 3643.00000 | 3643.000000 | 3643.000000 |
| mean | 90.43563 | 18.714247 | 370.694387 |
| std | 94.44651 | 43.754468 | 1347.443451 |
| min | 1.00000 | 1.000000 | 0.650000 |
| 25% | 19.00000 | 4.000000 | 58.705000 |
| 50% | 51.00000 | 9.000000 | 136.370000 |
| 75% | 139.00000 | 21.000000 | 334.350000 |
| max | 365.00000 | 1497.000000 | 48060.350000 |
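As a quick quantitative companion to the summary above (an optional addition to this template), the pandas skew() method reports the skewness of each clustering variable; values far from zero suggest the transformation applied later will be needed.

# Optional: quantify the skewness of each clustering variable
# (values far from 0 indicate a skewed distribution)
df_features.skew()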
The FacetGrid() class from seaborn creates a grid of histograms of the data to be clustered. It serves as a further exploration of the data to determine its skew and whether it needs transformation.
# Plot the distributions of the selected variables
g = sns.FacetGrid(
    df_features.melt(),  # Reformat the DataFrame for plotting purposes
    col="variable",  # Split on the 'variable' column created by reformatting
    sharey=False,  # Turn off shared y-axis
    sharex=False,  # Turn off shared x-axis
)
# Apply a histogram to the facet grid
g.map(sns.histplot, "value")
# Adjust the top of the plots to make room for the title
g.fig.subplots_adjust(top=0.8)
# Create a title
g.fig.suptitle("Unprocessed Variable Distributions", fontsize=16)
plt.show()
Before proceeding, it is crucial to ensure that all columns selected for clustering are numeric. The following code iterates through the reduced DataFrame and checks whether each column is numeric. If it returns True, you can proceed with the pre-processing.
all([pd.api.types.is_numeric_dtype(df_features[col]) for col in columns_for_clustering])
True
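An equivalent, slightly more compact alternative (an optional addition, not part of the original template) is to count the numeric columns directly with select_dtypes():

# Optional alternative: all selected columns are numeric if the counts match
df_features.select_dtypes(include="number").shape[1] == len(columns_for_clustering)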
3. Pre-processing the Data
Based on the grids above, if there is a skew, you will have to complete this step, which removes the skew and centers the variables. This is the case for the placeholder dataset used in this template and will likely be the case for your data.
- First, a log transformation is applied to the data using the numpy log() function. A log transformation unskews the data in preparation for clustering.
- Next, the StandardScaler() from sklearn.preprocessing fits and transforms the log-transformed data. This centers and scales the data in further preparation for clustering.
- Finally, a new DataFrame is created and visualized again to confirm the results.
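One caveat worth flagging before running the code below: np.log() is only defined for strictly positive values. The quick guard sketched here is an optional addition to this template; if it fails for your data, np.log1p(), which computes log(1 + x), is a common substitute.

# Optional guard: log() requires strictly positive values in every column.
# If this raises, consider np.log1p(), which is defined at zero.
assert (df_features > 0).all().all(), "Non-positive values found; consider np.log1p"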
# Perform a log transformation of the data to unskew the data
df_log = np.log(df_features)
# Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(df_log)
# Scale and center the data
df_normalized = scaler.transform(df_log)
# Create a pandas DataFrame of the processed data
df_processed = pd.DataFrame(
    data=df_normalized, index=df_features.index, columns=df_features.columns
)
# Plot the distributions of the selected variables
g = sns.FacetGrid(df_processed.melt(), col="variable")
g.map(sns.histplot, "value")
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle("Preprocessed Variable Distributions", fontsize=16)
plt.show()
4. Choosing the Number of Clusters
The next step is to fit K-means with a varying number of clusters and plot the sum of squared errors (SSE) for each. The SSE reflects the sum of squared distances from every data point to its cluster center. The aim is to reduce the SSE while still maintaining a reasonable number of clusters.
By plotting the SSE for each number of clusters, you can identify at what point there are diminishing returns by adding new clusters. This type of plot is called an elbow plot.
In the code below, you can set the maximum number of clusters you want to plot; a loop then generates the SSE for each number of clusters. Finally, the seaborn function pointplot() plots a curve of the SSE against each cluster number. This allows you to identify the 'elbow': the point beyond which each additional cluster yields only marginal reductions in SSE.
# Set the maximum number of clusters to plot
max_clusters = 10
# Initialize empty dictionary to store sum of squared errors
sse = {}
# Fit KMeans and calculate SSE for each k
for k in range(1, max_clusters + 1):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(df_processed)
    # Assign the sum of squared distances to the k element of the dictionary
    sse[k] = kmeans.inertia_
# Initialize a figure of set size
plt.figure(figsize=(10, 4))
# Create an elbow plot of SSE values for each key in the dictionary
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
# Add labels to the plot
plt.title("Elbow Method Plot", fontsize=16) # Add a title to the plot
plt.xlabel("Number of Clusters") # Add x-axis label
plt.ylabel("SSE") # Add y-axis label
# Show the plot
plt.show()
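As an optional complement to the elbow plot (an addition to this template, not part of the original workflow), silhouette scores from sklearn.metrics offer another view on cluster quality; higher scores indicate denser, better-separated clusters.

from sklearn.metrics import silhouette_score

# Optional check: silhouette score for each candidate k (defined for k >= 2)
for k in range(2, max_clusters + 1):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(df_processed)
    print(f"k={k}: silhouette score = {silhouette_score(df_processed, labels):.3f}")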
5. Clustering the Data
You can now select an optimal number of clusters based on the elbow plot above by setting k. In this example, k is set to 3.

KMeans() from sklearn.cluster with k clusters is then fit to the processed data, and the cluster labels are extracted and assigned back to the original data. This allows you to inspect the raw data by cluster in later steps.
# Choose number of clusters
k = 3
# Initialize KMeans
kmeans = KMeans(n_clusters=k, random_state=1)
# Fit k-means clustering on the normalized data set
kmeans.fit(df_processed)
# Extract cluster labels
cluster_labels = kmeans.labels_
# Create a new DataFrame by adding a new cluster column to the original data
df_clustered = df.assign(Cluster=cluster_labels)
# Preview the clustered DataFrame
df_clustered
| | CustomerID | Recency | Frequency | MonetaryValue | Cluster |
|---|---|---|---|---|---|
| 0 | 12747 | 3 | 25 | 948.70 | 0 |
| 1 | 12748 | 1 | 888 | 7046.16 | 0 |
| 2 | 12749 | 4 | 37 | 813.45 | 0 |
| 3 | 12820 | 4 | 17 | 268.02 | 0 |
| 4 | 12822 | 71 | 9 | 146.15 | 2 |
| ... | ... | ... | ... | ... | ... |
| 3638 | 18280 | 278 | 2 | 38.70 | 1 |
| 3639 | 18281 | 181 | 2 | 31.80 | 1 |
| 3640 | 18282 | 8 | 2 | 30.70 | 1 |
| 3641 | 18283 | 4 | 152 | 432.93 | 0 |
| 3642 | 18287 | 43 | 15 | 395.76 | 2 |

3643 rows × 5 columns
6. Inspecting the Clusters
6a. Visualizing the Raw Values by Cluster
The next step is to analyze the unprocessed data by cluster. The pandas method DataFrame.groupby(), combined with the .size() method, returns the total number of rows per Cluster.
# Group the data by cluster and calculate the total number of rows per group
df_sizes = df_clustered.groupby(["Cluster"], as_index=False).size()
# Inspect the row counts
df_sizes
| | Cluster | size |
|---|---|---|
| 0 | 0 | 901 |
| 1 | 1 | 1156 |
| 2 | 2 | 1586 |
Next, the mean values per cluster are visualized. The data is grouped again, and this time the pandas method .mean() is used to aggregate the data by cluster and calculate the mean of each variable. Alternatively, the .agg() method can be used to specify different aggregations for different columns if necessary; consult the pandas documentation for further information on the types of aggregations possible. An illustrative example follows below.
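For instance, a hypothetical .agg() call (the column-to-aggregation mapping here is illustrative, not part of this template) could compute means for two variables and the median for a third:

# Hypothetical example: different aggregations per column with .agg()
df_clustered.groupby("Cluster").agg(
    {"Recency": "mean", "Frequency": "mean", "MonetaryValue": "median"}
)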
The seaborn catplot() function visualizes the means per cluster.
# Calculate the mean of feature columns by cluster
df_means = df_clustered.groupby(["Cluster"])[df_features.columns].mean().reset_index()
# Plot the distributions of the selected variables
sns.catplot(
    data=df_means.melt(id_vars="Cluster"),  # Transform the data to enable plotting
    col="variable",
    x="Cluster",
    y="value",
    kind="bar",
)
# Add a title
plt.suptitle("Average Values by Cluster", y=1.04, fontsize=16)
# Show the plot
plt.show()
6b. Create a Snake Plot of the Clusters
The next step takes the processed data and visualizes the differences between the clusters using a snake plot. This can be helpful in spotting trends or key differences that would not be visible with the raw data.
# Assign cluster labels to processed DataFrame
df_processed_clustered = df_processed.assign(Cluster=cluster_labels)
# Melt the normalized DataFrame and reset the index
df_processed_melt = pd.melt(
    df_processed_clustered.reset_index(),
    # Assign the cluster labels as the ID
    id_vars=["Cluster"],
    # Assign clustering variables as values
    value_vars=df_features.columns,
    # Name the variable and value columns
    var_name="Metric",
    value_name="Value",
)
# Change the figure size
plt.figure(figsize=(10, 6))
# Add label and titles to the plot
plt.title('Snake Plot of Normalized Variables', fontsize=16)
plt.xlabel('Metric')
plt.ylabel('Average Normalized Value')
# Plot a line for each value of the cluster variable
sns.lineplot(data=df_processed_melt, x='Metric', y='Value', hue='Cluster')
plt.show()
6c. Create a Heatmap of Relative Importance
Another technique to help visualize how each segment is distinct is to plot the relative importance. The code below achieves this by doing the following:
- First, it calculates the average values for each cluster.
- Next, it calculates the average values for the total population.
- It then divides the cluster averages by the population averages and subtracts one.
This provides a relative importance score for each of the features used for clustering. The seaborn heatmap() function plots these scores on a red-to-blue colour scale to help visualize the relative importance of each attribute to the segments.
# Calculate average RFM values for each cluster
cluster_avg = df_clustered.groupby(["Cluster"])[columns_for_clustering].mean()
# Calculate average RFM values for the total customer population
population_avg = df[columns_for_clustering].mean()
# Calculate relative importance of cluster's attribute value compared to the population
relative_imp = cluster_avg / population_avg - 1
# Change the figure size
plt.figure(figsize=(8, 4))
# Add the plot title
plt.title("Relative importance of attributes", fontsize=16)
# Plot the heatmap
sns.heatmap(data=relative_imp, annot=True, fmt=".2f", cmap=sns.diverging_palette(20, 220, as_cmap=True))
plt.show()
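If you want to reuse the segment assignments downstream, one option is to export the clustered DataFrame; the output path below is a placeholder assumption.

# Optional: save the clustered data for downstream use
# (the output path is a placeholder; adjust to your project)
df_clustered.to_csv("data/customer_data_clustered.csv", index=False)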
This concludes the report! We hope you found it enjoyable and insightful.