A Guide to Data Clustering Methods in Python
Clustering is the process of grouping data based on shared characteristics. Industries as disparate as retail, finance, and healthcare use clustering techniques for a variety of analytical tasks. In retail, clustering can help identify distinct consumer populations, which allows a business to create targeted ads for demographics that would be too complicated to identify by manual inspection. In finance, clustering can detect different forms of illegal market activity, such as order book spoofing, in which traders place large deceptive orders to pressure other traders into buying or selling an asset. In healthcare, clustering methods have been used to characterize patient cost patterns, early-onset neurological disorders, and cancer gene expression.
Python offers many useful tools for performing cluster analysis. The best tool to use depends on the problem to be solved and the type of data available. This guide covers three widely used techniques: K-means clustering, Gaussian mixture models, and spectral clustering. For relatively small tasks (several dozen entries at most) such as identifying distinct consumer populations, K-means clustering is an excellent choice. For more complicated tasks such as detecting illegal market activity, a more robust and flexible model such as a Gaussian mixture model is better suited. Finally, for large problems with potentially thousands of entries, spectral clustering is often the best option.
In addition to selecting a suitable clustering algorithm for the problem, you should also have a way to assess how well that algorithm performs. Typically, performance is measured by the average distance of each observation from the center of its cluster, called the centroid, which captures the compactness of a cluster. This makes sense because a good clustering algorithm should generate tightly packed groups of data: the closer the data points within a cluster are to each other, the better the results of the algorithm. Plotting the within-cluster sum of squared distances against the number of clusters is a common way to assess performance.
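To make this concrete, here is a minimal sketch of computing the within-cluster sum of squares by hand; the `wcss` helper below is illustrative, not a library function:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: sum, over every point, of the
    squared distance from that point to its cluster's centroid."""
    total = 0.0
    for j, c in enumerate(centroids):
        diffs = X[labels == j] - c   # offsets of cluster j's points
        total += (diffs ** 2).sum()
    return total

# Two points in one cluster, each 1 unit from the centroid at (1, 0)
points = np.array([[0.0, 0.0], [2.0, 0.0]])
print(wcss(points, np.array([0, 0]), np.array([[1.0, 0.0]])))  # 2.0
```

Scikit-learn computes this same quantity for a fitted K-means model and exposes it as the `inertia_` attribute, which we will use below.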
For our purposes, we will perform a customer segmentation analysis on the Mall Customers dataset.
Data clustering techniques in Python
 K-means clustering
 Gaussian mixture models
 Spectral clustering
Reading the data
Let’s start by reading our data into a Pandas DataFrame:
import pandas as pd
df = pd.read_csv("Mall_Customers.csv")
print(df.head())
We see that our data is quite simple. It contains a column of customer IDs, along with gender, age, income, and a column that denotes a spending score on a scale of 1 to 100. The objective of our clustering exercise will be to generate distinct groups of customers, where the members of each group are more similar to one another than to members of other groups.
K-means clustering
K-means clustering is a type of unsupervised machine learning, which means that the algorithm trains on inputs only, with no labeled outputs. It works by finding distinct groups of data (i.e., clusters) whose members are close to each other. Specifically, it partitions the data into clusters such that each point belongs to the cluster whose mean, the centroid, is closest to that point.
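That assign-then-update idea can be sketched in a few lines of NumPy. This is a toy illustration of Lloyd’s algorithm, not what Scikit-learn runs internally:

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Toy K-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every centroid: shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # nearest centroid per point
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Demo on two well-separated blobs
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                   rng.normal(10.0, 0.5, size=(20, 2))])
labels, centroids = kmeans_sketch(blobs, k=2)
```

In practice you would never hand-roll this; Scikit-learn’s implementation adds smarter initialization (k-means++) and convergence checks.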
Let’s import the KMeans class from the cluster module in Scikit-learn:
from sklearn.cluster import KMeans
Next, let’s define the inputs we’ll use for our K-means clustering algorithm. Let’s use the age and spending score:
X = df[['Age', 'Spending Score (1-100)']].copy()
The next thing we need to do is determine the number of clusters to use. We will use the elbow method, which plots the within-cluster sum of squares (WCSS) against the number of clusters. We need to define a for loop that creates an instance of the KMeans class for each cluster count from one through ten. We’ll also initialize a list that we’ll use to collect the WCSS values:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
We then append the WCSS values to our list. We access these values via the inertia_ attribute of the KMeans object:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
Finally, we can plot the WCSS as a function of the number of clusters. First, let’s import Matplotlib and Seaborn, which will allow us to create and format data visualizations:
import matplotlib.pyplot as plt
import seaborn as sns
Let’s style the plots using Seaborn:
sns.set()
Then plot the WCSS against the clusters:
plt.plot(range(1, 11), wcss)
Then add a title:
plt.title('Selecting the Number of Clusters using the Elbow Method')
And finally, name the axes:
plt.xlabel('Clusters')
plt.ylabel('WCSS')
plt.show()
From this graph, we can see that four is the optimal number of clusters, as this is where the “bend” of the curve appears.
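With the cluster count chosen, fitting the final model and plotting the labeled customers looks roughly like the following sketch. The column names come from the article’s dataset, but the data here is a synthetic stand-in so the snippet runs on its own; in the article’s flow you would reuse the `X` built from the CSV:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Synthetic stand-in for the Mall Customers columns used in the article
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Age': rng.integers(18, 70, size=200),
    'Spending Score (1-100)': rng.integers(1, 101, size=200),
})

# Fit the final four-cluster model and label each customer
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

plt.scatter(X['Age'], X['Spending Score (1-100)'], c=labels)
plt.title('Clusters Identified by K-means')
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.show()
```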
We can see that K-means found four clusters, which break down as follows:

Young customers with a moderate spending score.

Young customers with a high spending score.

Middle-aged customers with a low spending score.

Senior customers with a moderate spending score.
This type of information can be very useful for retail businesses looking to target ads at specific consumer demographics. For example, if most of the people with high spending scores are younger, the company can target those populations with ads and promotions.
Gaussian mixture model (GMM)
This model assumes that the clusters can be modeled with Gaussian distributions. Gaussian distributions, informally known as bell curves, are functions that describe many natural phenomena, such as the heights and weights of a population.
These models are useful because Gaussian distributions have well-defined properties such as the mean, variance, and covariance. The mean is simply the average value of an input within a cluster. Variance measures how much the values of a single input fluctuate. Covariance is a matrix of statistics describing how inputs relate to one another and, more specifically, how they vary together.
Collectively, these parameters allow the GMM algorithm to flexibly identify clusters of complex shapes. While K-means generally identifies spherically shaped clusters, GMM can identify clusters of many different shapes. This makes GMM more robust than K-means in practice.
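To see what “flexible shapes” means in practice, the following sketch (with synthetic data, purely for illustration) fits a GMM to two elongated, non-spherical clusters and inspects the fitted parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two elongated clusters: one stretched along x, one along y
a = rng.normal(size=(200, 2)) * [3.0, 0.3]
b = rng.normal(size=(200, 2)) * [0.3, 3.0] + [10.0, 0.0]
data = np.vstack([a, b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gmm.means_)        # one 2-D mean per component
print(gmm.covariances_)  # full covariance matrices capture the elongation
```

Because each component carries a full covariance matrix, the model can stretch and rotate each cluster independently, which a single per-cluster centroid (as in K-means) cannot express.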
Let’s start by importing the GMM package from Scikitlearn:
from sklearn.mixture import GaussianMixture
Next, let’s initialize an instance of the GaussianMixture class. Let’s start with three clusters and fit the model to our inputs (in this case, age and spending score):
n_clusters = 3
gmm_model = GaussianMixture(n_components=n_clusters)
gmm_model.fit(X)
Now, let’s generate the cluster labels and store them in a new column of our data frame:
cluster_labels = gmm_model.predict(X)
X['cluster'] = cluster_labels
Next, let’s define a list of colors and plot each cluster in a for loop:
colors = ['blue', 'green', 'red', 'black', 'yellow']
for k in range(n_clusters):
    data = X[X["cluster"] == k]
    plt.scatter(data["Age"], data["Spending Score (1-100)"], c=colors[k])
And, finally, format the plot:
plt.title("Clusters Identified by Gaussian Mixture Model")
plt.ylabel("Spending Score (1-100)")
plt.xlabel("Age")
plt.show()
The red and blue clusters appear relatively well defined. The blue cluster represents young customers with a high spending score, and the red represents young customers with a moderate spending score. The green cluster is less well defined, since it spans all ages and low to moderate spending scores.
Now let’s try four clusters:
...
n_clusters = 4
gmm_model = GaussianMixture(n_components=n_clusters)
...
Although four clusters show a slight improvement, the red and blue clusters are still quite broad in their age and spending score values. So let’s try five clusters:
...
n_clusters = 5
gmm_model = GaussianMixture(n_components=n_clusters)
...
Five clusters seem appropriate here. They can be described as follows:

Young customers with a high spending score (green).

Young customers with a moderate spending score (black).

Young to middle-aged customers with a low spending score (blue).

Middle-aged to senior customers with a low spending score (yellow).

Middle-aged to senior customers with a moderate spending score (red).
Gaussian mixture models are generally more robust and flexible than K-means clustering. Because GMM captures complex cluster shapes while K-means does not, it can accurately identify clusters more complex than the spherical ones K-means finds. This makes GMM an ideal method for datasets of moderate size and complexity.
Spectral clustering
Spectral clustering is a commonly used method for cluster analysis on large, often complex datasets. It works by performing dimensionality reduction on the input and generating clusters in the reduced-dimensional space. Since our data doesn’t have many entries, this section is mainly for illustration, but it should be straightforward to apply the method to larger, more complex datasets.
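A classic way to see why this reduced-space approach matters (an illustration on synthetic data, not part of the mall-customer analysis) is the two-moons dataset, whose interleaving crescents are not separable by centroid distance alone:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering

# Two interleaving half-circles: centroid-based K-means struggles here,
# while spectral clustering, which groups points in an eigenvector
# (reduced-dimensional) space, recovers the crescents.
moons, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

spectral = SpectralClustering(
    n_clusters=2,
    affinity='nearest_neighbors',
    n_neighbors=10,
    random_state=0,
)
labels = spectral.fit_predict(moons)
```

The nearest-neighbors affinity builds a graph of local similarity, so clusters are defined by connectivity rather than by compactness around a center.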
Let’s start by importing the SpectralClustering class from the cluster module in Scikit-learn:
from sklearn.cluster import SpectralClustering
Next, let’s define our SpectralClustering class instance with five clusters:
spectral_cluster_model = SpectralClustering(
    n_clusters=5,
    random_state=25,
    n_neighbors=8,
    affinity='nearest_neighbors'
)
Next, let’s fit the model to our inputs and store the resulting cluster labels in the same data frame:
X['cluster'] = spectral_cluster_model.fit_predict(X[['Age', 'Spending Score (1-100)']])
Finally, let’s plot our clusters:
fig, ax = plt.subplots()
sns.scatterplot(x='Age', y='Spending Score (1-100)', data=X, hue="cluster", ax=ax)
ax.set(title="Spectral Clustering")
plt.show()
We see that clusters one, two, three, and four are quite distinct, while cluster zero looks quite broad. Generally, we see some of the same patterns as with K-means and GMM, although the earlier methods gave better separation between the clusters. Again, spectral clustering is best suited to problems involving much larger datasets, such as those with hundreds to thousands of features and millions of rows.
The code for this article is available on GitHub.
Add clustering to your toolbox
Although we have only considered cluster analysis in the context of customer segmentation, it is broadly applicable across industries. The clustering methods we have discussed have been used to solve a wide range of problems. K-means clustering has been used to identify vulnerable patient populations. Gaussian mixture models have been used to detect illegal market activity such as fraudulent trading, pump-and-dump schemes, and quote stuffing. Spectral clustering methods have been used on complex healthcare problems, such as clustering medical terms for healthcare knowledge discovery.
No matter the industry, any modern organization or business can find great value in being able to identify important clusters in its data. Python provides many easy-to-implement tools for performing cluster analysis at all levels of data complexity. In addition, knowing which methods work best given the complexity of the data is an invaluable skill for any data scientist. What we’ve covered provides a solid foundation for data scientists starting to learn how to perform cluster analysis.