Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Choosing the parameter k for k-means clustering is a challenge
The elbow method is a common heuristic, but it is not reliable
Better alternatives have been known for a long time
Educators should discuss the problems of the elbow method and teach alternatives
Researchers and reviewers should reject conclusions drawn from the elbow method

Paper Content

Introduction

Cluster analysis is used to identify subgroups in data
No single definition of a cluster exists
Different algorithms are used to find the best solution
K-means clustering is the most used and taught clustering method
K-means is simple and runs quickly
Choosing the number of clusters is a key problem with k-means

K-means clustering

K-means clustering is a least-squares optimization problem.
It is a data quantization technique that approximates a data set of N objects in a continuous, d-dimensional vector space using k cluster centers.
The standard algorithm has a complexity of O(Nkdi), where i is the number of iterations.
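As an illustration of the least-squares objective and of where the N, k, d and i factors come from, here is a minimal sketch of the standard algorithm in plain Python. The function names `sse` and `kmeans`, the naive random initialization, and the handling of empty clusters are assumptions made for this sketch, not details taken from the paper.

```python
import random

def sse(points, centers, labels):
    """Sum of squared Euclidean deviations of each point from its assigned center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )

def kmeans(points, k, max_iter=100, seed=0):
    """Standard (Lloyd-style) k-means on a list of tuples; returns (centers, labels, sse)."""
    rng = random.Random(seed)
    # Naive initialization (an assumption): pick k distinct data points as centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):  # up to i iterations
        # Assignment step: N * k squared distances, each costing O(d).
        new_labels = [
            min(range(k), key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels, sse(points, centers, labels)
```

Calling `kmeans(points, k)` for k = 1, 2, 3, ... and collecting the third return value gives the SSE curve that the criteria discussed below operate on.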

The elbow criterion

Choosing the number of clusters (k) can be tricky
Elbow plot is a chart plotting the approximation error SSE on the y-axis over a range of values for k on the x-axis
Elbow method attributed to Thorndike
Problems associated with elbow plot

Elbow detection

Several attempts to formalize the notion of an “elbow” have been made in software and literature.
Sugar et al. proposed a “jump method” that looks for the largest jump in the SSE raised to the power −Y, where Y is a power parameter.
Salvador et al. proposed the L-method, which fits linear functions to the points before and after the break.
Satopää et al. proposed the Kneedle algorithm to measure the curvature.
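To make the idea of fitting linear functions before and after the break concrete, the following is a simplified sketch in the spirit of the L-method. The function names, the weighting of the two fit errors by the number of points on each side, and the convention of reporting the last k of the left segment are assumptions of this sketch and may differ from the published method.

```python
def line_rmse(xs, ys):
    """Root-mean-squared error of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return (sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n) ** 0.5

def l_method(ks, sse_values):
    """Return the k at which two straight lines fit the SSE curve best."""
    n = len(ks)
    best_k, best_score = None, float("inf")
    for split in range(2, n - 1):  # at least two points on each side
        left = line_rmse(ks[:split], sse_values[:split])
        right = line_rmse(ks[split:], sse_values[split:])
        # Weight each fit error by the share of points it covers.
        score = (split * left + (n - split) * right) / n
        if score < best_score:
            best_k, best_score = ks[split - 1], score
    return best_k
```

Note that a two-point segment is always fitted perfectly, so the result can change when the analyzed range of k changes.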

Detection performance

Heuristics based on the geometric idea of an elbow point
Table 1 gives results for toy data sets
Heuristics sensitive to range of k analyzed
Root-mean-squared deviation (RMSD) more meaningful than SSE
Normalizing should preserve 0
Method should be able to choose k = 1 for data without meaningful clusters

Expected behavior of SSE

We need to better understand the quantity we are working with
We should use a form of sample variance
We should use the square root of the quantity to make it more interpretable
We assume the input data is uniformly distributed in a single dimension
We propose to use the naïve estimate SSE_1/k as normalization factor
We can generate a standard deviation reduction plot to compare observed and estimated values
We can use cluster evaluation criteria such as Silhouette, VRC, and DB-Index
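The comparison of observed and estimated values described above can be sketched as follows. This assumes the SSE curve has already been computed for k = 1, 2, ..., that SSE_1 is positive, and that dividing by the number of objects N is the intended "form of sample variance"; the function name `sd_reduction_table` and the ratio column are illustrative choices, not taken from the paper.

```python
import math

def sd_reduction_table(sse_values, n):
    """Compare observed deviation with the naive SSE_1/k reference.

    sse_values[i] is the SSE obtained for k = i + 1; n is the number of objects.
    Returns rows of (k, observed, estimated, ratio) on the square-root scale.
    """
    sse_1 = sse_values[0]
    rows = []
    for k, sse_k in enumerate(sse_values, start=1):
        observed = math.sqrt(sse_k / n)          # root-mean-squared deviation
        estimated = math.sqrt(sse_1 / k / n)     # naive reference SSE_1/k
        rows.append((k, observed, estimated, observed / estimated))
    return rows
```

Under these assumptions, a ratio close to 1 means the clustering does no better than the naive reference, while a ratio well below 1 indicates a larger reduction than expected. Plotting the observed and estimated columns over k gives one possible form of the standard deviation reduction plot.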

Variance-based criteria

Distance-based criteria

Dunn index compares the diameter of clusters to the cluster separation
Davies-Bouldin-Index compares the distance to the nearest other cluster with the radius of the two clusters
Average silhouette width measure compares the average distance of each point to its own cluster to the average distance to the nearest other cluster
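As a concrete example of a distance-based criterion, here is a minimal sketch of the average silhouette width in plain Python. It needs all pairwise distances and is therefore only practical for small data sets; the function name and the convention of scoring singleton clusters as 0 are assumptions of this sketch.

```python
import math

def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all points, using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # convention: a point alone in its cluster scores 0
        # a: average distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: average distance to the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Values close to 1 indicate compact, well separated clusters; values near or below 0 indicate that points are on average as close to another cluster as to their own. Because the measure is undefined for a single cluster, it cannot select k = 1.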

Information-theoretic criteria

Principle of minimum description length is used to choose optimum number of clusters
Increasing number of centers means data is approximated more closely and more cluster centers need to be stored
X-means and G-means algorithms use Bayesian Information Criterion and Anderson-Darling tests respectively to decide when to accept a new cluster
K-means minimizes squared deviations (the SSE), not Euclidean or Manhattan distances, but is often good enough for many applications

Simulation-based criteria

Gap statistic estimates a baseline SSE_k by clustering uniform random data sets
Gap statistic chooses the smallest k such that Gap_k ≥ Gap_{k+1} − s_{k+1}
Gap statistic works decently well for synthetic data, but not for uniform data
Estimated number of clusters is unstable with default sample sizes
Suggest using VRC, BIC, or Gap statistic to choose k
Pay attention to preprocessing data for k-means
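The selection step of the Gap statistic can be sketched as below, following the usual formulation: Gap_k is the mean log-SSE of the uniform reference data sets minus the log-SSE of the real data, s_k is the standard deviation of the reference log-SSE values inflated by sqrt(1 + 1/B) for B reference sets, and the smallest k with Gap_k ≥ Gap_{k+1} − s_{k+1} is chosen. The function names are hypothetical, and the SSE values are assumed to have been computed beforehand by running k-means on the real data and on each reference data set.

```python
import math

def gap_and_error(sse_k, reference_sses):
    """Gap_k and s_k from the observed SSE and the SSEs of B reference data sets."""
    logs = [math.log(v) for v in reference_sses]
    b = len(logs)
    mean = sum(logs) / b
    sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / b)
    return mean - math.log(sse_k), sd * math.sqrt(1 + 1 / b)

def choose_k(gaps, errors):
    """gaps[i] and errors[i] belong to k = i + 1; returns None if no k qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return i + 1  # smallest k satisfying the rule
    return None
```

Because the rule starts at k = 1, this criterion is able to report that the data contains no meaningful clusters. The result depends on the random reference sets, so a small B can make the chosen k vary between runs.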

The true challenges of k-means

Choosing the parameter k is difficult for the user
Obtaining meaningful results from k-means is harder than expected
k-means assumes errors are the same across the data space
k-means assumes clusters have the same spherical shape
k-means does not work well in certain situations
k-means is not suitable for complex data

Conclusion

Elbow method is commonly used in education, online media, and clustering research
Alternatives such as VRC, BIC, and Gap statistics should be preferred
Problems of elbow approach have been discussed in literature
Educators should omit the method or explain better alternatives
Data scientists should not rely on evaluation measures to determine “best” solution
