Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Choosing the parameter k for k-means clustering is a challenge
- The elbow method is a common heuristic, but it is not reliable
- Better alternatives have been known for a long time
- Educators should discuss the problems of the elbow method and teach alternatives
- Researchers and reviewers should reject conclusions drawn from the elbow method
Paper Content
Introduction
- Cluster analysis is used to identify subgroups in data
- No single definition of a cluster exists
- Different algorithms are used to find the best solution
- K-means clustering is the most used and taught clustering method
- K-means is simple and runs quickly
- Choosing the number of clusters is a key problem with k-means
K-means clustering
- K-means clustering is a least-squares optimization problem.
- It is a data quantization technique that approximates a data set of N objects in a continuous, d-dimensional vector space.
- The standard algorithm has a complexity of O(N kdi).
The elbow criterion
- Choosing the number of clusters (k) can be tricky
- Elbow plot is a chart plotting the approximation error SSE on the y-axis over a range of values for k on the x-axis
- Elbow method attributed to Thorndike
- Problems associated with elbow plot
Elbow detection
- Several attempts to formalize the notion of an “elbow” have been made in software and literature.
- Sugar et al. proposed a “jump method” to find the max- where Y is a power parameter.
- Salvador et al. proposed the L-method, which fits linear functions to the points before and after the break.
- Satopää et al. proposed the Kneedle algorithm to measure the curvature.
Detection performance
- Heuristics based on the geometric idea of an elbow point
- Table 1 gives results for toy data sets
- Heuristics sensitive to range of k analyzed
- Root-mean-squared deviation (RMSD) more meaningful than SSE
- Normalizing should preserve 0
- Method should be able to choose k = 1 for data without meaningful clusters
Expected behavior of sse
- We need to better understand the quantity we are working with
- We should use a form of sample variance
- We should use the square root of the quantity to make it more interpretable
- We assume the input data is uniformly distributed in a single dimension
- We propose to use the naïve estimate SSE1 /k as normalization factor
- We can generate a standard deviation reduction plot to compare observed and estimated values
- We can use cluster evaluation criteria such as Silhouette, VRC, and DB-Index
Variance-based criteria
Distance-based criteria
- Dunn index compares the diameter of clusters to the cluster separation
- Davies-Bouldin-Index compares the distance to the nearest other cluster with the radius of the two clusters
- Average silhouette width measure compares the average distance of each point to its own cluster to the average distance to the nearest other cluster
Information-theoretic criteria
- Principle of minimum description length is used to choose optimum number of clusters
- Increasing number of centers means data is approximated more closely and more cluster centers need to be stored
- X-means and G-means algorithms use Bayesian Information Criterion and Anderson-Darling tests respectively to decide when to accept a new cluster
- K-means does not minimize Euclidean or Manhattan distance, but is often good enough for many applications
Simulation-based criteria
- Gap statistic estimates baseline SSE k by clustering uniform random data sets
- Gap statistic chooses k by comparing Gap k and Gap k−1 −s k+1
- Gap statistic works decently well for synthetic data, but not for uniform data
- Estimated number of clusters is unstable with default sample sizes
- Suggest using VRC, BIC, or Gap statistic to choose k
- Pay attention to preprocessing data for k-means
The true challenges of k-means
- Choosing the parameter k is difficult for the user
- Obtaining meaningful results from k-means is harder than expected
- k-means assumes errors are the same across the data space
- k-means assumes clusters have the same spherical shape
- k-means does not work well in certain situations
- k-means is not suitable for complex data
Conclusion
- Elbow method is commonly used in education, online media, and clustering research
- Alternatives such as VRC, BIC, and Gap statistics should be preferred
- Problems of elbow approach have been discussed in literature
- Educators should omit the method or explain better alternatives
- Data scientists should not rely on evaluation measures to determine “best” solution