Clustering Algorithms: From Start To State Of The Art
K-Means Clustering
The algorithm begins by selecting k points as starting centroids (‘centers’ of clusters). We can just select any k random points, or we can use some other approach, but picking random points is a good start. Then, we iteratively repeat two steps:
Assignment step: each of m points from our dataset is assigned to a cluster that is represented by the closest of the k centroids. For each point, we calculate distances to each centroid, and simply pick the least distant one.
Update step: for each cluster, a new centroid is calculated as the mean of all points in the cluster. From the previous step, we have a set of points which are assigned to a cluster. Now, for each such set, we calculate a mean that we declare a new centroid of the cluster.
After each iteration, the centroids are slowly moving, and the total distance from each point to its assigned centroid gets lower and lower. The two steps are alternated until convergence, meaning until there are no more changes in cluster assignment. After a number of iterations, the same set of points will be assigned to each centroid, therefore leading to the same centroids again. K-Means is guaranteed to converge to a local optimum. However, that does not necessarily have to be the best overall solution (global optimum).
The final clustering result can depend on the selection of initial centroids, so a lot of thought has been given to this problem. One simple solution is just to run K-Means a couple of times with random initial assignments. We can then select the best result by taking the one with the minimal sum of distances from each point to its cluster – the error value that we are trying to minimize in the first place.
Other approaches to selecting initial points can rely on selecting distant points. This can lead to better results, but we may have a problem with outliers, those rare alone points that are just off that may just be some errors. Since they are far from any meaningful cluster, each such point may end up being its own ‘cluster’. A good balance is K-Means++ variant [Arthur and Vassilvitskii, 2007], whose initialization will still pick random points, but with probability proportional to
K-Means++ is also the default initialization for Python’s Scikit-learn K-Means implementation. If you’re using Python, this may be your library of choice. For Java, Weka library may be a good start:
Java (Weka)
// Load some data Instances data = DataSource.read("data.arff"); // Create the model SimpleKMeans kMeans = new SimpleKMeans(); // We want three clusters kMeans.setNumClusters(3); // Run K-Means kMeans.buildClusterer(data); // Print the centroids Instances centroids = kMeans.getClusterCentroids(); for (Instance centroid: centroids) { System.out.println(centroid); } // Print cluster membership for each instance for (Instance point: data) { System.out.println(point + " is in cluster " + kMeans.clusterInstance(point)); }
Python (Scikit -learn)
>>> from sklearn import cluster, datasets >>> iris = datasets.load_iris() >>> X_iris = iris.data >>> y_iris = iris.target >>> k_means = cluster.KMeans(n_clusters=3) >>> k_means.fit(X_iris) KMeans(copy_x=True, init='k-means++', ... >>> print(k_means.labels_[::10]) [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2] >>> print(y_iris[::10]) [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
In the Python example above, we used a standard example dataset ‘Iris’, which contains flower petal and sepal dimensions for three different species of iris. We clustered these into three clusters, and compared the obtained clusters to the actual species (target), to see that they match perfectly.
In this case, we knew that there were three different clusters (species), and K-Means recognized correctly which ones go together. But, how do we choose a good number of clusters k in general? These kind of questions are quite common in Machine Learning. If we request more clusters, they will be smaller, and
A way to deal with this problem is to include some penalty for a larger number of clusters. So, we are now trying to minimize not only the
Another thing we have to define properly is the distance function. Sometimes that’s a straightforward task, a logical one given the nature of data. For points in space, Euclidean distance is an obvious solution, but it may be tricky for features of different ‘units’, for discrete variables, etc. This may require a lot of domain knowledge. Or, we can call Machine Learning for help. We can actually try to learn the distance function. If we have a training set of points that we know how they should be grouped (i.e. points labeled with their clusters), we can use supervised learning techniques to find a good function, and then apply it to our target set that is not yet clustered.
The method used in K-Means, with its two alternating steps resembles an Expectation–Maximization (EM) method. Actually, it can be considered a very simple version of EM. However, it should not be confused with the more elaborate EM clustering algorithm even though it shares some of the same principles
EM Clustering
So, with K-Means
We now start the EM procedure by calculating, for each point, the probabilities of it belonging to each of the current clusters (which, again, may be randomly created at the beginning). This is the E-step. If one cluster is a really good candidate for a point, it will have a probability close to one. However, two or more clusters can be acceptable candidates, so the point has a distribution of probabilities over clusters. This property of the algorithm, of points not belonging restricted to one cluster is called soft clustering.
The M-step now recalculates the parameters of each cluster, using the assignments of points to the previous set of clusters. To calculate the new mean, covariance and weight of a cluster, each point data is weighted by its probability of belonging to the cluster, as calculated in the previous step.
Alternating these two steps will increase the total log-likelihood until it converges. Again, the maximum may be local, so we can run the algorithm several times to get better clusters.
If we now want to determine a single cluster for each point, we may simply choose the most probable one. Having a probability model, we can also use it to generate similar data, that is to sample more points that are similar to the data that we observed.
Affinity Propagation
Affinity Propagation (AP) was published by Frey and Dueck in
The main drawbacks of K-Means and similar algorithms are having to select the number of clusters, and choosing the initial set of points. Affinity Propagation, instead, takes as input measures of similarity between pairs of data
As an input, the algorithm requires us to provide two sets of data:
Similarities between data points, representing how well-suited a point is to be another one’s
exemplar . If there’s no similarity between two points, as in they cannot belong to the same cluster, this similarity can be omitted or set to -Infinity depending onimplementation .Preferences, representing each data point’s suitability to be an
exemplar . We may have some a priori information which points could be favored for this role, and so we can represent it through preferences.
Both similarities and preferences are often represented
The algorithm then runs through a number of iterations, until it converges. Each iteration has two message-passing steps:
Calculating responsibilities: Responsibility r(
i , k) reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for pointi , taking into account other potential exemplars for point i. Responsibility is sent from data pointi to candidateexemplar point k.Calculating availabilities: Availability
a( k) reflects the accumulated evidence for how appropriate it would be for pointi ,i to choose point k as itsexemplar , taking into account the support from other points that point k should be anexemplar . Availability is sent from candidateexemplar point k to point i.
In order to calculate responsibilities, the algorithm uses original similarities and availabilities calculated in the previous iteration (initially, all availabilities are set to zero). Responsibilities are set to the input similarity between point
Calculating availabilities, then, uses calculated responsibilities as evidence whether each candidate would make a good exemplar. Availability
Finally, we can have different stopping criteria to terminate the procedure, such as when changes in values fall below some threshold, or the maximum number of iterations is reached. At any point through Affinity Propagation procedure, summing Responsibility (r) and Availability (a) matrices gives us the clustering information we need: for point
We’ve seen that with K-Means and similar algorithms, deciding the number of clusters can be tricky. With AP, we don’t have to explicitly specify it, but it may still need some tuning if we obtain either more or
Hierarchical Affinity Propagation is also worth mentioning, as a variant of the algorithm that deals with quadratic complexity by splitting the dataset into a couple of subsets, clustering them separately, and then performing the second level of clustering.
In The End…
There’s a whole range of clustering algorithms, each one with its pros and cons regarding what type of data they with, time complexity, weaknesses, and so on. To mention more algorithms,
With Hierarchical Agglomerative Clustering, we can easily decide the number of clusters
Then, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is also an algorithm worth mentioning. It groups points that are closely packed together, expanding clusters in any direction where there are nearby points, thus dealing with different shapes of clusters.
These algorithms deserve an article of their own, and we can explore them in a separate post later on.