## Clustering Boosting

*3 February 2017*

Clustering is one of the most fundamental tasks in machine learning. It consists in grouping similar objects together without any supervision, meaning there are no labels to guide the grouping. The goal is to find structure in data to facilitate further analysis. It is a desirable goal in data analysis and visualization, and is often a preliminary step in many algorithms.

This post is a kind of commentary on the article *An Ensemble Method for Clustering*.

In this post, I am agnostic to your favorite clustering algorithm. Here is a brief list of some famous algorithms, each paired with an article I like very much:

- Hierarchical Clustering: Hierarchical Grouping to Optimize an Objective Function
- K-Means: 50 years beyond K-means
- Spectral Clustering: Self-Tuning Spectral Clustering
- ...
- and variants...

All these algorithms are ready to use in Python via Scikit-learn. What all clustering algorithms have in common is their output, which can take one of two forms:

- a hard clustering: a sequence of integers, one per element, giving the identifier (id) of its cluster.
- a soft clustering: a probability table where each row corresponds to an element and each column to a cluster class; entry $(i, k)$ is the probability that element $i$ belongs to class $k$ (each row sums to one).
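As an illustration of the two output forms (KMeans for a hard clustering, and a Gaussian mixture as one common way to obtain a soft one — the post itself does not prescribe specific algorithms):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(100, 2)  # toy dataset of N = 100 points

# Hard clustering: one integer cluster id per point
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
# hard.shape == (100,), values in {0, 1, 2}

# Soft clustering: one row of class probabilities per point
soft = GaussianMixture(n_components=3, random_state=0).fit(X).predict_proba(X)
# soft.shape == (100, 3), each row sums to one
```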

One can remark that hard clustering is a special case of soft clustering. Indeed, the soft version of a hard clustering can be computed through a one-hot-vector representation. So, without loss of generality, I can assume I have soft clustering outputs.
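The one-hot conversion mentioned above can be sketched as follows (`hard_to_soft` is a hypothetical helper name):

```python
import numpy as np

def hard_to_soft(labels, K):
    """Turn a hard clustering (sequence of N integer cluster ids)
    into an N x K one-hot probability table."""
    labels = np.asarray(labels)
    soft = np.zeros((labels.size, K))
    soft[np.arange(labels.size), labels] = 1.0  # put all mass on the assigned cluster
    return soft

hard_to_soft([0, 2, 1, 0], K=3)
# each row is a one-hot vector, so each row sums to one
```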

Suppose you have a dataset of $N$ points to cluster into $K$ groups, and you are given $M$ soft clustering outputs (from $M$ random initializations, for example). The aim of this post is to find a way to aggregate these outputs into one final soft clustering output. In terms of optimization, one can write: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \Vert p(i) - C_m(\sigma_m(i)) \Vert_2^2$$

where:

- $p$ is the final soft clustering output ($p(i)_k$ is the probability that the $i^\text{th}$ point belongs to the $k^\text{th}$ class)
- $C_m$ is one of the soft clustering inputs
- $\sigma_m$ is a permutation over the indices between $1$ and $K$, solving the switching problem of the clustering table $C_m$ relative to $p$

The above objective is optimized greedily; this strategy is a heuristic, which means there is no guarantee of reaching the optimal solution. Unlike in Supervised Classification, in Unsupervised Classification (a.k.a. Clustering) the classes are anonymous. It means that if we permute the columns of a soft clustering table, the original and permuted clusterings are still the same --- they represent the same partition of the points but with different indices.

This so-called switching problem is handled here by the $\sigma$ functions. Indeed, if we did not have that switching problem, the solution would simply be the mean of the different clustering probability tables. Fortunately, the Hungarian Method can solve this type of linear assignment problem, based on the co-occurrence matrix of the clustering probability tables.
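A minimal sketch of this alignment step, using SciPy's implementation of the Hungarian method (`align_columns` is a hypothetical helper name):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_columns(p, c):
    """Permute the columns of soft clustering table `c` (N x K) so that
    they best match the columns of reference table `p` (N x K).
    The K x K co-occurrence matrix p.T @ c measures the overlap between
    each pair of columns; the Hungarian method finds the column
    permutation maximizing the total overlap."""
    cooc = p.T @ c
    _, col_ind = linear_sum_assignment(-cooc)  # negate to maximize overlap
    return c[:, col_ind]
```

For instance, two one-hot tables encoding the same partition with swapped cluster ids are mapped onto each other by this function.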

To sum up, given two soft clusterings, one first solves the switching problem between the class indices, then accumulates the mean probability table following this formula: $$p^{m+1} = p^{m} + \frac{1}{m+1} \left( C_m(\sigma_m) - p^{m} \right)$$ for $m$ from $1$ to $M$. Finally, the intuition behind this algorithm is that the most stable and hard cluster assignments among the points will remain in the final output, while the impact of uncertain regions is limited. Here, the notion of certainty can be measured with the entropy of each row of the probability tables.
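Putting the pieces together, the whole aggregation can be sketched as below (function names are mine; the per-row entropy implements the certainty measure mentioned above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aggregate(clusterings):
    """Greedy ensemble aggregation of a list of M soft clustering
    tables (each N x K): align each table to the running mean with
    the Hungarian method, then fold it into the incremental mean."""
    p = clusterings[0].astype(float)
    for m, c in enumerate(clusterings[1:], start=1):
        _, col_ind = linear_sum_assignment(-(p.T @ c))  # solve the switching problem
        p += (c[:, col_ind] - p) / (m + 1)              # incremental mean update
    return p

def uncertainty(p):
    """Per-point entropy of the aggregated table: low entropy marks
    stable assignments, high entropy marks uncertain regions."""
    q = np.clip(p, 1e-12, 1.0)
    return -(q * np.log(q)).sum(axis=1)
```

On tables that encode the same partition up to a permutation of the cluster ids, the aggregate reduces to that common partition, with zero entropy everywhere.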