Scientific Communications
 Abstract of Ph.D. dissertation (in English and French)
Warith Harchaoui
MAP5, Université Paris Descartes
Supervised by Charles Bouveyron
Winter 2019/2020
 Une introduction aux réseaux de neurones (An Introduction to Neural Networks presentation in French)
Warith Harchaoui
Young Statisticians and Probabilists, Institut Henri Poincaré
MAP5, Université Paris Descartes
42AI, École 42
Winters 2017/2018 and 2018/2019

Wasserstein Adversarial Mixture Clustering (WAMiC)
Warith Harchaoui, PierreAlexandre Mattei, Andrés Almansa and Charles Bouveyron
Data Science Summer School, École Polytechnique
Summer 2018
Poster

Deep Adversarial Gaussian Mixture AutoEncoder for Clustering
Warith HARCHAOUI, PierreAlexandre Mattei and Charles Bouveyron
StatLearn Workshop, Université Lumière
April 2017
Slides, Poster
 Distributed Garbage Collection in the NGrid Project
Joannès Vermorel and Warith Harchaoui
NGrid is an open source (LGPL) grid computing framework written in C#
Center for Computational Biology, MINES ParisTech
Summer 2005
Who Am I?
I am Warith Harchaoui

a French Data Scientist at Oscaro.com since 2014

a Ph.D. candidate in Statistics under the supervision of Pr. Charles Bouveyron in the MAP5 lab at Université Paris Descartes from Sep. 2016 to today:
Learning Representations using
Neural Networks and Optimal Transport
Clustering Boosting
7 February 2020
Clustering is one of the most fundamental tasks in machine learning. This task consists in grouping similar objects together without any supervision, meaning there is no labels to guide this grouping. The goal is to find structure in data to facilitate further analysis. It is a desirable goal in data analysis, visualization and is often a preliminary step in many algorithms.
The subject of this blog post is to propose an answer to:
How can we combine several clustering algorithms outputs into one?
Indeed, as suggested by Andreas Weingessel et al. in An Ensemble Method for Clustering work, combining supervised classification outputs is wellstudied and called boosting popularized and acclaimed in face detection with AdaBoost by Paul Viola and Michael Jones in practice and there is a thorough analysis in the theoretical book The Elements of Statistical Learning by Trevor Hastie et al.
Back to unsupervised classification (a.k.a. clustering), there is a considerable wealth of research and we would not dare to hope being unbiased nor exhaustive:
While being agnostic to the clustering algorithms, if we want to combine their output, we must choose a common format to represent them between:
 hard clustering: represented by a sequence of integer corresponding to the identifier (id) of each cluster.
 soft clustering: represented by a probability table where each line corresponds to each element and each column corresponds to a probability of belonging to each clustering class (each line sums to one).
Suppose you have a dataset of $N$ points to group into $K$ clusters and you are given $M$ soft clustering outputs, we want to find a way to eventually produce only one clustering aggregating all the other $M$ clustering outputs. To that end, we choose to reformat all clustering outputs into soft clustering thanks to onehotencoding (i.e. degenerated soft histograms for hard clustering) and also thanks to zeropadding in case of varying number of clusters. So we only manipulate soft clustering outputs without loss of generality. Moreover, unlike in supervised classification, for unsupervised classification (a.k.a clustering), the classes are anonymous : for hard clustering, the 11122 sequence over data is exactly the same clustering as 22211. For soft clustering, this means that randomly permuting the columns of the table does only change the clustering representation but not the clustering itself and this columnpermutation invariance is called the switching label problem. This is a problem because it gives several representations for the same machine learning clustering object.
In terms of optimization, one can propose to cast this problem into: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \Vert p_i  \sigma_m(C_{m,i}) \Vert$$ where:
 $p \in \mathbb{R}^{N \times K}$ is the final soft clustering table. $p_i \in \mathbb{R}^K$ is the probability vector of the $i^\text{th}$ point of belonging to each cluster: $p_{i,k} = \mathbb{P}(ki)$
 $C \in \mathbb{R}^{M \times N \times K}$ is the tensor containing all the $M$ soft clustering outputs. $C_m \in \mathbb{R}^{N \times K}$ is the soft representation of the $m^\text{th}$ clustering outputs. Thus, $C_{m,i}$ is the probability vector of the $i^\text{th}$ point over $K$ clusters according to the $m^\text{th}$ clustering algorithm.
 $\sigma_m$ is a permutation function over the clustering indices between $1$ and $K$. Hopefully at convergence, $\sigma_m(C_{m,i})$ is the probability vector of the $i^\text{th}$ point over $K$ clusters without the switching label problem which means that comparaisons accross the $M$ algorithms are now reasonable thanks to $\sigma$.
If we take a closer look to the proposed optimization problem, then we can see that the distance is not specified. Empirically, it turns out that the Hellinger distance which is specifically designed for histograms (positive $1$sum vectors) is a good choice: $$ \forall a \in \mathbb{R}^K, b \in \mathbb{R}^K $$ $$\Vert a  b \Vert_\text{Hellinger} = \sum_{k=1}^K (\sqrt{a_k}  \sqrt{b_k})^2 = \Vert \sqrt{a}  \sqrt{b} \Vert_2^2 = 2 \times (1  \sqrt{a}^\top \sqrt{b})$$ Mathematically, the Hellinger distance makes the histogram space (i.e. the positive $L_1$ sphere simplex) look spherical (in a the $L_2$ traditional euclidean sense) thanks to the square root which is cool for even rudimentary tool like a hyperplane that has more interpretable separation power cutting an $L_2$ sphere than cutting an $L_1$ sphere simplex as shown in this figure below.
Our optimization problem is very similar to the one of KMeans: (i) if $\sigma$ is given, then the optimal $p^*$ is found by a mean of square root probabilities and (ii) if the $p$ is given, then the optimal $\sigma^*$ is given by a cheap bruteforce strategy to make the switching label problem disappear. Likewise, an alternated optimization scheme is efficient to (not optimally) solve our clustering aggregation (or boosting in reference to its supervised counterpart). Of course, there is no way to guarantee the optimal solution exactly like with KMeans but the solution quality cannot worsen w.r.t that objective until convergence in the same spirit as KMeans. Finally, we can also nicely adapt the KMeans++ strategy for careful initialization. Indeed, it seems we do KMeans over many simplistic or sophisticated clustering algorithms which is amusingly paying an ancient algorithm the respect that it deserves.
Favorite Academic Books
12 December 2019
In random order:
 Pattern Recognition and Machine Learning (a.k.a PRML)
 Machine Learning: a Probabilistic Perspective
 Deep Learning
 Bayesian Reasoning and Machine Learning (a.k.a BRML)
 ModelBased Clustering and Classification for Data Science (a.k.a. MBC)
 The Elements of Statistical Learning
 Artificial Intelligence: A Modern Approach
 Computer Vision: A Modern Approach
 Fundamentals of speech recognition
 A Wavelet Tour of Signal Processing
 Introduction to Algorithms
 Numerical Optimization
 Convex Optimization
 Nonlinear Programming
 Algorithms
 Reinforcement Learning: An Introduction
 Computational Optimal Transport
 Information Theory, Inference, and Learning Algorithms
 Introduction to Information Retrieval
 Foundations of Statistical Natural Language Processing
 A Primer on Neural Network Models for Natural Language Processing
Small Data
6 April 2019
In 2012, I was privileged enough to attend to DivideandConquer and Statistical Inference in Big Data given in French by Michael I. Jordan who is a prominent figure in Machine Learning. That visionary talk begun by establishing what can still be considered at the crux of Data Science today, especially back in a time when neither Hadoop nor Spark were popularized:
Consider $N$ the number of elements of the database (cardinality) and $D$ the dimensionality of each element.
 Big Data regime $\frac{N}{D} \gg 1$: Developper's nightmare especially because data may not fit in memory but if the computations are doable, then statistics background give reasons for the systems to work very well especially since the rise of stochastic optimization methods;
 Small Data regime $\frac{N}{D} \ll 1$ or $\frac{N}{D} \simeq 1$: Statistician's nightmare especially because of the mathematical curse of dimensionality but the computations are easier than previously.
NB: This has not been said that way by Michael I. Jordan but only simplified by myself
Since then, data storage problems have been nicely solved at least thrice thanks to the decrease of hardware's costs (and thus the increase of available computational power especially with the rise of GPUs), increase of data access speed and software solutions like HDFS and Spark (for both distributed file and inmemory computer systems) on the developper's side but the curse of dimensionality remains on the statistician's side. Computations have also been improved but not sufficiently energywise (that is beyond that current blog post scope).
Enforcing data structure seems to be an efficient way to cope with the curse of dimensionality issue. For example, if one considers the huge progress accomplished with images from the 1990s to the 2010s, then one sees that methods with hierarchical convolutions have been the key to unlock good classification results at the historical ImageNet competition ($\sim 10^6$ images database with $\sim 10^6$ pixels each thus $\frac{N}{D} \simeq 1$ which corresponds to a Small Data regime). Likewise, words ordering turns out to be the key to Natural Language Processing representations but despite worldwide Research attention in Genetics, Machine Learning tends to be uneffective yet: $1$ human being $D \simeq 10^{12}$ nucleotides and databases are limited to the worldwide population ($\sim 10^9$), thus $\frac{N}{D} \ll 1$ which corresponds to a Small Data regime once again. We did not yet leverage any desirable but unknown structure among those nucleotides to get an easier Big Data regime.
The takehomemessage of this post is:
 Big Data is easier than Small Data.
 Structure transforms Small Data problems into easier Big Data problems.
($D_{\text{new}} \ll D_{\text{old}}$)
Indeed, structure translates data knowledge and human expertise into an understandable statistical language that is more suitable for a computer. By the way, if $N$ is not large ($\sim 10^ 1$ or $\sim 10 ^ 2$) then statistical estimators do not have much relevance in such very Small Data regime ($N \leq 10 ^ 2$).
What is a Data Scientist?
10 February 2017
One saying goes: A data scientist is someone who is better at Statistics than any software engineer and better at Software Engineering than any statistician.
For beginners with some Math and Programming background, here are two books to get the right mix of skills between Statistics and Computer Sciences:
 Python for Data Analysis for a programming handson with Python
 Introduction to Machine Learning with Python: A Guide for Data Scientists for a concrete usage of Machine Learning
Useful microcourses for basic tools in a few hours: kaggle.com/learn
Programming skills in C#, R, Java (including Clojure) and even C/C++ for experts are bonuses.
Skills in Statistics and more importantly some French bon sens (common sense, discernment) seem mandatory before trying new fancy algorithms on realworld problems.
Nonnegative Weights Linear Models for Classification
22 December 2016
We suppose that data are normalized (by mean and variance or between 0 and 1). For certain applications, such as diagonal Mahalanobis Metric Learning, it is useful to have a linearmodelimplementation code for classification that has nonnegative weights. Without sign constraints on the weights, there exists several nice outoftheshelves implementations freely accessible in the web, for example:
 SGD SVM to play with data (C/C++ from Léon Bottou!)
 LaSVM for binary linear classification (C/C++ from Léon Bottou too!)
 LibSVM when the number of features exceeds the number of example (C/C++, Matlab, Python etc.)
 LibLINEAR when the number of examples exceeds the number of features (C/C++, Matlab, Python etc.)
 SGDClassifier from Scikitlearn when one wants special regularization schemes such as L1/L2 norms (Python)
 Vowpal Wabbit (several languages including Python and R)
but none of them authorize a builtin mechanism to impose sign constraints on the weights.
One solution could be to modify these implementations and do a Projected Gradient Descent in the primal by zeroing the negative weights at each step. But this process is inconvenient because the nice convergence properties are then lost.
In this post, I would like to suggest another way to benefit from existing linear models implementations without modifying their codes. In a previous post (Regularized Bias in (Stochastic) Gradient Descent) I showed how to handle the regularized bias by adding a supplementary constant feature equal to $1000$. To impose nonnegative weights, we gain augment the data set by all the $B$hot vectors possible with $B \gg 1$ (e.g. $B=1000$) but without bias. A $B$hot vector is a vector with zeros everywhere except for one bin which contains $B$. If we look at the SVM formulation, we have for weights $\mathbf{w}$, labels $y_i \in \{ 1,1 \}$ and examples $\mathbf{x}_i$
$$ \min_{\mathbf{w}} \frac{1}{2} \Vert \mathbf{w} \Vert^2_2$$ such that for each $i$ indexing the dataset $$y_i \mathbf{w}^\top \mathbf{x}_i \ge 1  \xi_i $$
If $\mathbf{x}_i$ is a $B$hot vector with a positive label ($y_i= 1$), then we get: $$\mathbf{w}_k \ge \frac{1 \xi_i}{B} \simeq 0$$ for each $k$ indexing the dimensions of $\mathbf{w}$. So, up to the slack variable $\xi_i$ which is most of the time zero or less than one, we have constrained the sign of $\mathbf{w}_k$. In practice, we can zero out the negative values of the libraries' output.
This slight modification of regular SVMs also works for logistic regression (because the hinge and logistic losses have almost the same behavior) and is easy for sparse representation as $B$hot vectors are ridiculously sparse.
Detection is not Classification
22 December 2015
There is a fondamental difference between Classification and Detection, let's say for the binary case. On the one hand, Classification distinguishes two welldefined classes and everything is symmetric. On the other hand, Detection aims to detect one class versus the rest of the world, so the problem is not symmetric. For example, getting a nice dataset for Classification is fairly easy, it is enough to sample points from the positive and the negative classes. But in the Detection case, it is very difficult to sample "enough" points from the negative cases because representing the rest of the world is fairly difficult. For example, in the latentSVM paper from Felzenszwalb et al for Object Detection there is a tedious Hard Negative Mining step to cope with this problem.
More recently, the Convolutional Neural Networks revolution started by the seminal work of Alex Krizhevsky's AlexNet, handles this problem by leveraging a huge number of classes (thousands) in a simulataneous OnevsRest fashion. As we speak, there is no explicit discriminative framework for Detection without Classification. Maybe the Oneclass SVM framework could do but I did not experiment with it.
Regularized Bias in (Stochastic) Gradient Descent
21 December 2015
It is useful to remind the reader that data should always be normalized, Ã la StandardScaler of scikitlearn (which substract the mean and divide by the standard deviation for each dimension of your data). In this context, the optimization papers almost never precise how to handle the Bias. In the excellent explanatory C++ code of Léon Bottou, we have several modes to regularize or not the Bias.
What does it mean exactly? Data Science practionners have often heard that the Bias is a detail and that it is enough to add a "1" dimension to the data. But in fact the Bias plays a crucial role during the optimization. For example, for Linear Support Vector Machines or Logistic Regression, the Bias sets the sign of the prediction driving the direction of the gradient step.
One way to handle the Bias is: add a "1000"dimension instead of a "1"dimension. This way, the true Bias found at the end of the optimization should be divided by 1000. In fact, the Bias will be regularized but just a little which is a tradeoff between an ease of optimization (no special case for the Bias) and having no regularization for the Bias (which is instable).
Miscellaneous
 Research System Administration of the MAP5 lab GPU servers for Deep Learning at Université Paris Descartes
Machine Learning and Python Installation  Good Music Playlist (with some French stuff too)
csv
Spotify  Failed attempt for a movie
Le petit Gauss