Scientific Communications

An Introduction to Neural Networks
Warith HARCHAOUI
Young Statisticians and Probabilists, Institut Henri Poincaré
2018 
Deep Adversarial Gaussian Mixture AutoEncoder for Clustering
Warith HARCHAOUI, PierreAlexandre MATTEI and Charles BOUVEYRON
StatLearn Workshop, Université Lumière
2017
Slides, Poster  Distributed Garbage Collection in the NGrid Project
Joannès VERMOREL and Warith HARCHAOUI
NGrid is an open source (LGPL) grid computing framework written in C#.
Center for Computational Biology, MINES ParisTech
2005
Who Am I?
I am a French Data Scientist at Oscaro.com and a Machine Learning Ph.D. student under the supervision of Pr. Charles BOUVEYRON in the MAP5 lab at Université Paris Descartes.
CV/Resume
Academic Books for Machine Learning
21 October 2017
In random order:
 Pattern Recognition and Machine Learning (a.k.a PRML)
 Machine Learning: a Probabilistic Perspective
 Deep Learning
 Bayesian Reasoning and Machine Learning (a.k.a BRML)
 The Elements of Statistical Learning
 Artificial Intelligence: A Modern Approach
 Introduction to Algorithms
 Convex Optimization
 Algorithms
 Reinforcement Learning: An Introduction
An Introduction to Neural Networks
2 March 2017
In the MAP5 lab at Université Paris Descartes, there is a reading group for Ph.D. students and scientists from everywhere around Paris. In this context, I decided to prepare a talk named An Introduction to Neural Networks. You can find the slides here. This is about a wide variety of Neural Networks from one Neuron to Adversarial Neural Networks.
What is a Data Scientist?
10 February 2017
A data scientist is someone who is better at Statistics than any software engineer and better at Software Engineering than any statistician.
People often ask me about my job: Data Scientist. I humbly believe I know what a Data Scientist should know although I do not know everything about it. ;)
This post is a compilation of my favorite academic books on that subject:
 Pattern Recognition and Machine Learning (a.k.a PRML)
 Machine Learning: a Probabilistic Perspective
 Deep Learning
Here is a tutorials website which is fascinating:
Finally, you should know how to code in Python including Scikitlearn and Pandas which are well described in Python for Data Analysis (this is a great primer if you are a beginner in the field) and perhaps also R and Java and for many companies C# in practice.
Clustering Boosting
3 February 2017
Clustering is one of the most fundamental tasks in machine learning. This task consists in grouping similar objects together without any supervision, meaning there is no labels to guide this grouping. The goal is to find structure in data to facilitate further analysis. It is a desirable goal in data analysis, visualization and is often a preliminary step in many algorithms.
This post is a kind of comment of this article An Ensemble Method for Clustering.
In this post, I am agnostic to the favorite clustering algorithm you have chosen. Here I briefly enumerate some famous algorithms and associated articles I like very much:
 Hierarchical Clustering: Hierarchical Grouping to Optimize an Objective Function
 KMeans: 50 years beyond Kmeans
 Spectral Clustering: SelfTuning Spectral Clustering
 ...
 and variants...
All these algorithms are readytouse in Python in Scikitlearn. The common point of all clustering algorithms is the output that could be:
 a hard clustering: represented by a sequence of integer corresponding to the identifier (id) of each cluster.
 a soft clustering: represented by a probability table where each line corresponds to each element and each column corresponds to a probability of belonging to each clustering class (each line sums to one).
One can remark that hard clustering is a special case of soft clustering. Indeed, a soft clustering version of a hard one can be computed through a onehotvector representation. So, without loss of generality I can assume I have a soft clustering output.
Suppose you have a dataset of $N$ points to cluster into $K$ groups and you are given $M$ soft clustering outputs (from $M$ random initializations for example), the aim of this post is to find a way to aggregate these outputs to produce one final soft clustering output. In terms of optimization, one can write: $$\min_{p,\sigma} \frac{1}{NM} \sum_{m=1}^M \sum_{i=1}^N \Vert p(i)  C_m(\sigma_m(i)) \Vert_2^2$$
where:
 $p$ is the final soft clustering output ($p(i)_k$ is the probability of the $i^\text{th}$ point of belonging to the $k^\text{th}$ class )
 $C_m$ is one of the soft clustering input
 $\sigma_m$ is a permutation function over the indices between $1$ and $K$, solving the switching problem of the clustering table $C_m$ compared to $p$
The above formula is optimized greedily and the strategy is a heuristic which means that there is no way to guarantee the optimal solution yet. Unlike in Supervised Classification, Unsupervised Classification (a.k.a Clustering), the classes are anonymous. It means that if we permute the columns of a soft clustering, the clusterings (the original one and the permuted one) are still the same  they represent the same partition of the points but with different indices.
This socalled switching problem is here handled by the $\sigma$ functions. Indeed, if we did not have that switching problem, the solution would simply the mean of the different clustering probability tables. Fortunately, the Hungarian Method can solve this type of linear assignment problem based on the cooccurences matrix of the clustering probability tables.
To sum up, if one has 2 soft clustering, then, first one solve the switching problem between the classes' indices, second accumulate the mean probability table follwing this formula: $$p^{m+1} = p^{m} + \frac{1}{m+1} \left( C_m(\sigma_m)  p^{m} \right)$$ for $k$ from $1$ to $M$. Finally, the intuition behind this algorithm is that the most stable and hard clustering indices among the points will remain in the final output while limiting the impact of uncertain region. Here the notion of certainty could be measured with the entropy of each line of the probability tables.
Nonnegative Weights Linear Models for Classification
22 December 2016
We suppose that data are normalized (by mean and variance or between 0 and 1). For certain applications, such as diagonal Mahalanobis Metric Learning, it is useful to have a linearmodelimplementation code for classification that has nonnegative weights. Without sign constraints on the weights, there exists several nice outoftheshelves implementations freely accessible in the web, for example:
 SGD SVM to play with data (C/C++ from Léon Bottou!)
 LaSVM for binary linear classification (C/C++ from Léon Bottou too!)
 LibSVM when the number of features exceeds the number of example (C/C++, Matlab, Python etc.)
 LibLINEAR when the number of examples exceeds the number of features (C/C++, Matlab, Python etc.)
 SGDClassifier from Scikitlearn when one wants special regularization schemes such as L1/L2 norms (Python)
 Vowpal Wabbit (several languages including Python and R)
but none of them authorize a builtin mechanism to impose sign constraints on the weights.
One solution could be to modify these implementations and do a Projected Gradient Descent in the primal by zeroing the negative weights at each step. But this process is inconvenient because the nice convergence properties are then lost.
In this post, I would like to suggest another way to benefit from existing linear models implementations without modifying their codes. In a previous post (Regularized Bias in (Stochastic) Gradient Descent) I showed how to handle the regularized bias by adding a supplementary constant feature equal to $1000$. To impose nonnegative weights, we gain augment the data set by all the $B$hot vectors possible with $B >> 1$ (e.g. $B=1000$) but without bias. A $B$hot vector is a vector with zeros everywhere except for one bin which contains $B$. If we look at the SVM formulation, we have for weights $\mathbf{w}$, labels $y_i \in \{ 1,1 \}$ and examples $\mathbf{x}_i$
$$ \min_{\mathbf{w}} \frac{1}{2} \Vert \mathbf{w} \Vert^2_2$$ such that for each $i$ indexing the dataset $$y_i \mathbf{w}^\top \mathbf{x}_i \ge 1  \xi_i $$
If $\mathbf{x}_i$ is a $B$hot vector with a positive label ($y_i= 1$), then we get: $$\mathbf{w}_k \ge \frac{1 \xi_i}{B} \simeq 0$$ for each $k$ indexing the dimensions of $\mathbf{w}$. So, up to the slack variable $\xi_i$ which is most of the time zero or less than one, we have constrained the sign of $\mathbf{w}_k$. In practice, we can zero out the negative values of the libraries' output.
This slight modification of regular SVMs also works for logistic regression (because the hinge and logistic losses have almost the same behavior) and is easy for sparse representation as $B$hot vectors are ridiculously sparse.
Detection is not Classification
22 December 2015
There is a fondamental difference between Classification and Detection, let's say for the binary case. On the one hand, Classification distinguishes two welldefined classes and everything is symmetric. On the other hand, Detection aims to detect one class versus the rest of the world, so the problem is not symmetric. For example, getting a nice dataset for Classification is fairly easy, it is enough to sample points from the positive and the negative classes. But in the Detection case, it is very difficult to sample "enough" points from the negative cases because representing the rest of the world is fairly difficult. For example, in the latentSVM paper from Felzenszwalb et al for Object Detection there is a tedious Hard Negative Mining step to cope with this problem.
More recently, the Convolutional Neural Networks revolution started by the seminal work of Alex Krizhevsky's AlexNet, handles this problem by leveraging a huge number of classes (thousands) in a simulataneous OnevsRest fashion. As we speak, there is no explicit discriminative framework for Detection without Classification. Maybe the Oneclass SVM framework could do but I did not experiment with it.
Regularized Bias in (Stochastic) Gradient Descent
21 December 2015
It is useful to remind the reader that data should always be normalized, à la StandardScaler of scikitlearn (which substract the mean and divide by the standard deviation for each dimension of your data). In this context, the optimization papers almost never precise how to handle the Bias. In the excellent explanatory C++ code of Léon Bottou, we have several modes to regularize or not the Bias.
What does it mean exactly? Data Science practionners have often heard that the Bias is a detail and that it is enough to add a "1" dimension to the data. But in fact the Bias plays a crucial role during the optimization. For example, for Linear Support Vector Machines or Logistic Regression, the Bias sets the sign of the prediction driving the direction of the gradient step.
One way to handle the Bias is: add a "1000"dimension instead of a "1"dimension. This way, the true Bias found at the end of the optimization should be divided by 1000. In fact, the Bias will be regularized but just a little which is a tradeoff between an ease of optimization (no special case for the Bias) and having no regularization for the Bias (which is instable).