Non-negative Weights Linear Models for Classification

22 December 2016

We suppose that the data are normalized (by mean and variance, or between 0 and 1). For certain applications, such as diagonal Mahalanobis metric learning, it is useful to have a linear-model implementation for classification with non-negative weights. Without sign constraints on the weights, there exist several nice off-the-shelf implementations freely available on the web, but none of them offers a built-in mechanism to impose sign constraints on the weights.

One solution could be to modify these implementations and perform projected gradient descent in the primal, zeroing the negative weights at each step. But this is inconvenient because the nice convergence guarantees of the original solvers are then lost.
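For reference, here is a minimal sketch of what such a projection step could look like, assuming a plain SGD loop on the L2-regularized hinge loss (the function name and hyper-parameters are illustrative, not taken from any particular library):

```python
import numpy as np

def projected_sgd_hinge(X, y, lam=1e-3, lr=0.01, epochs=10):
    """Illustrative projected SGD on the L2-regularized hinge loss:
    after every update, negative weights are zeroed (projection
    onto the non-negative orthant)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in np.random.permutation(n):
            w -= lr * lam * w                  # gradient of the L2 regularizer
            if y[i] * np.dot(w, X[i]) < 1:
                w += lr * y[i] * X[i]          # subgradient of the hinge loss
            np.maximum(w, 0.0, out=w)          # projection: zero out negative weights
    return w
```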

In this post, I would like to suggest another way to benefit from existing linear model implementations without modifying their code. In a previous post (Regularized Bias in (Stochastic) Gradient Descent) I showed how to handle the regularized bias by adding a supplementary constant feature equal to $1000$. To impose non-negative weights, we can augment the data set with all possible $B$-hot vectors, with $B \gg 1$ (e.g. $B=1000$) and without bias. A $B$-hot vector is a vector with zeros everywhere except for one bin, which contains $B$. If we look at the SVM formulation, we have, for weights $\mathbf{w}$, labels $y_i \in \{ -1,1 \}$, examples $\mathbf{x}_i$ and slack variables $\xi_i$,

$$ \min_{\mathbf{w}, \boldsymbol{\xi}} \; \frac{1}{2} \Vert \mathbf{w} \Vert^2_2 + C \sum_i \xi_i $$ such that for each $i$ indexing the dataset $$ y_i \mathbf{w}^\top \mathbf{x}_i \ge 1 - \xi_i, \qquad \xi_i \ge 0. $$

If $\mathbf{x}_i$ is a $B$-hot vector with a positive label ($y_i = 1$) and its non-zero bin at dimension $k$, then the corresponding constraint reads $$\mathbf{w}_k \ge \frac{1- \xi_i}{B} \simeq 0,$$ and adding one such vector per dimension gives this constraint for each $k$ indexing the dimensions of $\mathbf{w}$. So, up to the slack variable $\xi_i$, which is most of the time zero or less than one, we have constrained the sign of $\mathbf{w}_k$. In practice, we can zero out any remaining negative values in the library's output.
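Concretely, here is a minimal sketch of the augmentation trick, assuming scikit-learn's LinearSVC as the off-the-shelf solver (the helper name and the default values of $B$ and $C$ are my own choices):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_nonnegative_svm(X, y, B=1000.0, C=1.0):
    """Append one B-hot example per dimension, each labeled +1, so that the
    margin constraints force every weight to be (approximately) non-negative."""
    d = X.shape[1]
    X_aug = np.vstack([X, B * np.eye(d)])       # the d B-hot vectors
    y_aug = np.concatenate([y, np.ones(d)])     # all labeled +1
    clf = LinearSVC(C=C, fit_intercept=False)   # no bias, as discussed above
    clf.fit(X_aug, y_aug)
    return np.maximum(clf.coef_.ravel(), 0.0)   # zero out residual negative weights
```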

This slight modification of regular SVMs also works for logistic regression (the hinge and logistic losses behave very similarly), and it is cheap with sparse representations since $B$-hot vectors are extremely sparse.
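In the sparse case, the augmented rows are just a scaled sparse identity matrix, so the extra memory cost is negligible. A sketch with scipy.sparse and scikit-learn's LogisticRegression, again with illustrative names and defaults:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

def fit_nonnegative_logreg_sparse(X, y, B=1000.0, C=1.0):
    """Same trick with a sparse design matrix: the d B-hot vectors are a
    scaled sparse identity, so the augmentation costs almost nothing."""
    d = X.shape[1]
    X_aug = sp.vstack([sp.csr_matrix(X), B * sp.identity(d, format="csr")])
    y_aug = np.concatenate([y, np.ones(d)])
    clf = LogisticRegression(C=C, fit_intercept=False)
    clf.fit(X_aug, y_aug)
    return np.maximum(clf.coef_.ravel(), 0.0)   # clip residual negatives
```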