Small Data

6 April 2019

In 2012, I was privileged to attend "Divide-and-Conquer and Statistical Inference in Big Data", a talk given in French by Michael I. Jordan, a prominent figure in Machine Learning. That visionary talk began by establishing what can still be considered the crux of Data Science today, and all the more so at a time when neither Hadoop nor Spark had been popularized yet:

Let $N$ be the number of elements in the database (its cardinality) and $D$ the dimensionality of each element.

  • Big Data regime, $\frac{N}{D} \gg 1$: the developer's nightmare, mostly because the data may not fit in memory; but if the computations are feasible, then statistical theory gives good reasons for the systems to work very well;

  • Small Data regime, $\frac{N}{D} \ll 1$ or $\frac{N}{D} \simeq 1$: the statistician's nightmare, mostly because of the mathematical curse of dimensionality; the computations, however, are easier than in the previous case.

NB: Michael I. Jordan did not put it exactly that way; this is my own simplification.
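
As a back-of-the-envelope illustration of the two regimes above, here is a minimal Python sketch; the example datasets and their sizes are ballpark figures of my own, not from the talk:

```python
# Rough illustration of the Big Data / Small Data regimes.
# The datasets and their sizes are ballpark figures, not measured values.

def regime(n: float, d: float) -> str:
    """Classify a dataset by its N/D ratio (orders of magnitude only)."""
    ratio = n / d
    if ratio >= 100:            # N/D >> 1
        return "Big Data regime"
    return "Small Data regime"  # N/D <= 1 or N/D ~ 1

examples = {
    # name: (N = number of elements, D = dimensionality of each element)
    "click logs (tabular)": (1e9, 1e2),
    "ImageNet-like images": (1e6, 1e6),
    "human genomes":        (1e9, 1e12),
}

for name, (n, d) in examples.items():
    print(f"{name:25s} N/D = {n / d:8.0e} -> {regime(n, d)}")
```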

Since then, data storage problems have been nicely solved, at least three times over, thanks to decreasing hardware costs, faster data access, and software solutions like HDFS and Spark (distributed file systems and in-memory systems, respectively) on the developer's side, but the curse of dimensionality remains on the statistician's side. Computations have also improved, but not sufficiently energy-wise (which is beyond the scope of this blog post).

Enforcing data structure seems to be an efficient way to cope with the curse of dimensionality. For example, looking at the huge progress made on images from the 1990s to the 2010s, methods based on hierarchical convolutions have been the key to unlocking good classification results in the historical ImageNet competition ($\sim 10^6$ images with $\sim 10^6$ pixels each, hence $\frac{N}{D} \simeq 1$, which corresponds to a Small Data regime). Likewise, word ordering turns out to be the key to Natural Language Processing representations. In Genetics, by contrast, despite worldwide research attention, Machine Learning remains rather ineffective so far: one human being has $D \simeq 10^{12}$ nucleotides while databases are limited by the worldwide population ($N \sim 10^9$), thus $\frac{N}{D} \ll 1$, which corresponds to a Small Data regime once again. We have not yet leveraged any desirable but unknown structure among those nucleotides to reach an easier Big Data regime.
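
To make the curse of dimensionality a bit more concrete, here is a small self-contained numpy experiment of my own (a classical demonstration, not something from the talk): with $N$ fixed, as $D$ grows, pairwise distances between random points concentrate, so the nearest and farthest neighbours of a query point become nearly indistinguishable and distance-based methods degrade.

```python
# Distance concentration: a classical symptom of the curse of dimensionality.
# With N fixed, as D grows the nearest and farthest neighbours of a query
# point become nearly indistinguishable.
import numpy as np

rng = np.random.default_rng(0)
N = 1000  # number of points, kept fixed

for D in (2, 10, 100, 10_000):
    points = rng.random((N, D))   # N points drawn uniformly in [0, 1]^D
    query = rng.random(D)         # one query point in the same cube
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"D = {D:6d}  relative contrast = {contrast:.3f}")
```

The printed contrast shrinks as $D$ grows, which is exactly what makes naive, structure-free learning so hard in the Small Data regime.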

The take-home message of this post is: enforce structure on your data to cope with the curse of dimensionality.

Indeed, structure translates data knowledge and human expertise into an understandable statistical language that is more suitable for a computer. By the way, if $N$ is not large ($\sim 10^1$ or $\sim 10^2$), statistical estimators do not have much relevance, although people like Sylvain ARLOT try to build a mathematical theory, and hence new algorithms, for the Small Data regime; it is hard.
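
To see why small $N$ hurts estimators, here is a quick simulation of my own (not tied to any particular theory mentioned above): the standard error of the sample mean shrinks like $\sigma / \sqrt{N}$, so with $N \sim 10^1$ or $10^2$ a single estimate barely constrains the true value.

```python
# Why small N hurts: the spread of the sample mean shrinks like sigma / sqrt(N),
# so with N ~ 10 or 100 a single estimate says little about the true value.
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n_repeats = 0.0, 1.0, 1_000

for N in (10, 100, 1_000, 10_000):
    # Draw n_repeats independent datasets of size N and look at the spread
    # of the resulting sample means around the true mean.
    means = rng.normal(true_mean, sigma, size=(n_repeats, N)).mean(axis=1)
    print(f"N = {N:6d}  std of the sample mean = {means.std():.4f}"
          f"  (theory: {sigma / np.sqrt(N):.4f})")
```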