Hashing Representations for Machine Learning

I've discovered that there are many 'hashing tricks' in machine learning. Some of these are like the count-min sketch in that they rely on an explicit bloom filter style datastructure. The ones here take a radical conceptual step: they only use one hashing function. This turns out to work out remarkably well when learning, because the learning algorithm can learn to deal with collisions. In several applications, it allows radical compressions of the size of the learned model. The closest similar prior work we've found is in the CRM114 spam filter which generates several features for each n-gram of words.

This hashing trick is implemented in Vowpal Wabbit as the core representation, as well as Torch and Stream, part of Elefant. It is particularly useful for online learning algorithms consuming large amounts of data.

Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and SVN Vishwanathan, Hash Kernels for Structured Data, AISTAT 2009 and JMLR 2009.

This paper defines a 'hash kernel', shows that it can be handy for several problems, and makes an observation that hashing can work well asymptotically due to internal feature redundancy.

Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, Josh Attenberg, Feature Hashing for Large Scale Multitask Learning, ICML 2009 Warning: Theorem 3 in this paper appears to be false. We'll be updating the draft soon.

Here, we provide a large scale application of the hashing trick to mass personalized spam filtering, define an unbiased hash, and prove that the unbiased hash is length preserving when the magnitudes of individual elements are not large. This implies that hashing can be thought of as a sparsity preserving projection.

See also Small Statistical Models by Random Feature Mixing by Kuzman Ganchev and Mark Dredze which shows this approach can work for NLP applications.