The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the raw data without a preprocessing pass to create features (which destroys the online property) or an expensive hashtable lookup (which slows things down).
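To make this concrete, here is a minimal Python sketch of the idea, not Vowpal Wabbit's actual implementation: each token is hashed straight into a fixed-size weight array and an online squared-loss update is applied, so no feature dictionary is ever built. The table size and the use of md5 as a stand-in for a random hash are choices made purely for illustration.

```python
import hashlib

N = 2 ** 20  # size of the index/weight space (illustrative choice)

def feature_index(token: str) -> int:
    # Any hash works; md5 stands in for a random hash function here.
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % N

def predict(weights, tokens):
    return sum(weights[feature_index(t)] for t in tokens)

def sgd_update(weights, tokens, label, rate=0.1):
    # One online gradient step on squared loss over the hashed features.
    err = predict(weights, tokens) - label
    for t in tokens:
        weights[feature_index(t)] -= rate * err

weights = [0.0] * N
sgd_update(weights, "the quick brown fox".split(), label=1.0)
```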
A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table.
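As a quick sanity check on that estimate, here is a small simulation (with m, n, and the trial count chosen only for illustration) that draws m random indices in a table of size n and counts how often any pair collides.

```python
import random

def collision_prob(m, n, trials=1000):
    # Fraction of trials in which at least two of the m indices coincide.
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(m):
            idx = random.randrange(n)
            if idx in seen:
                hits += 1
                break
            seen.add(idx)
    return hits / trials

# With m = 10**3 and n = 10**6 (so m = n**0.5), collisions are already common.
print(collision_prob(10**3, 10**6))
```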
It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about 4m^3/n^2. Plugging in numbers, we see that this implies perhaps only n = 10^8 entries are required to avoid a collision. This number can be further reduced to 10^7 by increasing the degree of duplication to 4 or more.
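Here is a rough simulation of that calculation (again with m, n, and the trial count picked just for illustration): each word gets two random indices, and a word counts as "lost" only if both of its copies land on indices also used by other words. The measured frequency should roughly track the 4m^3/n^2 estimate above.

```python
import random
from collections import Counter

def prob_some_word_lost(m, n, trials=200):
    """Estimate P(some word has both of its hashed copies colliding with other words)."""
    lost_runs = 0
    for _ in range(trials):
        pairs = [(random.randrange(n), random.randrange(n)) for _ in range(m)]
        counts = Counter(i for pair in pairs for i in pair)
        for a, b in pairs:
            own = 2 if a == b else 1  # this word's own copies occupying the slot
            if counts[a] > own and counts[b] > own:
                lost_runs += 1
                break
    return lost_runs / trials

# With m = 1000 and n = 10**5 the estimate 4*m**3/n**2 evaluates to 0.4,
# which should be in the same ballpark as the simulated frequency.
print(prob_some_word_lost(1000, 10**5))
```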
The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.
This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.
(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.