I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text.
The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has added advantages when doing online learning, because you can learn directly from the existing data without preprocessing it to create features (which would destroy the online property) and without an expensive hashtable lookup (which would slow things down).
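To make the trick concrete, here is a minimal sketch of a hashed online linear learner; the struct name, the use of std::hash, and the squared-loss update are illustrative assumptions, not VW's actual code.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the hashing trick: each token is mapped straight to an
// index in a fixed-size weight array, so no dictionary is ever built.
struct HashedLinearModel {
    std::vector<float> w;                       // weight array of size n
    explicit HashedLinearModel(std::size_t n) : w(n, 0.0f) {}

    std::size_t index(const std::string& token) const {
        return std::hash<std::string>{}(token) % w.size();
    }

    // Prediction: look up each token's slot and add it to a running total;
    // the feature vector is never materialized.
    float predict(const std::vector<std::string>& tokens) const {
        float score = 0.0f;
        for (const auto& t : tokens) score += w[index(t)];
        return score;
    }

    // One online gradient step for squared loss, learning rate eta.
    void update(const std::vector<std::string>& tokens, float label, float eta) {
        float err = label - predict(tokens);
        for (const auto& t : tokens) w[index(t)] += eta * err;
    }
};
```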
A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table.
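Spelling out the birthday arithmetic (standard reasoning, added here for completeness): with m features hashed uniformly into n slots,

$$\Pr[\text{some collision}] \;\le\; \binom{m}{2}\frac{1}{n} \;\approx\; \frac{m^2}{2n},$$

so collisions become likely once m grows past roughly n^0.5, and with m = 10^5 you are pushed toward n = 10^10 if you insist on no collisions at all.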
It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about 4m^3/n^2. Plugging in numbers, we see that this implies perhaps only n = 10^8 entries are required to avoid a collision. This number can be further reduced to 10^7 by increasing the degree of duplication to 4 or more.
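One way to recover the 4m^3/n^2 figure is a union bound (a reconstruction of the argument, not necessarily the exact calculation referred to above). With every word duplicated once there are about 2m entries in total, so a fixed word's two entries each land on some other word's slot with probability about 2m/n, and

$$\Pr[\text{some word has both entries collided}] \;\lesssim\; m\left(\frac{2m}{n}\right)^2 \;=\; \frac{4m^3}{n^2}.$$

With m = 10^5 and n = 10^8 this is about 0.4. The same pattern with duplication degree k gives roughly m(km/n)^k = k^k m^{k+1}/n^k, which is why pushing k to 4 or more lets n shrink toward 10^7.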
The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.
This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.
(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.
It sounds as if you’re using a Bloom filter or Bloomier filter. Is there a difference?
Bloom filters are typically used for set membership queries. The idea from Bloom filters of representing an element of a set by several randomly indexed entries is used here. The exact application is a bit different because set membership isn't the problem, so each entry is a float rather than a bit, and we rely on the learning algorithm to deal appropriately with conflicting collisions by reducing the weight to near 0.
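As a toy illustration of "several randomly indexed entries, each a float" (a sketch only; the salted std::hash calls stand in for whatever hash is actually used):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Each feature owns k salted slots in a float array. Prediction sums all k,
// so a slot whose value has been dragged toward 0 by conflicting updates
// contributes little, while the uncollided copies carry the signal.
struct DuplicatedHashedWeights {
    std::vector<float> w;
    std::size_t k;  // degree of duplication
    DuplicatedHashedWeights(std::size_t n, std::size_t degree) : w(n, 0.0f), k(degree) {}

    std::size_t slot(const std::string& feature, std::size_t i) const {
        // Salt the feature name with the copy index before hashing.
        return std::hash<std::string>{}(feature + "#" + std::to_string(i)) % w.size();
    }

    float value(const std::string& feature) const {
        float v = 0.0f;
        for (std::size_t i = 0; i < k; ++i) v += w[slot(feature, i)];
        return v;
    }

    void add(const std::string& feature, float delta) {
        for (std::size_t i = 0; i < k; ++i) w[slot(feature, i)] += delta;
    }
};
```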
Take a look at double hashing: http://en.wikipedia.org/wiki/Double_hashing
The short take is that you can generate n hashes using linear combinations of two original (independent) hashes. Your two hashes of the original string with different salts appended should be independent enough if you use a good hash function.
Anyway, appending these multiple hash functions together can get you to as low a collision rate as you could want.
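A sketch of that construction, assuming two salted calls to a decent string hash are independent enough (the salts and std::hash are placeholders):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Double hashing: derive as many hash functions as you like from two
// independent base hashes via h_i(x) = h1(x) + i * h2(x) (mod table size n).
inline std::uint64_t base_hash(const std::string& s, const std::string& salt) {
    return std::hash<std::string>{}(s + salt);  // salted stand-in for a real hash
}

inline std::size_t ith_hash(const std::string& s, std::size_t i, std::size_t n) {
    std::uint64_t h1 = base_hash(s, "#salt1");
    std::uint64_t h2 = base_hash(s, "#salt2") | 1;  // force h2 odd so i*h2 never degenerates to 0
    return static_cast<std::size_t>(h1 + i * h2) % n;
}
```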
You should also keep in mind that hash tables in Java are about as time efficient as arrays. This is because the collision rate is kept relatively low and because Java strings all have the hash pre-computed. This means that accessing a hash table consists (except rarely) of a modulo reduction, an array access, and a hash comparison to check that we pulled the right value. The resulting speed is quite impressive. Memory use is larger than with an array, but that is often no problem in this world of cheap memory.
My understanding is that these techniques would be substantially slower than the approach outlined above. It’s hard to beat the speed of a direct array access.
Hash tables involve several more operations than direct array access and are inherently slower. This might not affect a larger algorithm much if the hashing isn’t the bottleneck, but for linear-basis classifiers like John’s describing (e.g. SVMs, logistic regression), feature extraction is the slowest operation. By computing only hash codes, you don’t even need to represent the feature vector in memory; you just look up the basis vector value for the hash code and add it to the running total.
Hash codes for strings get computed once the first time they’re used (assuming single threading — they rely on atomic integer assignment in the Java memory model for synch, so may be computed multiple times with multiple threads).
Also, you can’t append hashes in Java — they’re only 32-bit integers.
John assumed random hashing in the body of his post, but it’s easy to see that with 400 objects (e.g. people) and a hash table of size 365 (e.g. birthdays), there’s going to be a collision, no matter how random the hash function is.
I’m curious what having two codes for a word will do to the classification accuracy. It will increase the possibility that two features have a collision on one of their two codes. It winds up being a kind of holographic representation; I’m guessing you might be able to more accurately reconstruct the underlying vector than with single codes per feature. But what’ll it do to the trained coefficients? You might collide on one code with one feature and another code with another feature.
In practice for natural language, feature collision tends not to be too big a problem because the features do tend to be redundant. We’re now working on the 2008 i2b2 Obesity Challenge data and there you have patient records with only a few discriminative words among thousands determining whether the patient has hypertension, obesity, diabetes, etc. It’s totally different than language ID, which is the most redundant classification problem we’ve come across, and for which generative models work very well.
Character n-grams (5-12 grams or so in English) tend to produce more accurate classifiers than tokens, and require much less human tuning. That increases the feature space and the overall redundancy of features.
it’s easy to see that with 400 objects (e.g. people) and a hash table of size 365 (e.g. birthdays), there’s going to be a collision, no matter how random the hash function is.
The math shows what happens when you have more objects than indices: the power on m is always greater than the power on n, so the collision probability blows up regardless of the degree of duplication.
Mark Dredze and I have a position paper at the Mobile NLP workshop at this year’s ACL with a set of experiments on 13 binary tasks from 4 domains with 4 learning algorithms. The upshot is that for NLP, if m=n (i.e. our hash function maps to the same number of buckets as we have features) we can save about 78% of the space without sacrificing more than a relative 2% performance. The paper is:
Kuzman Ganchev and Mark Dredze. Small Statistical Models by Random Feature Mixing. In Proceedings of the ACL-2008 Workshop on Mobile Language Processing. ACL, 2008.
Doing this faster: use a 64-bit hash (e.g. the one from Bob Jenkins, called lookup8 I believe), split it into two 32-bit hashes (assuming your table has fewer than 2^32 entries), and use those. (Or use two 32-bit hashes if you are on a 32-bit machine.)
Another method I have used is to split the 64-bit hash, use one 32-bit half to calculate the table position, and store the other 32-bit half in the table, comparing it when looking up entries. This performs very well when you do not want to store the actual keys in the table: it eliminates most/all false positives (a value returned when the key is actually not in the table) at an overhead of 4 bytes per value. It might not be as useful when storing just weights, though.
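A sketch of both ideas: one 64-bit hash is split so the low half picks the table position and the high half is kept as a 4-byte tag compared on lookup. The std::hash call is a stand-in for a real 64-bit hash such as Jenkins' lookup8, and a colliding key simply overwrites the slot in this toy version.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Split a 64-bit hash: low 32 bits choose the slot, high 32 bits are stored
// as a tag and checked on lookup, rejecting most false positives without
// ever storing the keys themselves.
struct TaggedTable {
    struct Slot { std::uint32_t tag = 0; float value = 0.0f; bool used = false; };
    std::vector<Slot> slots;
    explicit TaggedTable(std::size_t n) : slots(n) {}

    static std::uint64_t hash64(const std::string& key) {
        return std::hash<std::string>{}(key);  // stand-in for a real 64-bit hash
    }

    void put(const std::string& key, float value) {
        std::uint64_t h = hash64(key);
        Slot& s = slots[static_cast<std::uint32_t>(h) % slots.size()];
        s.tag = static_cast<std::uint32_t>(h >> 32);  // a colliding key overwrites here
        s.value = value;
        s.used = true;
    }

    std::optional<float> get(const std::string& key) const {
        std::uint64_t h = hash64(key);
        const Slot& s = slots[static_cast<std::uint32_t>(h) % slots.size()];
        if (!s.used || s.tag != static_cast<std::uint32_t>(h >> 32)) return std::nullopt;
        return s.value;
    }
};
```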
Re: hash vs array: For any sort of large data set, a substantial performance hit comes from cache misses due to random access, and in this case the difference between a hash and an array is smaller. A recent invention called d-ary cuckoo hashing allows hash tables to tolerate high load factors and reduces cache misses caused by probing/chaining. If you need the last ounce of performance, look into those. (Using those in Java might not be much of an advantage, though, because you need manual control over memory layout to get the best effect; C/C++ is king here, and .NET structs might do the trick.)