Eliminating the Birthday Paradox for Universal Features

I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text.

The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down).

A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n0.5, essentially because there are m2 pairs. This is pretty bad, because it says that with a vocabulary of 105 features, you might need to have 1010 entries in your table.

It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about 4m3/n2. Plugging in numbers, we see that this implies perhaps only n=108 entries are required to avoid a collision. This number can be further reduced to 107 by increasing the degree of duplication to 4 or more.

The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.

This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.

(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.

Taking the next step

At the last ICML, Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now.

The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has setup a wiki, COLT has started using Joomla, with some dynamic content, and AAAI has been setting up a “student blog“. Similarly, Dinoj Surendran setup a twiki for the Chicago Machine Learning Summer School, which was quite useful for coordinating events and other things.

I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers page is basically only useful via grep. A much more human-readable version of the page would organize the papers by topic. If the page wiki-editable, this would almost happen automatically. Adding the ability for people to comment on the papers might make the website more useful beyond the time of the conference itself.

There are several aspects of an experiment which seem intuitively important to me. I found the wikipatterns site a helpful distillation of many of these intuitions. Here are various concerns I have:

  1. Mandate An official mandate is a must-have. Any such enhancement needs to be an official part of the website, or the hesitation to participate will probably be too much.
  2. Permissive Comments Allowing anyone to comment on a website is somewhat scary to academics, because we are used to peer-reviewing papers before publishing. Nevertheless, it seems important to not strongly filter comments, because:
    1. The added (human) work of filtering is burdensome.
    2. The delay introduced acts as a barrier to participation.

    The policy I’ve followed on hunch.net is allowing comments from anyone exhibiting evidence of intelligence—i.e. filtering essentially only robots. This worked as well I hoped, and not as badly as I feared.

  3. Spam Spam is a serious issue for dynamic websites, because it adds substantially to the maintenance load. There are basically two tacks to take here:
    1. Issue a userid/passwd to every conference registrant (and maybe others that request it), the just allow comments from them.
    2. Allow comments from anyone, but use automated filters. I’ve been using Akismet, but recaptcha is also cool.

    I favor the second approach, because it’s more permissive, and it makes participation easier. However, it may increase the maintenance workload.

  4. Someone Someone to shepard the experiment is needed. I’m personally overloaded with other things at the moment (witness the slow post rate), so I don’t have significant time to devote. Nevertheless, I’m sure there are many people in the community with as good a familiarity with the internet and web applications as myself.
  5. Software Choice I don’t have strong preferences for the precise choice of software, but some guidelines seem good.
    1. Open Source I have a strong preference for open source solutions, of which there appear to be several reasonable choices. The reason is that open source applications leave you free (or at least freer) to switch and change things, which seems essential when experimenting.
    2. Large User base When going with an open source solution, something with a large user base is likely to have fewer rough edges.

    I have some preference for systems using flat files for datastorage rather than a database because they are easier to maintain or (if necessary) operate on. This is partly due to a bad experience I had with the twiki setup for MLSS—basically an attempt to transfer data to an upgraded mysql failed because of schema issues I failed to resolve.

    I’m sure there are many with more experience using wiki and comment systems—perhaps they can comment on exact software choices. Wikimatrix seems to provide frighteningly detailed comparisons of different wiki software.

The Science 2.0 article

I found the article about science using modern tools interesting, especially the part about ‘blogophobia’, which in my experience is often a substantial issue: many potential guest posters aren’t quite ready, because of the fear of a permanent public mistake, because it is particularly hard to write about the unknown (the essence of research), and because the system for public credit doesn’t yet really handle blog posts.

So far, science has been relatively resistant to discussing research on blogs. Some things need to change to get there. Public tolerance of the occasional mistake is essential, as is a willingness to cite (and credit) blogs as freely as papers.

I’ve often run into another reason for holding back myself: I don’t want to overtalk my own research. Nevertheless, I’m slowly changing to the opinion that I’m holding back too much: the real power of a blog in research is that it can be used to confer with many people, and that just makes research work better.

Blog compromised

Iain noticed that hunch.net had zero width divs hiding spammy URLs. Some investigation reveals that the wordpress version being used (2.0.3) had security flaws. I’ve upgraded to the latest, rotated passwords, and removed the spammy URLs. I don’t believe any content was lost. You can check your own and other sites for a similar problem by greping for “width:0” or “width: 0” in the delivered html source.

It Doesn’t Stop

I’ve enjoyed the Terminator movies and show. Neglecting the whacky aspects (time travel and associated paradoxes), there is an enduring topic of discussion: how do people deal with intelligent machines (and vice versa)?

In Terminator-land, the primary method for dealing with intelligent machines is to prevent them from being made. This approach works pretty badly, because a new angle on building an intelligent machine keeps coming up. This is partly a ploy for writer’s to avoid writing themselves out of a job, but there is a fundamental truth to it as well: preventing progress in research is hard.

The United States, has been experimenting with trying to stop research on stem cells. It hasn’t worked very well—the net effect has been retarding research programs a bit, and exporting some research to other countries. Another less recent example was encryption technology, for which the United States generally did not encourage early public research and even discouraged as a munition. This slowed the development of encryption tools, but I now routinely use tools such as ssh and GPG.

Although the strategy of preventing research may be doomed, it does bring up a Bill Joy type of question: should we actively chose to do research in a field where knowledge can be used to great harm? As an example, the Terminator series illustrates the dark fears of AI gone bad. Many researchers avoid this question by not thinking about it, but this is a substantial question of concern to society at large, and whether or not society supports a direction of research.

My answer is “yes, we should do research”. The reason is simple: I believe that good AI is the best chance of the survival of civilization. This might seem like a leap, but considering the following.

  1. Civilization is not stable. Anyone who believes otherwise needs to try to smell the 1908. Just a lifetime ago, humans could barely fly and computers were people. These radical changes in the abilities of a civilization are strong evidence against stability. Further evidence of instabilities come from long term world changing trends such as greenhouse gas accumulation and population graphs.
  2. Instability is bad in the long run. There are quite a number of doomsday-for-civilization scenarios kicking around—nuclear, plague, grey goo, black holes, etc… Many people find doomsday scenarios triggered by malevolence or accident to be unconvincing, since doomsday claims are so commonly debunked (remember the Y2K computer bug armageddon?). I am naturally skeptical myself, but it only takes one. In the next 10000 years, the odds of something going wrong seem fair.
  3. … for a closed system. There is one really good caveat to instability, which is redundancy. Perhaps if we Earthlings screwup, our descendendents on Alpha Centauri can come pick up the pieces. The fundamental driver here is light speed latency: if it takes years for two groups to communicate, then it is unlikely they’ll manage to coordinate (with malevolence or accident) a simultaneous doomsday.
  4. But real space travel requires AI. Getting from one star system to another with known physics turns out to be very hard. The best approaches I know involve giant lasers and multiple solar sails or fusion powered rockets, taking many years. Merely getting there, of course, is not enough—we need to get there with a kernel of civilization, capable of growing anew in the new system. Any adjacent star system may not have an earth-like planet implying the need to support a space-based civilization. Since travel between star systems is so prohibitively difficult, a basic question is: how small can we make a kernel of civilization? Many science fiction writers and readers think of generation ships, which would necessarily be enormous to support the air, food, and water requirements of a self-sustaining human population. A much simpler and easier solution comes with AI. A good design might mass 103 kilograms or so and be designed to first land on an asteroid, then mine it, first creating a large solar cell array, and replicas to seed other asteroids. Eventually, hallowed out asteroids could support human life if the requisite materials (Oxygen, Carbon, Hydrogen, etc..) are found. The fundamental observation here is that intelligence and knowledge require very little mass.

I hope we manage to crack AI, opening the door to real space travel, so that civilization doesn’t stop.