The Minimum Sample Complexity of Importance Weighting

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for Monte Carlo integration. Consider the problem of finding:
N = E_{x ~ D} [f(x)]
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
E_{x ~ Q} [f(x) D(x) / Q(x)]

A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special choice, Q(x) = f(x) D(x) / N (valid when f is nonnegative), for which the sample complexity turns out to be 1: every sample then has importance-weighted value f(x) D(x) / Q(x) = N exactly. This is typically substantially better than the sample complexity of the original problem.
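To make this concrete, here is a minimal sketch (a toy example of mine, not from the original argument) of importance-weighted estimation on a small discrete problem; with the ideal choice Q(x) = f(x) D(x) / N, a single sample already recovers N exactly:

```python
# A minimal sketch comparing importance weighting from a generic Q
# with the "ideal" Q(x) = f(x) D(x) / N. Toy discrete example only.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: D is a distribution over {0,...,9}, f is nonnegative.
D = np.array([0.02, 0.03, 0.05, 0.10, 0.15, 0.15, 0.20, 0.15, 0.10, 0.05])
f = np.array([0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 4.0, 2.0, 1.0, 0.5])
N_true = np.sum(D * f)  # the quantity we want: E_{x ~ D} f(x)

def iw_estimate(Q, n_samples):
    """Estimate E_{x ~ D} f(x) from n_samples draws from Q."""
    x = rng.choice(len(Q), size=n_samples, p=Q)
    return np.mean(f[x] * D[x] / Q[x])

Q_uniform = np.full(10, 0.1)   # a generic proposal distribution
Q_ideal = f * D / N_true       # the zero-variance proposal Q(x) = f(x) D(x) / N

print("true N:            ", N_true)
print("uniform Q, n = 100:", iw_estimate(Q_uniform, 100))
print("ideal Q,   n = 1:  ", iw_estimate(Q_ideal, 1))  # exact from a single sample
```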

This observation underlies the motivation for many importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).

Inappropriate Mathematics for Machine Learning

Reviewers and students are sometimes greatly concerned by the distinction between:

  1. An open set and a closed set.
  2. A supremum and a maximum.
  3. An event which happens with probability 1 and an event that always happens.

I don’t appreciate these distinctions in machine learning & learning theory. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.

The fundamental issue here is that substantial parts of mathematics don’t appear well-matched to computation in the physical world, because the mathematics has concerns which are unphysical. This mismatched mathematics makes irrelevant distinctions. We can ask “what mathematics is appropriate to computation?” Andrej has convinced me that a pretty good answer to this question is constructive mathematics.

So, here’s a basic challenge: Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?

Concerns about the Large Scale Learning Challenge

The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured.

From the instructions page, several issues come up:

  1. Large Definition My personal definition of dataset size is:
    1. small A dataset is small if a human could look at the dataset and plausibly find a good solution.
    2. medium A dataset is medium-sized if it fits in the RAM of a reasonably priced computer.
    3. large A large dataset does not fit in the RAM of a reasonably priced computer.

    By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that.

    The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require random access to examples.

  2. No Loading Time The medium scale nature of the datasets is tacitly acknowledged in the rules, which exclude data loading time. My experience is that parsing and loading large datasets is often the computational bottleneck. For example, when comparing Vowpal Wabbit to SGD, I used wall-clock time, which makes SGD look a factor of 40 or so worse than Leon’s numbers, which count only training time after loading. This timing difference is entirely due to the overhead of parsing, even though the format parsed is a carefully optimized binary language. (No ‘excluding loading time’ number can be found for VW, of course, because loading and learning are intertwined.)
  3. Optimal Parameter Time The rules specify that the algorithm should be timed with optimal parameters. It’s very common for learning algorithms to have a few parameters controlling learning rate or regularization. However, no constraints are placed on the number or meaning of these parameters. As an extreme form of abuse, for example, your initial classifier could be declared a parameter. With an appropriate choice of this initial parameter (which you can freely optimize on the data), training time is zero.
  4. Parallelism One approach to dealing with large amounts of data is to add computers that operate in parallel. This is very natural (the brain is vastly parallel at the neuron level), and there are substantial research questions in parallel machine learning. Nevertheless it doesn’t appear to be supported by the contest. There are good reasons for this: parallel architectures aren’t very standard yet, and buying multiple computers is still substantially more expensive than buying the RAM to fit the dataset sizes. Nevertheless, it’s disappointing to exclude such a natural avenue. The rules even appear unclear on whether or not the final test run is on an SMP machine.

As a consequence of this design, the contest prefers algorithms that load all data into memory then operate on it. It also essentially excludes parallel algorithms. These design decisions discourage large scale algorithms (where large is as defined above) in favor of medium scale learning algorithms. The design also favors highly parameterized learning algorithms over less parameterized algorithms, which is the opposite of my personal preference for research direction.

Many of these issues could be eliminated or at least partially addressed. Limiting the parameter size to ‘20 characters on the command line’ or in some other reasonable way seems essential. It’s probably too late to get large datasets, but using wall-clock time would at least avoid bias against large scale algorithms. If the final evaluation is going to take place on an SMP machine, at least detailing that would be helpful.

Despite these concerns, it’s important to be clear that this is an interesting contest. Even without any rule changes, its outcome tells us something about which sorts of algorithms work at a medium scale. That’s good information to know if you are interested in tackling larger scale algorithms. The datasets are also large enough to break every Theta(m^2) algorithm. We should also respect the organizers: setting up any contest of this sort is quite a bit of work that’s difficult to nail down perfectly in advance.

Update: Soeren has helped set up an SMP parallel track which addresses some of the concerns above. See the site for details, and see you there.

Watchword: Supervised Learning

I recently discovered that supervised learning is a controversial term. The two definitions are:

  1. Known Loss Supervised learning corresponds to the situation where you have unlabeled examples plus knowledge of the loss of each possible predicted choice. This is the definition I’m familiar and comfortable with. One reason to prefer this definition is that the analyses of sample complexity for this class of learning problems are all pretty similar.
  2. Any kind of signal Supervised learning corresponds to the situation where you have unlabeled examples plus any source of side information about what the right choice is. This notion of supervised learning seems to subsume reinforcement learning, which makes me uncomfortable, because it means there are two terms for the same class of problems. It also means there isn’t a convenient term left for the first definition.

Reviews suggest there are people who are dedicated to the second definition out there, so it can be important to discriminate which you mean.

Eliminating the Birthday Paradox for Universal Features

I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text.

The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down).
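As a rough illustration (my own sketch, not the actual Vowpal Wabbit code), the trick amounts to something like the following: each token is mapped directly to an index in a fixed-size weight array, so no feature dictionary ever needs to be built.

```python
# A rough sketch of the hashing trick (illustration only, not the actual
# Vowpal Wabbit implementation). Tokens are hashed straight to indices in a
# fixed-size weight array; the table size here is a hypothetical choice.
import hashlib

NUM_BITS = 18                 # table of 2^18 weights
TABLE_SIZE = 1 << NUM_BITS

def feature_index(token: str) -> int:
    """Hash a token directly to a weight index."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % TABLE_SIZE

weights = [0.0] * TABLE_SIZE

def predict(tokens):
    """Linear prediction over hashed (binary) text features."""
    return sum(weights[feature_index(t)] for t in tokens)

print(predict("the quick brown fox".split()))
```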

A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n, the birthday paradox suggests a collision if m > n^0.5, essentially because there are about m^2 pairs. This is pretty bad, because it says that with a vocabulary of 10^5 features, you might need to have 10^10 entries in your table.

It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about 4m^3/n^2. Plugging in numbers, we see that this implies perhaps only n = 10^8 entries are required to avoid a collision. This number can be further reduced to 10^7 by increasing the degree of duplication to 4 or more.
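A quick back-of-the-envelope check of these two bounds (my own arithmetic, using the approximate formulas above):

```python
# Quick check of the two approximate collision bounds discussed above:
# m^2/n for a single hashed entry per word, 4m^3/n^2 with duplication.
m = 10**5  # vocabulary size

for n in (10**8, 10**10):
    single = m**2 / n             # ~ expected number of colliding pairs
    duplicated = 4 * m**3 / n**2  # ~ Pr[some word has both entries collide]
    print(f"n = {n:.0e}: single-entry bound {single:.3g}, duplicated bound {duplicated:.3g}")
```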

The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.

This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collisions, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.

(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.