Information Theory – Machine Learning (Theory)

2/2/2011

User preferences for search engines

I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me.

As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view.

The issue: Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar, and later noticed some of the synthetic responses appearing on Bing for the synthetic queries.

There are two kinds of disagreement which people might have with this.

One is the privacy disagreement “Big Brother Microsoft is looking at what I search and using it”. I’m sympathetic on this count, but also sympathetic to the counter argument, that the data collected has value and can enhance the results for all users. In the end, I think companies should simply do their best to accept a user’s wishes, so those who want privacy can have it, and those who want to contribute their data towards improving a search engine can do so. The precise manner for achieving this by opt-in, opt-out, differential privacy, anonymization or other techniques is not entirely clear to me.

Let’s assume the privacy issue is dealt with. This is at least partly and possibly grossly untrue, but I want to focus on the other issue, and this assumption simplifies its discussion: when the privacy issue is dealt with, a user and their internet browser are synonymous, as the agent’s actions are a true reflection of the user’s preferences.

The other issue is an originality disagreement, which much of the discussion focuses on. What I believe happened was a user feedback process, where users queried Google, clicked on a result, informed Microsoft/Bing of the query and clicked result, and their preference was used to promote the search result within Bing. Now, there is a slippery-slope of questions. Should a user be allowed to:

Reveal to their chosen search engine their preferred result?
Reveal to a competitor’s search engine their preferred result?

If you answer ‘no’ to the first, you are deeply against user freedom in a manner I can’t sympathize with. If you answer ‘yes’ to the first, and ‘no’ to the second, then you are still somewhat against user freedom. This isn’t too crazy a stance, as various people sell information and require of their users that it not be retransmitted. One of the more famous examples of this is the Bloomberg Terminal. However, in all instances I’m aware of, users knowingly agree to a contract providing access to the information with limitations. Google never entered into such a contract with its users, and I don’t know a sound basis for even an implicit contract. So, my answers are “yes, and yes” here.

But this doesn’t entirely deal with the issue of originality. You could argue that it’s ok for Microsoft to take advantage of revealed user interaction, but it’s still a matter of following rather than leading. This argument is simplistic and wrong, as I expect all informed parties already understand. A basic truth, seen in many ways, is that the proper incorporation of new sources of information always improves results. This is true in machine learning, where sample complexity results and cotraining formalize mechanisms and values of incorporating additional information, and it was heavily used by all competitive teams in the Netflix Competition. More generally, it’s true in basic knowledge engineering, where people fuse sources of information to create a better system, and I’m virtually certain it’s true of the ranking algorithms behind Google and Bing, which are surely complex beasts taking into account many sources of information. I know no details about the algorithm which Microsoft is using, but it’s quite plausible that they incorporated this information well enough to improve the quality of their results, perhaps in some instances so they are better than Google’s or the earlier version of Bing’s. If that’s the case, Google will either follow Microsoft’s lead, taking into account user feedback as Microsoft does, or risk becoming obsolete.

We can also think about things in terms of the future. A basic truth is that building a successful search engine is extraordinarily difficult. This is revealed by search market share, but also by simply thinking about the logistics involved. You need to crawl the web, have server farms all over the world (because the speed of light just isn’t fast enough), and incorporate many sources of information in just the right way in order to succeed, all while adversaries try to corrupt your results. If we prefer a future where there is a healthy competition amongst search engines, then it’s important to lower these barriers to entry so new people with new ideas can more easily test them out. One way to lower the barrier to entry is to accept that users can share their interaction, even with a competitor’s search engine.

Perhaps it’s inevitable that Amit Singhal has a viewpoint driving towards a monopoly on internet search. However, Google has generally been relatively good about supporting a rich ecosystem of innovation for information technology development, so I am still somewhat surprised. I would be more sympathetic to a position for allowing users of Internet Explorer a built-in means to choose to share their search behavior with Google or other search engines on an equal footing.

6/16/2006, 6/17/2006

Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i

is given by p = e^{\sum_i \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.)
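To see the theorem at work on a small case, here is a minimal sketch that computes the maximum entropy distribution on a six-sided die constrained to have mean 4.5. The die, the target mean, and the function name `maxent_distribution` are illustrative assumptions, not something from the result itself. With a single feature the problem has one lambda, and since the expectation increases with lambda, plain bisection is enough to find it.

```python
import math

def maxent_distribution(values, feature, target, lo=-50.0, hi=50.0, iters=200):
    """Max-entropy distribution on `values` subject to E_p[feature] = target.

    Uses the exponential-family form p(x) proportional to exp(lam * feature(x))
    and finds lam by bisection, since E_lam[feature] is increasing in lam.
    """
    feats = [feature(x) for x in values]

    def dist(lam):
        # Subtract the largest exponent for numerical stability.
        m = max(lam * f for f in feats)
        w = [math.exp(lam * f - m) for f in feats]
        z = sum(w)  # normalization constant Z (up to the stability shift)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(p * f for p, f in zip(dist(lam), feats))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, dist(lam)

faces = list(range(1, 7))
lam, p = maxent_distribution(faces, float, 4.5)
print(lam, p)
```

The resulting probabilities grow geometrically across the faces, which is exactly the e^{\lambda f}/Z shape, and a target mean of 3.5 would give lambda = 0, i.e. the uniform distribution.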

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.

So I realized a few months back, that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire, have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)
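For concreteness, here is a sketch of the l_1 case of this relation, written in the notation used above. The slack widths \beta_i are symbols I am introducing for illustration. Relaxing each expectation constraint to a box in the primal yields exactly an l_1 penalty on the lambdas in the dual:

```latex
% Primal: maximum entropy with relaxed (uncertain) constraints
\max_p \; H(p)
\quad \text{s.t.} \quad \bigl| E_p[f_i] - c_i \bigr| \le \beta_i \quad \text{for all } i

% Dual: penalized fit over the exponential family p_\lambda = e^{\sum_i \lambda_i f_i}/Z(\lambda)
\min_\lambda \; \log Z(\lambda) \;-\; \sum_i \lambda_i c_i \;+\; \sum_i \beta_i \, |\lambda_i|
```

When the c_i are empirical averages, the first two terms of the dual are the negative average log-likelihood of p_\lambda (up to a constant), so setting every \beta_i = 0 recovers plain maximum likelihood, and larger \beta_i (more uncertainty about the constraints) means a heavier penalty.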

It turns out the idea generalizes quite a bit. In Robust design of biological experiments, P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.

1/23/2006

On Coding via Mutual Information & Bayes Nets

Say we have two random variables X,Y with mutual information I(X,Y). Let’s say we want to represent them with a Bayes net of the form X <- M -> Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy wise). The data processing inequality means that H(M)>=I(X,Y), so the mutual information is a lower bound on how simple M could be. Furthermore, if such a construction existed it would have a nice coding interpretation: one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X).
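Spelled out, the lower bound is a two-step chain (writing I(X;Y) for the I(X,Y) above, and assuming M is discrete):

```latex
I(X;Y) \;\le\; I(X;M) \;=\; H(M) - H(M \mid X) \;\le\; H(M)
```

The first inequality is the data processing inequality for the Markov chain X - M - Y, and the last step holds because conditional entropy is nonnegative. Equality throughout would require M to be a deterministic function of X (and, by symmetry, of Y) while still carrying everything X and Y share.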

It turns out that such a construction does not exist in general (Thx Alina Beygelzimer for a counterexample! see below for the sketch).

What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task. If all we care about is an information-theoretic, compact hidden state, then constructing an accurate Bayes net might be harder, due to the fact that it takes more bits to specify the distribution of the hidden state. In fact, since we usually condition on the data, it seems odd that we should bother specifying a (potentially more complex) generative model. What are the alternatives? The information bottleneck seems interesting, though this has peculiarities of its own.

Alina’s counterexample:

Here is the joint distribution P(X,Y). Sample binary X from an unbiased coin. Now choose Y to be the OR function of X and some other ‘hidden’ random bit (uniform). So the joint is:

P(0,0)=1/4
P(0,1)=1/4
P(1,0)=0
P(1,1)=1/2

Note P(X=1)=1/2 and P(Y=1)=3/4. Here,

I(X,Y) = 3/4 log_2 (4/3) ~= 0.31 bits
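The value is easy to check numerically. Here is a small sketch that computes the mutual information (in bits) directly from the joint table above; the function name `mutual_information` is my own.

```python
import math

# Joint P(X, Y) from the counterexample: Y = X OR B, with X and B fair bits.
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.5}

def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Zero-probability cells contribute nothing (0 log 0 = 0).
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

mi = mutual_information(joint)
print(round(mi, 4))  # 0.3113
```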

The rest of the proof showing that this is not achievable in a ‘compact’ Bayes net is in a comment.

10/16/200510/19/2005

Complexity: It’s all in your head

One of the central concerns of learning is to understand and to
prevent overfitting. Various notions of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.

The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.

In a sense, “complexity” fundamentally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes mean that beliefs are spread
thinly. In its simplest form, this phenomenon gives the log(1/n) term
(the log of a uniform prior) for n hypotheses in classic PAC bounds.
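As an illustration, here is a sketch of one standard Occam's razor style bound (Hoeffding plus a union bound weighted by the prior; other forms exist, and this particular one is my choice rather than anything specific to the post). With probability at least 1 - delta over m i.i.d. samples, every hypothesis h has true error at most its empirical error plus sqrt((ln(1/P(h)) + ln(1/delta)) / (2m)). The function name `occam_gap` is made up for this example.

```python
import math

def occam_gap(prior_prob, m, delta=0.05):
    """Gap between true and empirical error allowed by the Occam bound.

    prior_prob: prior probability P(h) assigned to the hypothesis
    m:          number of i.i.d. training examples
    delta:      allowed failure probability of the bound
    """
    surprise = math.log(1.0 / prior_prob)  # "complexity" is just -ln P(h)
    return math.sqrt((surprise + math.log(1.0 / delta)) / (2.0 * m))

# Uniform prior over n hypotheses: the surprise term is ln(n).
n, m = 1024, 10000
uniform_gap = occam_gap(1.0 / n, m)

# Spreading the same mass unevenly helps the favored hypotheses
# and hurts the disfavored ones.
favored_gap = occam_gap(0.5, m)
disfavored_gap = occam_gap(1e-9, m)
print(favored_gap, uniform_gap, disfavored_gap)
```

The only place the hypothesis enters the bound is through its prior probability, which is the sense in which “complex” just means “surprising”.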

In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
“non-defective distributions”.
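A quick way to see what “taking full advantage of probability mass” means in coding terms is the Kraft sum. A prefix code with lengths l_i implicitly assigns probability 2^{-l_i} to each outcome, and the code is non-defective when these sum to exactly one. The sketch below uses two made-up sets of code lengths.

```python
from fractions import Fraction

def kraft_sum(code_lengths):
    """Total probability mass implied by binary code lengths: sum of 2^-l."""
    return sum(Fraction(1, 2 ** l) for l in code_lengths)

# 1/2 + 1/4 + 1/8 + 1/8 = 1: every bit of mass is used.
complete = kraft_sum([1, 2, 3, 3])

# 1/4 + 1/4 + 1/8 + 1/8 = 3/4: a quarter of the mass is wasted (defective).
defective = kraft_sum([2, 2, 3, 3])

print(complete, defective)  # 1 3/4
```

In the defective case some outcome could be given a shorter code (a higher prior) at no cost to the others, which is the waste a good learning algorithm avoids.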

So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundamentally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.