Machine Learning (Theory)

10/26/2012

ML Symposium and Strata/Hadoop World

Tags: Conferences,Workshop jl@ 11:40 am

The New York ML symposium was last Friday. There were 303 registrations, up a bit from last year. I particularly enjoyed talks by Bill Freeman on vision and ML, Jon Lenchner on strategy in Jeopardy, and Tara N. Sainath and Brian Kingsbury on deep learning for speech recognition. If anyone has suggestions or thoughts for next year, please speak up.

I also attended Strata + Hadoop World for the first time. This is primarily a trade conference rather than an academic conference, but I found it pretty interesting as a first time attendee. This is ground zero for the Big data buzzword, and I see now why. It’s about data, and the word “big” is so ambiguous that everyone can lay claim to it. There were essentially zero academic talks. Instead, the focus was on war stories, product announcements, and education. The general level of education is much lower—explaining Machine Learning to the SQL educated is the primary operating point. Nevertheless that’s happening, and the fact that machine learning is considered a necessary technology for industry is a giant step for the field. Over time, I expect the industrial side of Machine Learning to grow, and perhaps surpass the academic side, in the same sense as has already occurred for chip design. Amongst the talks I could catch, I particularly liked the Github, Zillow, and Pandas talks. Ted Dunning also gave a particularly masterful talk, although I have doubts about the core Bayesian Bandit approach(*). The streaming k-means algorithm they implemented does look quite handy.

(*) The doubt is the following: prior elicitation is generally hard, and Bayesian techniques are not robust to misspecification. This matters in standard supervised settings, but it may matter more in exploration settings where misspecification can imply data starvation.

5 Comments to “ML Symposium and Strata/Hadoop World”
  1. Ted Dunning says:

    Aw shucks, John. I hate to dispute anything you say after this.

    BUT …

    The Bayesian Bandit term is a unifying sort of code name for presenting Thompson sampling (randomized probability matching) to a lay audience. Thompson sampling is well established in the literature both empirically and with fairly tight theoretical bounds relative to optimal regret. I have been able to replicate the empirical results quite easily.

    The Bayesian methods used are simply an easy way to sample from a distribution defined by an integral equivalent to the Bayesian posterior. This admits a rather clever solution that avoids the need to evaluate the integral directly. The prior distributions used are typically quite weak and are easily shown to have very little effect on the cumulative regret when even slightly plausible priors are used. This is hardly surprising since the priors in question are typically equivalent to only a few trials and thus a well-tuned prior is only likely to accelerate convergence by a trivial amount.

    As such, questions of the difficulty of prior elicitation are pretty much moot since it doesn’t much matter, at least in direct bandit problems.

    The main claims I make are:

    a) Thompson sampling has a bound on cumulative regret quite close to theoretical optimum

    b) simulations demonstrate that this bound is commonly achieved in artificial settings in a fashion relatively insensitive to the prior

    c) parametrized formulations allow contextual bandits to be solved using essentially the same algorithm

    d) actual deployments show results quite similar to that expected from simulation which also exceed the performance of competitive systems

    The arguments for (a) are relatively straightforward. I have verified (b) and (c). Microsoft researchers Graepel, Candela, Borchert and Herbrich have published empirical verification of (d) with the AdPredictor paper.

    What is left in dispute?

  2. jl says:

    Reasoning about misspecification is tricky, but it’s easy to concoct a theorem statement of the form: There exists P,P’ where P = a prior distribution over the way the world behaves and P’ = an elicited distribution such that regret scales as max_h P(h)/P’(h) with significant probability. You can even do this in the bandit setting—just make the prior on arm A a factor of 1000 less than the prior on arm B and notice that you need to sample ~1000 times before you get any information about A.

    Using the obvious uniform priors and a bit of common sense avoids this problem for bandits.

    In contextual bandit problems, this becomes much less clear, because failure to elicit a reasonable prior is almost unavoidable. The only question is the degree of misspecification, and a factor of 1000 is quite plausible.

    There are other algorithms with better worst-case behavior out there. I’d point out in particular EXP4(which works in a fully adversarial settings) and Policy Elimination (which works in an IID setting).

    The standard counter to optimizing worst-case behavior is that you care about your case rather than the worst case. That’s fair, but it should not be used to argue for a method that could be radically worse due to misconfidence. The fundamental issue is that truly specifying a reasonable prior in a high dimensional world (and then computing on it) is often quite hard both in skill of the specifier and cycles on the computer.

    W.r.t
    (a) It’s usually possible to make a positive statement about an algorithm with a sufficiently powerful ‘if’ clause, but the power of the ‘if’ clause merits very careful consideration. If you point out a particular paper, I’ll take a look.
    (b) You flashed one slide which had a prior that looked like: |___|___| for a coin that was either fair, all heads, or all tails. If you plug this prior into your simulation when the coins are in fact .25 or .75, you’ll end up with significant regret. It’s correct to say “insensitive to deliberately robust priors”.
    (c) To some extent yes, but again the robustness of the solution becomes a greater issue.
    (d) The primary claim in the AdPredictor papers seems to be that it works better than Naive Bayes. Discussions of Thompson Sampling are done, but no results evaluated.

    My general opinion of Thompson Sampling at the moment is that it’s something to try which may work well, but which has some obvious robustness drawbacks.

  3. Shipra Agrawal says:

    The claim (a) by Dunning is indeed provable without need of any special conditions. Please refer to the following recent papers. These papers show that THompson Sampling has provably optimal regret for MAB problem (near optimal for contextual bandits).

    http://arxiv.org/abs/1209.3353
    http://arxiv.org/abs/1209.3352
    http://arxiv.org/abs/1205.4217
    http://arxiv.org/abs/1111.1797

  4. jl says:

    Thanks Shipra. You’ve done quite a bit of interesting work here. There are significant differences in what is considered a special condition.

    1209.3353 assumes MAB. (i.e. no context)
    1209.3352 assumes linearity in the rewards.
    1205.4217 assumes MAB.
    1111.1797 assumes MAB.

    MAB seems like a special condition to me, because it is rare in practice to have no relevant contextual information. Linearity is broad enough to be useful, but it’s also often not satisfied. Both MAB and Linear MAB are significantly more special EXP4 or the Policy Elimination results.

  5. Sanjay Chawla says:

    The “elephant in the room” for practical (and perhaps theoretical) machine learning is “feature engineering”. Maybe that should be the topic of the ML Symposium. This year (2012) I had the good fortune to spend a large amount of time working on a very large scale and very high-dimensional machine learning problem and the bottleneck was always how to generate features from a limited set of attributes which will improve classification accuracy. Existing theory is practically of little use. It will be great if some machine learning theorist can guide a practitioner when to stop along the following lines.

    Thm: If the data set D with attribute set A has property P then with high-probability no amount of feature engineering will result in an improvement of the F(A) –> {0,1} classification problem.

    A simple case where the above “theorem” will/may hold true is when A has a diagonal covariance matrix.

Leave a Reply


Powered by WordPress