An easy proof of the Chernoff-Hoeffding bound

Textbooks invariably seem to carry the proof that uses Markov’s inequality, moment-generating functions, and Taylor approximations. Here’s an easier way.

For $latex p,q \in (0,1)$, let $latex K(p,q)$ be the KL divergence between a coin of bias $latex p$ and one of bias $latex q$: $latex K(p,q) = p \ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}.$

Theorem: Suppose you do $latex n$ independent tosses of a coin of bias $latex p$. The probability of seeing $latex qn$ heads or more, for $latex q > p$, is at most $latex \exp(-nK(q,p))$. So is the probability of seeing $latex qn$ heads or less, for $latex q < p$.

Remark: By Pinsker’s inequality, $latex K(q,p) \geq 2(p-q)^2$, so the theorem in particular implies the more familiar Hoeffding bound of $latex \exp(-2n(p-q)^2)$.
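
As a quick numerical sanity check (the code and the name kl below are my own, not part of the argument), here is a small Python sketch that computes $latex K(p,q)$ as defined above and verifies Pinsker’s inequality on a grid of biases:

[sourcecode language="python"]
import math

def kl(p, q):
    # K(p, q): KL divergence between a coin of bias p and one of bias q (natural log).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Check Pinsker's inequality K(q, p) >= 2 (p - q)^2 on a grid of biases.
biases = [0.1, 0.3, 0.5, 0.7, 0.9]
for p in biases:
    for q in biases:
        if p != q:
            assert kl(q, p) >= 2 * (p - q) ** 2
print("Pinsker's inequality holds on the grid.")
[/sourcecode]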

Proof: Let’s do the $latex q > p$ case; the other is identical.

Let $latex \theta_p$ be the distribution over $latex \{0,1\}^n$ induced by a coin of bias $latex p$, and likewise $latex \theta_q$ for a coin of bias $latex q$. Let $latex S$ be the set of all sequences of $latex n$ tosses which contain $latex qn$ heads or more. We’d like to show that $latex S$ is unlikely under $latex \theta_p$.

Pick any $latex \bar{x} \in S$, with, say, $latex k \geq qn$ heads. Then:
[latex size="2"] \frac{\theta_q(\bar{x})}{\theta_p(\bar{x})} = \frac{q^k(1-q)^{n-k}}{p^k(1-p)^{n-k}} \geq \frac{q^{qn}(1-q)^{n-qn}}{p^{qn}(1-p)^{n-qn}} = \left( \frac{q}{p} \right)^{qn} \left( \frac{1-q}{1-p}\right)^{(1-q)n} = e^{n K(q,p)}.[/latex]
(The inequality holds because the ratio is increasing in $latex k$: each extra head multiplies it by $latex \frac{q(1-p)}{p(1-q)} > 1$, so among $latex k \geq qn$ it is smallest at $latex k = qn$.)

Since $latex \theta_p(\bar{x}) \leq \exp(-nK(q,p)) \theta_q(\bar{x})$ for every $latex \bar{x} \in S$, we have [latex]\theta_p(S) \leq \exp(-nK(q,p)) \theta_q(S) \leq \exp(-nK(q,p))[/latex] and we’re done.
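
To see the bound in action (the parameter values below are arbitrary choices of mine), here is a short Python sketch comparing the exact binomial tail probability with $latex \exp(-nK(q,p))$:

[sourcecode language="python"]
import math

def kl(p, q):
    # K(p, q): KL divergence between a coin of bias p and one of bias q (natural log).
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def upper_tail(n, p, q):
    # Exact probability of seeing qn heads or more in n tosses of a coin of bias p.
    k0 = math.ceil(q * n)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k0, n + 1))

n, p, q = 100, 0.5, 0.6
exact = upper_tail(n, p, q)
bound = math.exp(-n * kl(q, p))
print(f"exact tail probability: {exact:.4e}")   # about 2.8e-02
print(f"bound exp(-n K(q,p)):   {bound:.4e}")   # about 1.3e-01
[/sourcecode]

As expected, the bound holds with room to spare; the point is that it has the right exponent in $latex n$.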

COLT 2007

Registration for COLT 2007 is now open.

The conference will take place on 13-15 June, 2007, in San Diego, California, as part of the 2007 Federated Computing Research Conference (FCRC), which includes STOC, Complexity, and EC.

The website for COLT: http://www.learningtheory.org/colt2007/index.html

The early registration deadline is May 11, and the cutoff date for discounted hotel rates is May 9.

Before registering, take note that the fees are substantially lower for members of ACM and/or SIGACT than for nonmembers. If you’ve been contemplating joining either of these two societies (annual dues: $99 for ACM, $18 for SIGACT), now would be a good time!

Objective and subjective interpretations of probability

An amusing tidbit (reproduced without permission) from Herman Chernoff’s delightful monograph, “Sequential analysis and optimal design”:

The use of randomization raises a philosophical question which is articulated by the following probably apocryphal anecdote.

The metallurgist told his friend the statistician how he planned to test the effect of heat on the strength of a metal bar by sawing the bar into six pieces. The first two would go into the hot oven, the next two into the medium oven, and the last two into the cool oven. The statistician, horrified, explained how he should randomize to avoid the effect of a possible gradient of strength in the metal bar. The method of randomization was applied, and it turned out that the randomized experiment called for putting the first two pieces into the hot oven, the next two into the medium oven, and the last two into the cool oven. “Obviously, we can’t do that,” said the metallurgist. “On the contrary, you have to do that,” said the statistician.

What are arguments for and against this design? In a “larger” design or sample, the effect of a reasonable randomization scheme could be such that this obvious difficulty would almost certainly not happen. Assuming that the original strength of the bar and the heat treatment did not “interact” in a complicated nonlinear way, the randomization would virtually cancel out any effect due to a strength gradient or other erratic phenomena, and computing estimates as though these did not exist would lead to no real error. In this small problem, the effect may not be cancelled out, but the statistician still has a right to close his eyes to the design actually selected if he is satisfied with “playing fair”. That is, if he instructs an agent to select the design and he analyzes the results, assuming there are no gradients, his conclusions will be unbiased in the sense that a tendency to overestimate is balanced on the average by a tendency to underestimate the desired quantities. However, this tendency may be substantial as measured by the variability of the estimates which will be affected by substantial gradients. On the other hand, following the natural inclination to reject an obviously unsatisfactory design resulting from randomization puts the statistician in the position of not “playing fair”. What is worse for an objective statistician, he has no way of evaluating in advance how good his procedure is if he can change the rules in the middle of the experiment.

The Bayesian statistician, who uses subjective probability and must consider all information, is unsatisfied to simply play fair. When randomization leads to the original unsatisfactory design, he is aware of this information and unwilling to accept the design. In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.

Upcoming conference

The Workshop for Women in Machine Learning will be held in San Diego on October 4, 2006.

For details see the workshop website:
http://www.seas.upenn.edu/~wiml/