Exact calculation of the Binomial tail probability $\mathrm{Bin}(m,k,\epsilon) = \sum_{i=0}^{k} \binom{m}{i} \epsilon^i (1-\epsilon)^{m-i}$ (covered in the next subsection) can require computation at least proportional to $k$, which is often too expensive. For the bounds in this thesis, we will only need to calculate an upper bound on the quantity $\mathrm{Bin}(m,k,\epsilon)$. There are several inequalities which are often used for this purpose. The first of these is the Hoeffding inequality [23]. Assume that $\frac{k}{m} \leq \epsilon$; then we have:
$$\mathrm{Bin}(m,k,\epsilon) \leq e^{-2m\left(\epsilon - \frac{k}{m}\right)^2}.$$
Intuitively, this inequality can be seen as fitting a Gaussian with variance $\frac{1}{4m}$ to the Binomial distribution. For any particular $m$, the variance of the Binomial distribution is maximized when $\epsilon = \frac{1}{2}$. Therefore, the Hoeffding inequality is relatively tight when $\epsilon \approx \frac{1}{2}$. Unfortunately, the Hoeffding approximation is not tight enough for our purposes: in machine learning, our goal is to find a hypothesis with a true error rate far away from $\frac{1}{2}$, which is exactly where the Hoeffding inequality becomes loose.
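To make this looseness concrete, the following is a minimal numerical sketch (in Python; `binomial_tail` and `hoeffding_bound` are illustrative names, not code from this thesis) comparing the exact tail with its Hoeffding upper bound:

```python
import math

def binomial_tail(m, k, eps):
    # Bin(m, k, eps): probability of observing at most k errors in m
    # independent trials when the true error rate is eps.
    return sum(math.comb(m, i) * eps**i * (1 - eps)**(m - i)
               for i in range(k + 1))

def hoeffding_bound(m, k, eps):
    # Hoeffding upper bound on Bin(m, k, eps), valid for k/m <= eps.
    assert k / m <= eps
    return math.exp(-2 * m * (eps - k / m) ** 2)

print(binomial_tail(100, 2, 0.2))    # ~ 6.8e-08 (exact tail)
print(hoeffding_bound(100, 2, 0.2))  # ~ 1.5e-03
```

Even at a moderate distance from $\frac{1}{2}$, the Hoeffding bound overshoots the exact tail by several orders of magnitude.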
There is another bound, known as the "realizable bound", which applies only when $k = 0$. The realizable bound is:
$$\mathrm{Bin}(m,0,\epsilon) = (1-\epsilon)^m \leq e^{-m\epsilon}.$$
The realizable bound is noticeably tighter, with an exponent proportional to $\epsilon$ rather than $\epsilon^2$. The disadvantage of the realizable bound is that it only applies in a very limited setting: when our empirical error rate happens to be $0$.
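The size of the gap is easy to see with a worked example (the values $m = 1000$, $\epsilon = 0.01$, $k = 0$ are chosen purely for illustration):
$$e^{-m\epsilon} = e^{-10} \approx 4.5 \times 10^{-5}, \qquad e^{-2m\epsilon^2} = e^{-0.2} \approx 0.82.$$
The realizable bound is meaningful here, while the Hoeffding bound is nearly vacuous.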
Luckily, there exists a quickly calculable bound which achieves the generality of the Hoeffding bound along with the tightness of the realizable bound. We have the relative entropy Chernoff bound [7] for $\frac{k}{m} < \epsilon$:
$$\mathrm{Bin}(m,k,\epsilon) \leq e^{-m\,\mathrm{KL}\left(\frac{k}{m}\,\middle\|\,\epsilon\right)} \tag{3.2.1}$$
where $\mathrm{KL}(q \,\|\, p) = q \ln \frac{q}{p} + (1-q) \ln \frac{1-q}{1-p}$ is the Kullback-Leibler divergence between Bernoulli distributions with biases $q$ and $p$.
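The bound (3.2.1) is as cheap to evaluate as the other two. A minimal sketch, assuming the Bernoulli KL divergence defined above (`bernoulli_kl` and `chernoff_bound` are illustrative names, not code from this thesis):

```python
import math

def bernoulli_kl(q, p):
    # KL(q || p) between Bernoulli(q) and Bernoulli(p),
    # with the convention 0 * ln 0 = 0 at the boundaries.
    kl = 0.0
    if q > 0:
        kl += q * math.log(q / p)
    if q < 1:
        kl += (1 - q) * math.log((1 - q) / (1 - p))
    return kl

def chernoff_bound(m, k, eps):
    # Relative entropy Chernoff upper bound on Bin(m, k, eps),
    # valid for k/m < eps < 1 (equation 3.2.1).
    assert k / m < eps < 1
    return math.exp(-m * bernoulli_kl(k / m, eps))

print(chernoff_bound(100, 2, 0.2))  # ~ 2.3e-07, vs ~ 1.5e-03 from Hoeffding
```

Note that at $k = 0$ the exponent is $-m \ln \frac{1}{1-\epsilon}$, so the bound evaluates to $(1-\epsilon)^m = \mathrm{Bin}(m,0,\epsilon)$ exactly, which is even tighter than the realizable bound $e^{-m\epsilon}$.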
We are concerned with the different bounds here because much of the learning theory literature (see [50], [20], [39] for examples) works with either the realizable bound or the Hoeffding bound, or both. In contrast, we will work with either the relative entropy Chernoff bound or the exact tail probability, $\mathrm{Bin}(m,k,\epsilon)$. There are several advantages to this approach: these quantities apply for any empirical error rate, and they are never looser than the Hoeffding or realizable bounds.
The principal disadvantage of this approach is that neither the relative entropy Chernoff bound nor $\mathrm{Bin}(m,k,\epsilon)$ is analytically invertible. Lack of invertibility is a theoretical disadvantage because it means we cannot easily parameterize our "precision" parameter $\epsilon$ in terms of the sample size $m$, the number of errors $k$, and the confidence $\delta$. Nonetheless, this is not a severe computational disadvantage, because the quantity $\mathrm{KL}\left(\frac{k}{m}\,\middle\|\,\epsilon\right)$ is convex (and monotonically increasing for $\epsilon > \frac{k}{m}$) in $\epsilon$, implying that a binary search is capable of solving the inequality. The process of (and need for) inversion is discussed next.
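As a sketch of how this inversion can be carried out numerically (reusing `chernoff_bound` from the sketch above; the function name `invert_chernoff` and the tolerance are illustrative choices, not an algorithm specified by this thesis):

```python
def invert_chernoff(m, k, delta, tol=1e-10):
    # Binary search for the largest eps satisfying
    # exp(-m * KL(k/m || eps)) >= delta. Because the bound decreases
    # monotonically in eps on (k/m, 1), this eps is an upper
    # confidence bound on the true error rate at confidence delta.
    lo, hi = k / m, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if chernoff_bound(m, k, mid) >= delta:
            lo = mid  # bound still above delta: crossing lies to the right
        else:
            hi = mid  # bound below delta: crossing lies to the left
    return hi

# With 2 errors on 100 examples and delta = 0.05, the true error rate
# exceeds the returned value with probability at most 0.05:
print(invert_chernoff(100, 2, 0.05))  # ~ 0.0747
```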