{"id":233,"date":"2006-10-08T09:16:52","date_gmt":"2006-10-08T15:16:52","guid":{"rendered":"http:\/\/hunch.net\/?p=233"},"modified":"2006-10-08T09:16:52","modified_gmt":"2006-10-08T15:16:52","slug":"incompatibilities-between-classical-confidence-intervals-and-learning","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=233","title":{"rendered":"Incompatibilities between classical confidence intervals and learning."},"content":{"rendered":"<p>Classical confidence intervals satisfy a theorem of the form: For some data sources <em>D<\/em>,<br \/>\n<center><em>Pr<sub>S ~ D<\/sub>(f(D) > g(S)) > 1-d<\/em><\/center><br \/>\nwhere <em>f<\/em> is some function of the distribution (such as the mean) and <em>g<\/em> is some function of the observed sample <em>S<\/em>.  The constraints on <em>D<\/em> can vary between &#8220;Independent and identically distributed (IID) samples from a Gaussian with an unknown mean&#8221; and &#8220;IID samples from an arbitrary distribution <em>D<\/em>&#8221;.  There are even some confidence intervals which do not require IID samples.<\/p>\n<p>Classical confidence intervals often confuse people.  They do <em>not<\/em> say &#8220;with high probability, for my observed sample, the bound holds&#8221;.  Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on <em>D<\/em> are satisfied), then you are not often wrong.  Restated, they tell you something about what a safe procedure is in a stochastic world where <em>d<\/em> is the safety parameter.<\/p>\n<p>There are a number of results in theoretical machine learning which use confidence intervals.  
For example, <\/p>\n<ol>\n<li>The <a href=\"http:\/\/www.cis.upenn.edu\/~mkearns\/papers\/reinforcement.ps\">E<sup>3<\/sup><\/a> algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability.<\/li>\n<li><a href=\"http:\/\/www.ift.ulaval.ca\/~mmarchand\/publications\/HalfICML.pdf\">Set Covering Machines<\/a> minimize a confidence interval upper bound on the true error rate of a learned classifier.<\/li>\n<li>The <a href=\"https:\/\/hunch.net\/~jl\/projects\/agnostic_active\/agnostic-active.pdf\">A<sup>2<\/sup><\/a> algorithm uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning.<\/li>\n<\/ol>\n<p>Suppose that we want to generalize these algorithms in a reductive style.  The goal would be to train a regressor to predict the output of <em>g(S)<\/em> for new situations.  For example, a good regression prediction of <em>g(S)<\/em> might allow E<sup>3<\/sup> to be applied to much larger state spaces.  Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval.<\/p>\n<ol>\n<li>It&#8217;s difficult to imagine a constructive sampling mechanism.  In a large state space, we may never encounter the same state twice, so we cannot form meaningful examples of the form &#8220;for this state-action, the correct confidence interval is <em>y<\/em>&#8221;.<\/li>\n<li>When we think of successful learning, we typically think of it in an l<sub>1<\/sub> sense&#8212;the expected error rate over the data generating distribution is small.  Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds.  This mismatch appears unaddressable.<\/li>\n<\/ol>\n<p>It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems.  
Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Classical confidence intervals satisfy a theorem of the form: For some data sources D, PrS ~ D(f(D) > g(S)) > 1-d where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between &#8220;Independent and identically distributed (IID) &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/hunch.net\/?p=233\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Incompatibilities between classical confidence intervals and learning.&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[29,12,11,34],"tags":[],"class_list":["post-233","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-reductions","category-reinforcement","category-statistics"],"_links":{"self":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/233","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=233"}],"version-history":[{"count":0,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/233\/revisions"}],"wp:attachment":[{"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=233"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcat
egories&post=233"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=233"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}