{"id":1253,"date":"2010-03-15T16:53:06","date_gmt":"2010-03-15T22:53:06","guid":{"rendered":"http:\/\/hunch.net\/?p=1253"},"modified":"2010-03-15T16:53:06","modified_gmt":"2010-03-15T22:53:06","slug":"the-efficient-robust-conditional-probability-estimation-problem","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=1253","title":{"rendered":"The Efficient Robust Conditional Probability Estimation Problem"},"content":{"rendered":"<p>I&#8217;m offering a reward of $1000 for a solution to this problem.  This joins the <a href=\"https:\/\/hunch.net\/?p=29\">cross validation problem<\/a> which I&#8217;m offering a <a href=\"https:\/\/hunch.net\/?p=91\">$500 reward<\/a> for.  I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value.  While it&#8217;s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be \ud83d\ude42<\/p>\n<h2>The Problem<\/h2>\n<p>The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability <em>P(y|x)<\/em> where robustness and efficiency are measured using techniques from learning reductions. <\/p>\n<p>In particular, suppose we have access to a binary regression oracle <em>B<\/em> which has two interfaces&#8212;one for specifying training information and one for testing. Training information is specified as <em>B(x&#8217;,y&#8217;)<\/em> where <em>x&#8217;<\/em> is a feature vector and <em>y&#8217;<\/em> is a scalar in <em>[0,1]<\/em> with no value returned.  
Testing is done according to <em>B(x&#8217;)<\/em> with a value in <em>[0,1]<\/em> returned.<\/p>\n<p>A learning reduction consists of two algorithms <em>R<\/em> and <em>R<sup>-1<\/sup><\/em> which transform examples from the original input problem into examples for the oracle and then transform the oracle&#8217;s predictions into a prediction for the original problem.<\/p>\n<p>The algorithm <em>R<\/em> takes as input a single example <em>(x,y)<\/em> where <em>x<\/em> is a feature vector and <em>y<\/em> is a discrete variable taking values in <em>{1,&#8230;,k}<\/em>.  <em>R<\/em> then specifies a training example <em>(x&#8217;,y&#8217;)<\/em> for the oracle <em>B<\/em>.  <em>R<\/em> can then create another training example for <em>B<\/em> based on all available information. This process repeats some finite number of times before halting without returning information.<\/p>\n<p>A basic observation is that for any oracle algorithm, a distribution <em>D(x,y)<\/em> over multiclass examples and a reduction <em>R<\/em> induce a distribution over a sequence <em>(x&#8217;,y&#8217;)<sup>*<\/sup><\/em> of oracle examples. We collapse this into a distribution <em>D'(x&#8217;,y&#8217;)<\/em> over oracle examples by drawing uniformly from the sequence.<\/p>\n<p>The algorithm <em>R<sup>-1<\/sup><\/em> takes as input a single example <em>(x,y)<\/em> and returns a value in <em>[0,1]<\/em> after using (only) the testing interface of <em>B<\/em> zero or more times.<\/p>\n<p>We measure the power of an oracle and a reduction according to squared-loss regret. 
In particular we have:<br \/>\n<center><br \/>\nreg<em>(D,R<sup>-1<\/sup>)=E<sub>(x,y)~ D<\/sub>[(R<sup>-1<\/sup>(x,y)-D(y|x))<sup>2<\/sup>] <\/em><\/center><br \/>\n and similarly letting <em>m<sub>x&#8217;<\/sub>=E<sub>(x&#8217;,y&#8217;)~ D&#8217;<\/sub>[y&#8217;]<\/em>.<br \/>\n<center><br \/>\nreg<em>(D&#8217;,B)=E<sub>(x&#8217;,y&#8217;)~ D&#8217;<\/sub>[(B(x&#8217;) &#8211; m<sub>x&#8217;<\/sub>)<sup>2<\/sup>]<\/em><\/center><\/p>\n<p>The open problem is to specify <em>R<\/em> and <em>R<sup>-1<\/sup><\/em> satisfying the following theorem:<br \/>\n<strong><br \/>\nFor all multiclass distributions <em>D(x,y)<\/em>, for all binary oracles <em>B<\/em>: The computational complexity of <em>R<\/em> and <em>R<sup>-1<\/sup><\/em> is <em>O(log k)<\/em><br \/>\nand <center><br \/>\nreg<em>(D,R<sup>-1<\/sup>) &lt;= C<\/em> reg<em>(D&#8217;,B)<\/em><\/center><br \/>\nwhere <em>C<\/em> is a universal constant.<br \/>\n<\/strong><br \/>\nAlternatively, this open problem is resolved by proving that no deterministic algorithms <em>R,R<sup>-1<\/sup><\/em> satisfy the above theorem statement.<\/p>\n<h2>Motivation<\/h2>\n<p>The problem of conditional probability estimation is endemic to machine learning applications. In fact, in some branches of machine learning, this is simply considered &#8220;the problem&#8221;. Typically, conditional probability estimation is done in situations where the conditional probability of only one bit is required; however, there are a growing number of applications where a well-estimated conditional probability over a more complex object is required. For example, all known methods for solving general contextual bandit problems require knowledge of, or good estimation of, <em>P(a | x)<\/em> where <em>a<\/em> is an action.<\/p>\n<p>There is a second, intrinsic motivation: matching the lower bound. 
No method faster than <em>O(log k)<\/em> can be imagined because the label <em>y<\/em> requires <em>log<sub>2<\/sub> k<\/em> bits to specify and hence read. Similarly, it&#8217;s easy to prove no learning reduction can provide a regret ratio with <em>C&lt;1<\/em>.<\/p>\n<p>The motivation for using the learning reduction framework to specify this problem is a combination of its generality and the empirical effectiveness of learning reductions in applications. Any solution to this will be general because any oracle <em>B<\/em> can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierarchical (insert your favorite adjective here) structure.<\/p>\n<h2>Related Results<\/h2>\n<p>The state of the art is summarized <a href=\"http:\/\/arxiv.org\/abs\/0903.4217\">here<\/a>, which shows it&#8217;s possible to have a learning reduction satisfying the above theorem with either:<\/p>\n<ol>\n<li> <em>C<\/em> replaced by <em>(log<sub>2<\/sub> k)<sup>2<\/sup><\/em> (using a binary tree structure)<\/li>\n<li> or the computational time increased to <em>O(k)<\/em> (using an error correcting code structure).<\/li>\n<\/ol>\n<p>Hence, answering this open problem in the negative shows that there is an inherent computation vs. 
robustness tradeoff.<\/p>\n<p>There are two other closely related problems where similar analysis can be done.<\/p>\n<ol>\n<li> For multiclass classification, where the goal is predicting the most likely class, a result analogous to the open problem is provable using <a href=\"http:\/\/arxiv.org\/abs\/0902.3176\">error correcting tournaments<\/a>.<\/li>\n<li> For multiclass classification in a partial label setting, <a href=\"http:\/\/arxiv.org\/abs\/0812.4044\">no learning reduction can provide a constant regret guarantee<\/a>.<\/li>\n<\/ol>\n<h2>Silly tricks that don&#8217;t work<\/h2>\n<p>Because learning reductions are not familiar to everyone, it&#8217;s helpful to note certain tricks which do not work here, to prevent false leads and provide some intuition.<\/p>\n<p><strong>Ignore <em>B<\/em>&#8217;s predictions and use your favorite learning algorithm instead.<\/strong><\/p>\n<p>This doesn&#8217;t work, because the quantification is for all <em>D<\/em>. Any specified learning algorithm will have some <em>D<\/em> on which it has nonzero regret. On the other hand, because <em>R<\/em> calls the oracle at least once, there is a defined induced distribution <em>D&#8217;<\/em>. Since the theorem must hold for all <em>D<\/em> and <em>B<\/em>, it must hold for a <em>D<\/em> your specified learning algorithm fails on and for a <em>B<\/em> for which reg<em>(D&#8217;,B)=0<\/em>, implying the theorem is not satisfied.<\/p>\n<p><strong>Feed random examples into <em>B<\/em> and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.<\/strong><\/p>\n<p>This doesn&#8217;t work because the theorem is stated in terms of squared-loss regret rather than squared loss. 
In particular, if the oracle is given examples of the form <em>(x&#8217;,y&#8217;)<\/em> where <em>y&#8217;<\/em> is uniformly at random either <em>0<\/em> or <em>1<\/em>, any oracle specifying <em>B(x&#8217;)=0.5<\/em> has zero regret.<\/p>\n<p><strong>Feed pseudorandom examples into <em>B<\/em> and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.<\/strong><\/p>\n<p>This doesn&#8217;t work, because the quantification is &#8220;for all binary oracles <em>B<\/em>&#8221;, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret).<\/p>\n<p><strong>Just use Boosting to drive the LHS to zero.<\/strong><\/p>\n<p>Boosting theorems require a stronger oracle&#8212;one which provides an edge over some constant baseline for each invocation. The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations.<\/p>\n<p><strong>Take an existing structure, parameterize it, randomize over the parameterization, and then average over the random elements.<\/strong><\/p>\n<p>Employing this approach is not straightforward, because the average in <em>D&#8217;<\/em> is over an increased number of oracle examples. Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I&#8217;m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. 
While it&#8217;s unlikely these rewards are worth your time on &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/hunch.net\/?p=1253\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;The Efficient Robust Conditional Probability Estimation Problem&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[29,16,12],"tags":[],"class_list":["post-1253","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-problems","category-reductions"],"_links":{"self":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/1253","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1253"}],"version-history":[{"count":0,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/1253\/revisions"}],"wp:attachment":[{"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1253"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1253"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1253"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}