{"id":250,"date":"2007-02-02T23:05:57","date_gmt":"2007-02-03T05:05:57","guid":{"rendered":"http:\/\/hunch.net\/?p=250"},"modified":"2007-02-02T23:05:57","modified_gmt":"2007-02-03T05:05:57","slug":"thoughts-regarding-is-machine-learning-different-from-statistics","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=250","title":{"rendered":"Thoughts regarding &#8220;Is machine learning different from statistics?&#8221;"},"content":{"rendered":"<p>Given John&#8217;s recent posts on CMU&#8217;s new machine learning department and &#8220;Deep Learning,&#8221; I asked for an opportunity to give a computational learning theory perspective on these issues.<\/p>\n<p>To my mind, the answer to the question &#8220;Are the core problems of machine learning different from the core problems of statistics?&#8221; is a clear Yes.  The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example, note that if P=NP &#8211; which, for all we know, is true &#8211; then we would suddenly find almost all of our favorite machine learning problems considerably more tractable).<\/p>\n<p>If the central question of statistical learning theory were crudely summarized as &#8220;given a hypothesis with a certain loss bound over a training set, how well will it generalize?&#8221; then the central question of computational learning theory might be &#8220;how can we find such a hypothesis efficiently (e.g., in polynomial time)?&#8221;<\/p>\n<p>With this in mind, let&#8217;s spend the rest of the post thinking about agnostic learning.  The following definition is due to Kearns, Schapire, and Sellie:<\/p>\n<p>Fix a function class C. 
Given a training set drawn from an arbitrary distribution D and *arbitrarily labeled* (that is, we are given samples from an arbitrary distribution D over some set <i>X x Y<\/i>), efficiently output a hypothesis that is competitive with the best function in C with respect to D.  If there exists a provably efficient solution for the above problem, we say C is agnostically learnable with respect to D.  Notice the difference between agnostic and PAC learning: we *do not assume* that the data is labeled according to a fixed function (this assumption is a common criticism of the PAC model).<\/p>\n<p>To make this a little more concrete, let C be the class of halfspaces, let X = <i>R<sup>n<\/sup><\/i>, and let Y = <i>{-1,1}<\/i>.  Let <i>opt = min<sub>h in C<\/sub><\/i> (the classification error of h with respect to D).  Then the goal is, given an *arbitrarily* labeled data set of points in <i>R<sup>n<\/sup><\/i>, to find a hypothesis whose true error with respect to D is at most <i>opt + eps<\/i>.  Furthermore, the solution should provably run in time polynomial in <i>n<\/i> and <i>1\/eps<\/i>.  Note that the hypothesis we output *need not be a halfspace* &#8211; we only require that its error rate be close to that of the optimal halfspace.<\/p>\n<p>Why choose C to be halfspaces?  At the heart of many of our most popular learning tools, such as Support Vector Machines, is an algorithm for learning a halfspace over a set of features in many dimensions. Perhaps surprisingly, *it is still not known*, given samples from an arbitrary distribution on <i>X x Y<\/i>, how to efficiently &#8216;find the best-fitting halfspace&#8217; over a set of features.  In fact, if we require our algorithm to output a halfspace as its hypothesis, then the agnostic halfspace learning problem is NP-hard.  
<a HREF=\"http:\/\/eccc.hpi-web.de\/eccc-reports\/2006\/TR06-061\/index.html\">Recent<\/a> <a HREF=\"http:\/\/eccc.hpi-web.de\/eccc-reports\/2006\/TR06-059\/index.html\">results<\/a> have shown that even when there is a halfspace with error rate .0001 with respect to D, it is NP-hard to output a halfspace with error rate .49999 (achieving .5 is of course trivial).<\/p>\n<p>The astute reader may comment that I am ignoring the whole point of the &#8216;kernel trick,&#8217; which is to provide an implicit representation of a halfspace over some feature set in many dimensions, thereby bypassing the above hardness result.  While this is true, I am still not aware of any results in the statistical learning literature that give provably efficient algorithms for agnostically learning halfspaces (regardless of the choice of kernel).<\/p>\n<p>My larger point is that a fundamental approach from statistical learning theory (running an SVM) is deeply connected to strong negative results from computational complexity theory.  Still, as I explain in the next paragraph, finding the &#8216;best-fitting halfspace&#8217; remains an open and important computational problem.<\/p>\n<p>Negative results notwithstanding, it now seems natural to ask: can we find any efficient algorithms for agnostically learning halfspaces if we allow ourselves to output hypotheses that are not halfspaces?  Briefly, the answer is &#8216;<a HREF=\"http:\/\/www.cs.columbia.edu\/~rocco\/papers\/focs05kkms.html\">to some extent Yes<\/a>,&#8217; if we restrict the marginal distribution on X to be one of our favorite nice distributions (such as the Gaussian over <i>R<sup>n<\/sup><\/i>). 
The solution runs in polynomial time for any constant <i>eps<\/i> &gt; 0 (for various reasons involving the &#8216;noisy XOR&#8217; topic three posts ago, obtaining efficient algorithms for subconstant <i>eps<\/i> seems intractable).<\/p>\n<p>The complexity of agnostically learning halfspaces with respect to arbitrary distributions remains open.  A solution (if one exists) will require a powerful new algorithmic idea and will have important consequences for our most commonly used machine learning tools.<\/p>\n<p>To respond to Andrej Bauer&#8217;s comment three posts ago:<\/p>\n<p>&#8216;I guess I have a more general question. I understand that for the purposes of proving lower bounds in computational complexity it&#8217;s reasonable to show theorems like &#8220;can&#8217;t learn XOR with NOT, AND, OR.&#8221; But I always thought that machine learning was more &#8220;positive&#8221;, i.e., you really want to learn whenever possible. So, instead of despairing one should try to use a different bag of tricks.&#8217;<\/p>\n<p>I think Andrej is right.  Solving future machine learning problems will require a more sophisticated &#8216;bag of tricks&#8217; than the one we have now, as most of our current algorithms are based on learning halfspaces (in a noisy or agnostic setting).  Computational learning theory gives us a methodology for how to proceed: find provably efficient solutions to unresolved computational problems &#8211; novel learning algorithms will no doubt follow.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Given John&#8217;s recent posts on CMU&#8217;s new machine learning department and &#8220;Deep Learning,&#8221; I asked for an opportunity to give a computational learning theory perspective on these issues. To my mind, the answer to the question &#8220;Are the core problems of machine learning different from the core problems of statistics?&#8221; is a clear Yes. 
The &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/hunch.net\/?p=250\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Thoughts regarding &#8220;Is machine learning different from statistics?&#8221;&#8221;<\/span><\/a><\/p>\n","protected":false},"author":164,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37,29],"tags":[],"class_list":["post-250","post","type-post","status-publish","format-standard","hentry","category-computation","category-machine-learning"],"_links":{"self":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/250","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/users\/164"}],"replies":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=250"}],"version-history":[{"count":0,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/250\/revisions"}],"wp:attachment":[{"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}