{"id":1729,"date":"2011-03-19T16:24:51","date_gmt":"2011-03-19T22:24:51","guid":{"rendered":"http:\/\/hunch.net\/?p=1729"},"modified":"2011-03-19T16:24:51","modified_gmt":"2011-03-19T22:24:51","slug":"the-ideal-large-scale-learning-class","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=1729","title":{"rendered":"The Ideal Large Scale Learning Class"},"content":{"rendered":"<p>At NIPS, <a href=\"http:\/\/ai.stanford.edu\/~ang\/\">Andrew Ng<\/a> asked me what should be in a large scale learning class.  After some discussion with him and <a href=\"http:\/\/www.cs.ubc.ca\/~nando\/\">Nando<\/a> and mulling it over a bit, these are the topics that I think should be covered.<\/p>\n<p>There are many different kinds of scaling.<\/p>\n<ol>\n<li><strong>Scaling in examples<\/strong>  This is the most basic kind of scaling.\n<ol>\n<li><strong>Online Gradient Descent<\/strong> This is an old algorithm&#8212;I&#8217;m not sure anyone in particular can be credited with it.  Perhaps the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Perceptron\">Perceptron<\/a> is a good precursor, but substantial improvements come from the notion of a loss function, of which <a href=\"http:\/\/en.wikipedia.org\/wiki\/Least_squares\">squared loss<\/a>, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Logistic_regression\">logistic loss<\/a>, hinge loss, and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Quantile_regression\">quantile loss<\/a> are all instances worth covering.  It&#8217;s important to cover the <a href=\"https:\/\/hunch.net\/?p=269\">semantics<\/a> of these loss functions as well.  <a href=\"https:\/\/hunch.net\/~vw\">Vowpal Wabbit<\/a> is a reasonably fast codebase implementing these.<\/li>\n<li><strong>Second Order Gradient Descent methods<\/strong> For some problems, methods taking into account second derivative information can be more effective.  
I&#8217;ve seen preconditioned conjugate gradient work well, for which <a href=\"http:\/\/www.cs.berkeley.edu\/~jrs\/\">Jonathan Shewchuk<\/a>&#8217;s <a href=\"http:\/\/www.cs.cmu.edu\/~jrs\/jrspapers.html#cg\">writeup<\/a> is reasonable. Nando likes <a href=\"http:\/\/en.wikipedia.org\/wiki\/L-BFGS\">L-BFGS<\/a>, which I don&#8217;t have much experience with.<\/li>\n<li><strong>Map-Reduce<\/strong> I have a love-hate relationship with the Map-Reduce framework.  In my experience, it&#8217;s an excellent filesystem, but it&#8217;s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms.  I liked what <a href=\"http:\/\/cs.markusweimer.com\/\">Markus<\/a> said at the <a href=\"http:\/\/lccc.eecs.berkeley.edu\/\">LCCC workshop<\/a>: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa.  The best way to use this for machine learning isn&#8217;t yet clear to me. <a href=\"http:\/\/hadoop.apache.org\/\">Hadoop<\/a> is probably the most commonly used open source implementation of Map-Reduce.<\/li>\n<\/ol>\n<\/li>\n<li><strong>Feature Scaling<\/strong>&#8212;what do you do when you have very many features?\n<ol>\n<li><a href=\"https:\/\/hunch.net\/~jl\/projects\/hash_reps\/index.html\">Hashing approaches<\/a> are surprisingly effective in my experience.  It&#8217;s a good idea to also present <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bloom_filters\">Bloom Filters<\/a>, as they substantially help with the intuition of where hashing works.<\/li>\n<li>Online <em>l<sub>1<\/sub><\/em> regularization can be done via <a href=\"http:\/\/jmlr.csail.mit.edu\/papers\/volume10\/langford09a\/langford09a.pdf\">truncated gradient<\/a>.  See Bob Carpenter&#8217;s <a href=\"http:\/\/lingpipe-blog.com\/2009\/05\/07\/langford-li-and-zhang-2009-sparse-online-learning-via-truncated-gradient\/\">discussion<\/a>. 
John Duchi&#8217;s <a href=\"http:\/\/www.cs.berkeley.edu\/~jduchi\/projects\/DuchiShSiTe10.html\">composite mirror descent<\/a> generalization is also a useful treatment.<\/li>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Boosting\">Boosting<\/a>-based approaches can also be effective, although training time can become problematic.  This is partially mitigated by parallelization algorithms, as discussed at the <a href=\"http:\/\/lccc.eecs.berkeley.edu\/\">LCCC workshop<\/a>; see <a href=\"http:\/\/videolectures.net\/nipsworkshops2010_ye_gbd\/\">Jerry Ye&#8217;s talk<\/a> and <a href=\"http:\/\/videolectures.net\/nipsworkshops2010_svore_lrc\/\">Krysta&#8217;s talk<\/a>.  A really good public implementation of this is so far missing, as far as I know.<\/li>\n<\/ol>\n<\/li>\n<li><strong>Test-time Evaluation<\/strong>  Ultrafast and efficient test-time evaluation seems to be a goal independent of training.\n<ol>\n<li><strong>Indices<\/strong> One way to speed things up is with inverted indices.  Aside from the basic data structure, I&#8217;d cover <a href=\"http:\/\/www.cis.upenn.edu\/~jstoy\/cis650\/papers\/WAND.pdf\">WAND<\/a> and <a href=\"https:\/\/hunch.net\/~jl\/projects\/predictive_indexing\/predictive_indexing.pdf\">predictive indexing<\/a>.<\/li>\n<li><strong>GPU<\/strong> The use of GPUs to make evaluation both faster and more efficient seems to make sense in many applications.<\/li>\n<\/ol>\n<\/li>\n<li><strong>Labels<\/strong>\n<ol>\n<li><strong>Sourcing<\/strong> Just acquiring sufficient label information can be problematic.\n<ol>\n<li><a href=\"https:\/\/www.mturk.com\/mturk\/welcome\">Mechanical Turk<\/a> can be an efficient approach. 
The basic approach can be improved with <a href=\"http:\/\/books.nips.cc\/papers\/files\/nips23\/NIPS2010_0577.pdf\">some work<\/a>.<\/li>\n<li>When you are paying directly for labels, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Active_learning_%28machine_learning%29\">active learning<\/a> approaches can substantially cut your costs.  <a href=\"http:\/\/burrsettles.com\/\">Burr Settles<\/a>&#8217;s <a href=\"http:\/\/active-learning.net\/\">active learning survey<\/a> is pretty comprehensive, although if I were to cover one algorithm, it would be <a href=\"http:\/\/arxiv.org\/abs\/1006.2588\">this one<\/a>, which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.<\/li>\n<li>The other common approach is user-feedback information, where bias and exploration effects become a critical concern.  The tutorial <a href=\"https:\/\/hunch.net\/~beygel\/\">Alina<\/a> and I did on <a href=\"http:\/\/videolectures.net\/kdd2010_beygelzimer_langford_lte\/\">learning and exploration<\/a> is essential here.<\/li>\n<\/ol>\n<\/li>\n<li><strong>Many Labels<\/strong>  It&#8217;s common to need to make a complex prediction.\n<ol>\n<li><strong>Label Tree<\/strong> based approaches are a good starting point.  I&#8217;d discuss the inconsistency of the naive approach and the Filter Tree, <a href=\"http:\/\/arxiv.org\/abs\/0902.3176\">discussed here<\/a>.  <a href=\"http:\/\/arxiv.org\/abs\/0903.4217\">Online tree building and conditional probability trees<\/a> are also potentially extremely useful.  Building smarter trees can help, such as with a <a href=\"http:\/\/www.kyb.mpg.de\/bs\/people\/weston\/papers\/trees.pdf\">confusion matrix<\/a> or in an <a href=\"http:\/\/www.cs.toronto.edu\/~amnih\/papers\/hlbl_final.pdf\">iterative fashion<\/a>.<\/li>\n<li>Label Tree approaches break down when the number of labels becomes so large that filtering eliminates too many examples.  
Here <strong>Structured Prediction<\/strong> techniques become particularly important.  I&#8217;d cover <a href=\"http:\/\/www.cs.utah.edu\/~hal\/searn\/\">Searn<\/a> as well as some of <a href=\"http:\/\/www.autonlab.org\/autonweb\/10243.html\">Drew Bagnell<\/a>&#8217;s work, such as <a href=\"http:\/\/arxiv.org\/abs\/1011.0686\">this paper<\/a>.  Many other people are still interested in <a href=\"http:\/\/en.wikipedia.org\/wiki\/Conditional_random_field\">CRF<\/a>s or <a href=\"http:\/\/books.nips.cc\/papers\/files\/nips16\/NIPS2003_AA04.pdf\">Max-Margin Markov Networks<\/a>, which I find somewhat less compelling for computational reasons.<\/li>\n<li><strong>Cascade Learning<\/strong> is also a compelling approach.  The canonical paper on this is the <a href=\"http:\/\/www.cmucam.org\/raw-attachment\/wiki\/viola-jones\/viola-ijcv04.pdf\">Viola-Jones Face Detector<\/a>.  I&#8217;m sure there&#8217;s much other vision-related work on cascades with which I&#8217;m unfamiliar.  A more recent instance is the <a href=\"http:\/\/www.seas.upenn.edu\/~taskar\/pubs\/aistats10cascades.pdf\">structured prediction cascade<\/a>.<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<\/li>\n<\/ol>\n<p>What else is essential and missing?<\/p>\n","protected":false},"excerpt":{"rendered":"<p>At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered. There are many different kinds of scaling. 
Scaling in examples This is the most basic kind of &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/hunch.net\/?p=1729\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;The Ideal Large Scale Learning Class&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[29,5],"tags":[],"class_list":["post-1729","post","type-post","status-publish","format-standard","hentry","category-machine-learning","category-teaching"],"_links":{"self":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/1729","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1729"}],"version-history":[{"count":0,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/1729\/revisions"}],"wp:attachment":[{"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1729"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1729"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1729"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}