{"id":171,"date":"2006-03-02T18:17:40","date_gmt":"2006-03-03T00:17:40","guid":{"rendered":"http:\/\/hunch.net\/?p=171"},"modified":"2006-03-02T18:18:23","modified_gmt":"2006-03-03T00:18:23","slug":"why-do-people-count-for-learning","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=171","title":{"rendered":"Why do people count for learning?"},"content":{"rendered":"<p>This post is about a confusion of mine with respect to many commonly used machine learning algorithms.   <\/p>\n<p>A simple example where this comes up is Bayes net prediction.  A Bayes net where a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence.  The joint probability distribution over the variables is given by a set of conditional probabilities.  For example, a very simple Bayes net might express:<br \/>\n<em>P(A,B,C) = P(A | B,C)P(B)P(C)<\/em><\/p>\n<p>What I don&#8217;t understand is the mechanism commonly used to estimate <em>P(A | B, C)<\/em>.  If we let <em>N(A,B,C)<\/em> be the number of instances of <em>A,B,C<\/em> then people sometimes form an estimate according to:<br \/>\n<center><em>P'(A | B,C) = N(A,B,C) \/ N \/[N(B)\/N * N(C)\/N] = N(A,B,C) N \/[N(B)  N(C)]<\/em><\/center><br \/>\n&#8230; in other words, people just estimate <em>P'(A | B,C)<\/em> according to observed relative frequencies.  This is a reasonable technique when you have a large number of samples compared to the size space <em>A x B x C<\/em>, but it (naturally) falls apart when this is not the case as typically happens with &#8220;big&#8221; learning problems such as machine translation, vision, etc&#8230;<\/p>\n<p>To compensate, people often try to pick some prior (such as Dirichlet prior with one &#8220;virtual count&#8221; per joint parameter setting) to provide a reasonable default value for the count.   
<p>Naturally, in the &#8220;big learning&#8221; situations where this applies, the precise choice of prior can greatly affect system performance, leading to finicky tuning of various sorts. It&#8217;s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict <em>A</em> given <em>B</em> and <em>C</em>.</p>
<p>Stepping back a bit, we can think of the estimation of <em>P(A | B,C)</em> as a simple self-contained prediction (sub)problem. Why don&#8217;t we use existing technology for doing this prediction? Viewed as a learning algorithm, &#8220;counting with a Dirichlet prior&#8221; is exactly memorizing the training set and then predicting according to either (precisely) matching training set elements or using a default. It&#8217;s hard to imagine a more primitive learning algorithm.</p>
<p>There seems to be little downside to trying this approach. In low-count situations, a general purpose prediction algorithm has a reasonable hope of performing well. In high-count situations, any reasonable general purpose algorithm converges to the same estimate as above. In either case something reasonable happens.</p>
<p>Using a general purpose probabilistic prediction algorithm isn&#8217;t a new idea (for example, see <a href="http://citeseer.ist.psu.edu/compress/0/papers/cs/20655/http:zSzzSzwww.ai.mit.eduzSzprojectszSzjmlrzSzpaperszSzvolume1zSzheckerman00azSzheckerman00a.ps.gz/heckerman00dependency.ps">page 57</a>), but it appears greatly underutilized. This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception.</p>
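<p>The substitution suggested above can be sketched as follows. This is a hypothetical stand-in, not the post&#8217;s method: it assumes binary variables and uses a hand-rolled logistic regression as the &#8220;general purpose probabilistic prediction algorithm&#8221;; any learner that outputs conditional probabilities would slot into the <em>P(A | B,C)</em> position the same way.</p>

```python
import math

def fit_logistic(samples, steps=5000, lr=1.0):
    """Fit P(A=1 | B, C) by batch gradient descent on logistic log-loss.

    `samples` is a list of (a, b, c) tuples with a in {0, 1}.  The model
    stands in for any general purpose probabilistic predictor."""
    w = [0.0, 0.0, 0.0]  # bias, weight on B, weight on C
    n = len(samples)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for a, b, c in samples:
            p = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * b + w[2] * c)))
            err = p - a  # derivative of log-loss with respect to the logit
            grad[0] += err
            grad[1] += err * b
            grad[2] += err * c
        for i in range(3):
            w[i] -= lr * grad[i] / n
    return w

def predict(w, b, c):
    """The learned conditional probability P(A=1 | B=b, C=c)."""
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * b + w[2] * c)))
```

With plenty of data the learned conditional matches the relative-frequency estimate on observed cells, while at low counts the model generalizes across cells instead of returning a bare default, which is exactly the behavior argued for above.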