{"id":4464948,"date":"2016-07-11T07:14:43","date_gmt":"2016-07-11T13:14:43","guid":{"rendered":"http:\/\/hunch.net\/?p=4464948"},"modified":"2016-07-11T07:14:43","modified_gmt":"2016-07-11T13:14:43","slug":"the-multiworld-testing-decision-service","status":"publish","type":"post","link":"https:\/\/hunch.net\/?p=4464948","title":{"rendered":"The Multiworld Testing Decision Service"},"content":{"rendered":"<p>We made a <a href=\"http:\/\/arxiv.org\/abs\/1606.03966\">tool<\/a> that you can <a href=\"http:\/\/aka.ms\/mwt\">use<\/a>. It is the first general purpose reinforcement-based learning system \ud83d\ude42<\/p>\n<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning\">Reinforcement learning<\/a> is much discussed these days with successes like <a href=\"https:\/\/en.wikipedia.org\/wiki\/AlphaGo\">AlphaGo<\/a>.  Wouldn&#8217;t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems?  But there is a well-known problem: It&#8217;s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc&#8230;) fail catastrophically.  That&#8217;s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.  <\/p>\n<p>Removing the credit assignment problem from reinforcement learning yields the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Multi-armed_bandit#Contextual_Bandit\">Contextual Bandit<\/a> setting which <a href=\"http:\/\/arxiv.org\/abs\/1402.0555\">we know is generically solvable<\/a> in the same manner as common <a href=\"https:\/\/en.wikipedia.org\/wiki\/Supervised_learning\">supervised learning<\/a> problems.  I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.<\/p>\n<p>Can we make this dramatically easier?  
We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes.  These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision.  Naturally, this is what we&#8217;ve done and now it can be <a href=\"http:\/\/aka.ms\/mwt\">used by anyone<\/a>.  This drops the barrier to use down to: &#8220;Do you have permissions?  And do you have a reasonable idea of what a good <a href=\"https:\/\/en.wikipedia.org\/wiki\/Feature_%28machine_learning%29\">feature<\/a> is?&#8221;<\/p>\n<p>A key foundational idea is <strong>Multiworld Testing<\/strong>: the capability to evaluate large numbers of policies mapping features to actions in a manner exponentially more efficient than standard <a href=\"https:\/\/en.wikipedia.org\/wiki\/A%2FB_testing\">A\/B testing<\/a>.  This is used pervasively in the Contextual Bandit literature and you can see it in action for <a href=\"http:\/\/arxiv.org\/abs\/1606.03966\">the system<\/a> we&#8217;ve <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/project\/multi-world-testing-mwt\/\">made at Microsoft Research<\/a>.  The key design principles are:<\/p>\n<ol>\n<li>Contextual Bandits.  Many people have tried to create online learning systems that do not take into account the biasing effects of decisions.  These fail near-universally.  For example, they might be very good at predicting what <em>was<\/em> shown (and hence clicked on) rather than what <em>should<\/em> be shown to generate the most interest.<\/li>\n<li>Data Lifecycle support.  This system supports the entire process of data collection, joining, learning, and deployment.  Doing this eliminates many stupid-but-killer bugs that I&#8217;ve seen in practice.<\/li>\n<li>Modularity.  
The system decomposes into pieces: exploration library, client library, online learner, join server, etc&#8230;  because I&#8217;ve seen too many cases where the pieces are useful but the system is not.  <\/li>\n<li>Reproducibility.  Everything is logged in a fashion which makes online behavior offline reproducible.  Consequently, the system is debuggable and hence improvable.<\/li>\n<\/ol>\n<p>The system we&#8217;ve created is open source with system components in <a href=\"https:\/\/github.com\/Microsoft\/mwt-ds\">mwt-ds<\/a> and the core learning algorithms in <a href=\"https:\/\/github.com\/JohnLangford\/vowpal_wabbit\">Vowpal Wabbit<\/a>.  If you use everything, it enables a fully automatic causally sound learning loop for contextual control of a small number of actions.   This is strongly scalable; for example, a version of this is in use for personalized news on <a href=\"http:\/\/www.msn.com\/\">MSN<\/a>.  It can be either low-latency (with a client side library) or cross platform (with a <a href=\"https:\/\/en.wikipedia.org\/wiki\/JSON\">JSON<\/a> <a href=\"https:\/\/en.wikipedia.org\/wiki\/Representational_state_transfer\">REST<\/a> web interface).  Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines.  The system autodeploys into a chosen <a href=\"https:\/\/account.windowsazure.com\/Home\/Index\">Azure<\/a> account with a baseline cost of about $0.20\/hour.  The autodeployment takes a few minutes after which you can test or use the system as desired.<\/p>\n<p>This system is open source and there are many ways for people to help if they are interested.  For example, support for the client-side library in more languages, support of other learning algorithms &#038; systems, better documentation, etc&#8230; are all obviously useful.  <\/p>\n<p>Have fun.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We made a tool that you can use. 
It is the first general purpose reinforcement-based learning system \ud83d\ude42 Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn&#8217;t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It&#8217;s very &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/hunch.net\/?p=4464948\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;The Multiworld Testing Decision Service&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,42,74,29,7,11],"tags":[],"class_list":["post-4464948","post","type-post","status-publish","format-standard","hentry","category-announcements","category-code","category-interactive","category-machine-learning","category-online","category-reinforcement"],"_links":{"self":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/4464948","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4464948"}],"version-history":[{"count":0,"href":"https:\/\/hunch.net\/index.php?rest_route=\/wp\/v2\/posts\/4464948\/revisions"}],"wp:attachment":[{"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4464948"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4464948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hunch.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags
&post=4464948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}