Machine Learning (Theory)

Headroom for AI development

John Langford — Wed, 05 Mar 2025 20:16:49 +0000

(Dylan Foster and Alex Lamb both helped in creating this.)

When thinking about what makes a good research problem, it’s sometimes helpful to switch from what is understood to what is clearly possible. This encourages us to think beyond simply improving the existing system. For example, we have seen instances throughout the history of machine learning where researchers have argued for fixing an architecture and using it for short-term success, ignoring the potential for long-term disruption. As an example, the speech recognition community spent decades focusing on Hidden Markov Models at the expense of other architectures, before eventually being disrupted by advancements in deep learning. Support Vector Machines were disrupted by deep learning, and convolutional neural networks were displaced by transformers. This pattern may repeat for the current transformer/large language model (LLM) paradigm. Here are some quick calculations suggesting it may be possible to do significantly better along multiple axes. Examples include the following:

Language learning efficiency: A human baby can learn a good model for human language after observing 0.01% of the language tokens typically used to train a large language model.
Representational efficiency: A tiny Portia spider with a brain a million times smaller than a human can plan a course of action and execute it over the course of an hour to catch prey.
Long-term planning and memory: A squirrel caches nuts and returns to them after months of experience, which would correspond to keeping billions of visual tokens in context using current techniques.

The core of this argument is that it is manifestly viable to do better along multiple axes, including sample efficiency and the ability to perform complex tasks requiring memory. All these examples highlight advanced capabilities that can be achieved at scales well below what is required by existing transformer architectures and training methodologies (in terms of either data or compute). This is in no way meant as an attack on transformer architectures; they are a highly disruptive technology, accomplishing what other types of architectures have not, and they will likely serve as a foundation for further advances. However, there is much more to do.

Next, we delve into each of the examples above in greater detail.

Sample complexity: The language learning efficiency gap

The sample efficiency gap is perhaps best illustrated by considering the core problem of language modeling where a transformer is trained to learn language. A human baby starts with no appreciable language but learns it well before adulthood. Reading at 300 words per minute with 1.3 tokens/word on average implies 6.5 tokens/second. Speaking is typically about half of reading speed, implying three tokens per second. Sleeping and many other daily activities of course involve no tokens per second. Overall, one language token per second is a reasonable rough estimate of what a child observes. At this rate, 31 years must pass before they observe a billion tokens. Yet speculations about GPT-4 suggest four orders of magnitude more tokens than a human observes in the process of learning. Closing this language learning efficiency gap (or more generally, sample efficiency gap) can have significant impact at multiple scales:

Large models: Organizations have already scraped most of the internet and exhausted natural sources for high-quality tokens (e.g., arXiv, Wikipedia). To continue improving the largest models, better sample efficiency may be required.
Small models: In spite of significant advances, further improvements to sample efficiency may be required if we want small language models (e.g., at the 3B scale) to reach the same level of performance as frontier models like GPT-4.
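
The back-of-envelope numbers in this section can be reproduced with a short Python sketch. The 300 words per minute reading rate (the figure that yields the quoted 6.5 tokens/second), the 1.3 tokens per word, and the one-token-per-second average come from the estimate above. The 10T-token training set is an assumed round figure standing in for the speculated GPT-4 scale, not a confirmed number.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Figures quoted in the text.
reading_words_per_minute = 300
tokens_per_word = 1.3
average_tokens_per_second = 1.0  # rough all-day average, per the text

# Assumption for illustration only: a 10T-token training set.
assumed_llm_training_tokens = 1e13

reading_tokens_per_second = reading_words_per_minute / 60 * tokens_per_word
years_to_billion_tokens = 1e9 / average_tokens_per_second / SECONDS_PER_YEAR
gap_orders = math.log10(assumed_llm_training_tokens / 1e9)

print(f"reading rate: {reading_tokens_per_second:.1f} tokens/second")
print(f"years to observe 1B tokens: {years_to_billion_tokens:.1f}")
print(f"gap: {gap_orders:.0f} orders of magnitude")
```

Changing the assumed training set size shifts the gap accordingly; the human side of the estimate is what the text pins down.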

There are common arguments against the existence of a language efficiency gap which appear unconvincing.

Maybe a better choice of tokens is all you need?

This can’t be entirely ruled out, but the Phi series was an effort in this direction with the latest model trained on 10T tokens, implying there’s still a four-orders-of-magnitude efficiency gap between a human and a model which is still generally weaker than GPT-4 along most axes. It is possible that more sophisticated interactive data collection approaches could help close this gap, but this is largely unexplored.

Maybe language learning is evolutionary?

The chimpanzee-human split is estimated to have occurred between 5M and 13M years ago, resulting in a 35 million base-pair difference. The appearance of language is estimated to have occurred between 2.5M and 150K years ago. Estimating divergence at 10M years ago and language occurring 1M years ago, with a stable rate of evolution on both sides, suggests a crude upper bound of 35M/10/2 = 1.75M base pairs (or, around 3.5M bits) on the number of DNA bits encoding language inheritance. That’s around 5 orders of magnitude less than the number of parameters in a modern LLM, so this is not a viable explanation for the language efficiency gap.

On the other hand, it could be the case that the evolutionary lineage of humans evolved most language precursors long before actual language. The human genome has about 3.1B base pairs, with about one-third of proteins primarily expressed in the brain. An estimate of 1B brain-related base pairs (around 2B bits) is still around two orders of magnitude smaller than the LLMs in use today, so it’s not a viable explanation for the language learning efficiency gap. It is plausible that the structure of neurons in a human brain, which strongly favors sparse connections over the dense connections favored by a GPU, is advantageous for learning purposes.
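
The two genome-based bounds can be checked with a small Python sketch. The base-pair counts and time estimates are the ones quoted above. The 200B-parameter model size is an assumption chosen only for illustration, and comparing DNA bits directly to a parameter count follows the same crude comparison the text makes.

```python
import math

BITS_PER_BASE_PAIR = 2  # four possible nucleotides -> log2(4) bits

# Figures quoted in the text.
chimp_human_difference_bp = 35e6
divergence_years = 10e6
language_years = 1e6
brain_related_bp = 1e9

# Assumption for illustration only: an LLM with 200B parameters.
assumed_llm_parameters = 2e11

# Share of the divergence time since language appeared, halved because
# only one of the two lineages (the human one) carries the changes.
language_bp = chimp_human_difference_bp * (language_years / divergence_years) / 2
language_bits = language_bp * BITS_PER_BASE_PAIR
brain_bits = brain_related_bp * BITS_PER_BASE_PAIR

def orders_below_llm(bits):
    """Orders of magnitude between `bits` and the assumed parameter count."""
    return math.log10(assumed_llm_parameters / bits)

print(f"language bound: {language_bp:.3g} base pairs, {language_bits:.3g} bits")
print(f"orders below LLM (language bound): {orders_below_llm(language_bits):.1f}")
print(f"orders below LLM (brain-related):  {orders_below_llm(brain_bits):.1f}")
```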

Maybe human language learning is accelerated by multimodality?

Humans benefit from a rich sensory world, notably through visual perception, which extends far beyond language. Estimating a “token count” for this additional information is difficult, however. For example, if someone is reading a book at 6.5 tokens per second, are they benefiting from all the extra sensory information? A recent paper puts the rate at which information is consciously processed in a human brain at effectively 10 bits/second which is only modestly more than the cross entropy of a language model. More generously, we could work from the common saying that “a picture is worth a thousand words” which is not radically different from techniques for encoding images into transformers. Using this, we could estimate that extra modalities increase the number of tokens by three orders of magnitude, resulting in 1T tokens observed by age 31. Given this, there is still an order-of-magnitude learning efficiency gap between humans and language models of the same class as GPT-4.
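
The generous multimodal estimate works out as follows in a short Python sketch. The billion language tokens and the thousand-fold multiplier are taken from the paragraph above; the 10T-token training set is again an assumed round figure rather than a confirmed one.

```python
import math

# Figures quoted in the text.
language_tokens_by_age_31 = 1e9   # one token per second for 31 years
multimodal_multiplier = 1e3       # "a picture is worth a thousand words"

# Assumption for illustration only: a 10T-token training set.
assumed_llm_training_tokens = 1e13

multimodal_tokens = language_tokens_by_age_31 * multimodal_multiplier
remaining_gap_orders = math.log10(assumed_llm_training_tokens / multimodal_tokens)

print(f"multimodal tokens by age 31: {multimodal_tokens:.0e}")
print(f"remaining gap: {remaining_gap_orders:.0f} order(s) of magnitude")
```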

Maybe the learning efficiency gap does not matter?

In some domains, it may be possible to overcome the inefficiencies of a learning architecture by simply gathering more and more data as needed. At a scientific level, this is not a compelling argument, since understanding the fundamental limits of what is possible is the core purpose of science. Hence, this is a business argument, which may indeed be valid in some cases. A business response is that learning efficiency matters in domains where it is difficult or impossible to collect sufficient data: think of robot demonstrations, personalizing models, problems with long range structure, a universal translator encountering a new language, and so on. In addition, improving learning efficiency may lead to improvement in other forms of efficiency (e.g., memory and compute) via architectural improvements.

Model size: The representational efficiency gap

A second direction in which transformer-based models can be improved lies in model size, or representational efficiency. This is perhaps best illustrated by considering the problem of designing models or agents capable of physical or animal-like intelligence. This includes capabilities like 1) understanding one’s environment (perception); 2) moving oneself around (locomotion); and 3) planning to reach goals or accomplish basic tasks (e.g., navigation and manipulation). Naturally, this is very relevant if our goal is to build foundation models for embodied decision making.

The Portia spider has a brain one million times smaller than that of a human, yet it is observed to plan a course of action and execute it successfully over durations as long as an hour. Stated another way, it is possible to engage in significant physical intelligence behavior with 100M floats representing the neural connections and a modest gigaflop CPU capable of executing them in real time. This provides a strong case that much animal intelligence can be radically more representationally efficient than what has been observed in lingual domains, or yet implemented in software. A concrete question along these lines is:

Can we design a model with 100M floats that can effectively navigate and accomplish physical-intelligence tasks in the real world?

It is not clear whether there is an existing model of any size that can effectively do this. The most famous examples in this direction are game agents, which only function in relatively simple environments.
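
To get a feel for the budget in the question above, here is a rough Python sketch. The 100M floats and the gigaflop CPU are from the text. The float32 storage and the cost of two floating point operations per parameter (one multiply, one add) are assumptions, as is the idea of a dense pass that touches every parameter; sparse or event-driven computation would change the numbers.

```python
# Figures quoted in the text.
num_floats = 100e6           # parameters standing in for neural connections
cpu_flops_per_second = 1e9   # "a modest gigaflop CPU"

# Assumptions for illustration only.
bytes_per_float = 4          # float32 storage
flops_per_parameter = 2      # one multiply and one add per connection

memory_megabytes = num_floats * bytes_per_float / 1e6
flops_per_full_pass = num_floats * flops_per_parameter
passes_per_second = cpu_flops_per_second / flops_per_full_pass

print(f"memory: {memory_megabytes:.0f} MB")
print(f"full passes per second: {passes_per_second:.0f}")
```

Under these assumptions the whole model fits comfortably in the memory of a small embedded board, which is part of what makes the question concrete.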

Are transformer models for language representationally inefficient?

While the discussion above concerns representational efficiency for physical intelligence, it is also interesting to consider representational efficiency for language. That is, are existing language models representationally efficient, or can similar capabilities be achieved with substantially smaller models? On the one hand, it is possible that language is an inherently complex process to both produce and understand. On the other hand, it might be possible to represent human level language in a radically more size-efficient manner, as in the case of physical intelligence.

To this end, one interesting example is given by Alex, a grey parrot that managed to learn and meaningfully use a modest vocabulary with a brain one-hundredth the size of a human brain by weight. If we accept the computational model of a neuron as a nonlinearity on a linear integration, Alex might have 1B neurons operating at 1T flops. Given Alex’s limited language ability, this isn’t constraining enough to decisively argue that language models that are substantially smaller than current models can be achieved. At the same time, it is plausible that most of Alex’s brain was not devoted to human language, offering some hope.
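
The computational model of a neuron mentioned above, a nonlinearity applied to a linear integration, can be sketched in a few lines of Python. The choice of tanh as the nonlinearity and the function name `neuron` are illustrative assumptions; the neuron and flop counts are the ones quoted for Alex.

```python
import math

def neuron(weights, inputs, bias=0.0):
    """A neuron as a nonlinearity applied to a linear integration."""
    integration = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(integration)  # tanh is an arbitrary choice here

# Figures quoted in the text.
num_neurons = 1e9
total_flops_per_second = 1e12

flops_per_neuron_per_second = total_flops_per_second / num_neurons

print(neuron([0.5, -0.25], [1.0, 2.0]))
print(f"flops per neuron per second: {flops_per_neuron_per_second:.0f}")
```

The quoted figures leave about a thousand floating point operations per neuron per second, which has to cover every incoming connection and every update.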

The long-term memory and planning gap

A third direction concerns developing models and agents suitable for domains that involve complex long-term interactions, which may necessitate the following capabilities:

Memory: Effectively summarizing the history of interaction into a succinct representation and using it in context.

Planning: Choosing the next actions or tokens deliberately to achieve a long range goal.

Recent advances like O1 and R1 handle relatively short range planning but are significant advancements in this vein. Existing applications of transformer language models largely avoid long-term interactions, since they can deviate from instructions. To highlight why we might expect to improve this situation, note that humans manage to engage in coherent plans over years-long timescales. Human-level intelligence isn’t required for this, though, as many animals exhibit behaviors that require long-timescale memory and planning. For example, a squirrel with a brain less than one-hundredth the size of a human brain stores food and reliably comes back to it after months of experience. Restated in a transformer-relevant way, a squirrel can experience billions of intervening (and potentially distracting) visual tokens before recalling the location of a cache of food and returning to it. How can we develop competitive models and agents with this capability?
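
One way to see where "billions" of intervening visual tokens could come from is the following Python sketch. Every number in it is an assumption made for illustration: a thousand tokens per image (borrowing the "picture is worth a thousand words" heuristic), one image per waking second, twelve waking hours a day, and roughly three months between caching and recall.

```python
# Assumptions for illustration only; none of these come from measurements.
tokens_per_image = 1000        # "a picture is worth a thousand words"
images_per_second = 1          # one visual snapshot per waking second
waking_hours_per_day = 12
days_between_cache_and_recall = 90   # roughly three months

waking_seconds = waking_hours_per_day * 3600 * days_between_cache_and_recall
intervening_tokens = tokens_per_image * images_per_second * waking_seconds

print(f"intervening visual tokens: {intervening_tokens:,}")
```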

Does it matter?

A common approach to circumvent memory and planning limitations of existing models is to create an outer-level executor that uses the LLM as a subroutine, combined with other tools for memory or planning systems. These approaches tacitly acknowledge the limits of current architectures by offering an alternative solution. Historically, as for machine vision or speech recognition, it has always been more difficult to create a learning system that accomplishes the task of interest with end-to-end training, but it was worthwhile when done as the results were superior. This pattern may repeat for long-term memory and planning, yielding better solutions.

An AI Miracle Malcontent

John Langford — Wed, 05 Apr 2023 21:44:38 +0000

The stark success of OpenAI’s GPT4 model surprised me, shifting my view from “really good autocomplete” (roughly in line with intuitions here) to a dialog agent exhibiting a significant scope of reasoning and intelligence. Some of the MSR folks did a fairly thorough study of capabilities which seems like a good reference. I think of GPT4 as an artificial savant: super-John capable in some language-centric tasks like style and summarization with impressive yet more limited abilities in other domains like spatial and reasoning intelligence.

And yet, I’m unhappy with mere acceptance because there is a feeling that a miracle happened. How is this not a miracle, at least with hindsight? And given this, it’s not surprising to see folks thinking about more miracles. The difficulty with miracle thinking is that it has no structure upon which to reason for anticipation of the future, prepare for it, and act rationally. Given that, I wanted to lay out my view in some detail and attempt to understand enough to de-miracle what’s happening and what may come next.

Deconstructing The Autocomplete to Dialog Miracle
One of the ironies of the current situation is that an organization called “OpenAI” created AI and isn’t really open about how they did it. That’s an interesting statement about economic incentives and focus. Nevertheless, back when they were publishing, the InstructGPT paper suggested something interesting: that reinforcement learning on a generative model substrate was remarkably effective, good for 2 to 3 orders of magnitude improvement in the quality of response with a tiny (in comparison to language sources for next word prediction) amount of reinforcement learning. My best guess is that this was the first combination of 3 vital ingredients.

Learning to predict the next word based on vast amounts of language data from the internet. I have no idea how much, but wouldn’t be surprised if it’s a million lifetimes of reading generated by a billion people. That’s a vast amount of information there with deeply intermixed details about the world and language.
1. Why not other objectives? Well, they wanted something simple so they could maximize scaling. There may indeed be room for improvement in choice of objective.
2. Why language? Language is fairly unique amongst information in that it’s the best expression of conscious thought. There is thought without language (yes, I believe animals think in various ways), but you can’t really do language without thought.
The use of a large deep transformer model (pseudocode here) to absorb all of this information. Large here presumably implies training on many GPUs with both data and model parallelism. I’m sure there are many fine engineering tricks here. I’m unclear on the scale, but expect the answer is more than thousands and less than millions.
1. Why transformer models? At a functional level, they embed ‘soft attention’ (=ability to look up a value with a key in a gradient friendly way). At an optimization level, they are GPU-friendly.
2. Why deep? The drive to minimize word prediction error in the context of differentiable depth creates a pressure to develop useful internal abstractions.
Reinforcement learning on a small amount of data which ‘awakens’ a dialog agent. With the right prompt (=prefix language) engineering a vanilla large language model can address many tasks as the information is there, but it’s awkward and clearly not a general purpose dialog agent. At the same time, the learned substrate is an excellent representation upon which to apply RL creating a more active agent while curbing an inherited tendency to mimic internet flamebait.
1. Why reinforcement learning? One of the oddities of language is that there is more than one way of saying things. Hence, the supervised learning view that there is a right answer and everything else is wrong sets up inherent conflicts in the optimization. Instead, “reinforcement learning from human feedback” pairs inverse reinforcement learning to discover a reward function and basic reinforcement learning to achieve better performance. What’s remarkable about this is that the two-step approach is counter to the data processing inequality.
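
The ‘soft attention’ idea mentioned in the list above (looking up a value with a key in a gradient friendly way) can be illustrated with a minimal Python sketch. This is not the pseudocode the post links to, and the function name `soft_lookup` is hypothetical; it only shows a scaled dot-product lookup where a softmax over query-key scores replaces a hard dictionary lookup.

```python
import math

def soft_lookup(query, keys, values):
    """Soft attention: a differentiable key-value lookup.

    Returns a softmax-weighted average of `values`, weighted by how well
    each key matches the query (scaled dot product).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
result = soft_lookup([10.0, 0.0], keys, values)
print(result)  # close to the first value, since the query matches the first key
```

Because every step is a smooth function of the query, keys, and values, gradients flow through the lookup, which is what makes it trainable; a query that matches no key in particular returns a blend of all values.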

The overall impression that I’m left with is something like the “ghost of the internet”. If you ask the internet for the answer to a question on the best forum available and get an answer, it might be in the ballpark of as useful and as correct as that which GPT4 provides (notably, in seconds). Peter Lee’s book on the application to medicine is pretty convincing. There are pluses and minuses here: GPT4’s abstraction of language tasks like summarization and style appears super-human, or at least better than I can manage. For commonly discussed content (e.g. medicine) it’s fairly solid, but for less commonly discussed content (say, Battletech fan designs) it becomes sketchy as the internet gives out. There are obviously times when it errs (often egregiously in a fully confident way), but that’s also true in internet forums. I specifically don’t trust GPT4 with math and often find its reasoning and abstraction abilities shaky, although it’s deeply impressive that they exist at all. And driving a car is out because it’s a task that you can’t really describe.

What about the future?
There’s been a great deal about the danger of AI discussed recently, and quite a mess of misexpectations about where we are.

Are GPT4 and future variants the answer to [insert intelligence-requiring problem here]? GPT4 seems most interesting as a language intelligence. It’s clearly useful as an advisor or a brainstormer. The meaning of “GPT5” isn’t clear, but I would expect substantial shifts in core algorithms/representations are necessary for mastering other forms of intelligence like memory, skill formation, information gathering, and optimized decision making.
Are generative models the end of consensual reality? Human societies seem to have a systematic weakness in that people often prefer a consistent viewpoint even at the expense of fairly extreme rationalization. That behavior in large language models is just looking at our collective behavior through a mirror. Generative model development (both language and video) does have a real potential to worsen this. I believe we should be making real efforts as a society to harden and defend objective reality in multiple ways. This is not specifically about AI, but it would address a class of AI-related concerns and improve society generally.
Is AI about to kill everyone? Yudkowsky’s editorial gives the impression that a Terminator style apocalypse is just around the corner. I’m skeptical about the short term (the next several years), but the longer term requires thought.
1. In the short term there are so many limitations of even GPT4 (even though it’s a giant advance) that I both lack the imagination to see a path to “everyone dies” and I expect it would be suicidal for an AI as well. GPT4, as an AI, is using the borrowed intelligence of the internet. Without that source it’s just an amalgamation of parameters with no interesting capabilities.
2. For the medium term, I think there’s a credible possibility that drone warfare becomes ultralethal in line with this imagined future. You can already see drone warfare in the Ukraine-Russia war significantly increasing the lethality of a battlefield. This requires some significant advances, but nothing seems outlandish. Counterdrone technology development and limits on usage in line with other war machines seem prudent.
3. For the longer term, Vinge’s classical singularity essay is telling here as he lays out the inevitability of developing intelligence for competitive reasons. Economists are often fond of pointing out how job creation has accompanied previous mechanization induced job losses, and yet my daughter points out how we keep increasing the amount of schooling children must absorb to be capable members of society. It’s not hard to imagine a desolation of jobs in a decade or two where AIs can simply handle almost all present-day jobs and most humans can’t skill-up to be economically meaningful. Our society is not prepared for this situation, which seems like a quite serious and possibly inevitable outcome. Positive models for a nearly-fully-automated society are provided by Star Trek and Iain Banks, although science fiction is very far from a working proposal for a functioning society.
4. I’m skeptical about a Lawnmower Man-like scenario where a superintelligence suddenly takes over the world. In essence, cryptographic barriers are plausibly real, even to a superintelligence. As long as that’s so, the thing to watch out for is excessive concentrations of power without oversight. We already have a functioning notion of super-human intelligence in organizational intelligence and are familiar with techniques for restraining organizational intelligence into useful-for-society channels. Starting with this and improving seems reasonable.

ICML 2021 Invited Speakers — ML for Science

Ameet — Mon, 19 Jul 2021 15:54:42 +0000

By: Stefanie Jegelka and Ameet Talwalkar (ICML21 Communication Chairs)

With ICML 2021 underway, we wanted to briefly highlight the upcoming invited talks. A general theme of the invited talks this year is “machine learning for science.” The Program Chairs (Marina Meila and Tong Zhang) have invited world-renowned scientists from various disciplines to discuss their problems and the corresponding machine learning challenges. By exposing the machine learning community to these fascinating problems, we hope that we can help to further expand the applicability of machine learning to a wide range of scientific domains.

Daphne Koller (Tuesday, July 20th at 8am PDT): Dr. Koller is a pioneer in the field of machine learning, and is currently the Founder and CEO of Insitro, which leverages machine learning for drug discovery. She was the Rajeev Motwani Professor of Computer Science at Stanford University, where she served on the faculty for 18 years. She was the co-founder, co-CEO and President of Coursera, and the Chief Computing Officer of Calico, an Alphabet company in the healthcare space. She received the MacArthur Foundation Fellowship in 2004, was awarded the ACM Prize in Computing in 2008, and was recognized as one of TIME Magazine’s 100 most influential people in 2012.
Xiao Cunde and Dahe Qin (Tuesday, July 20th at 8pm PDT): Dr. Cunde is a glaciologist and Deputy Director of the Institute of the Climate System, Chinese Academy of Meteorological Sciences. He has worked in the fields of polar glaciology and meteorology since 1997. His major research focus has been ice core studies relating to paleo-climate and paleo-environment, and present day cold region meteorological and glaciological processes that impact environmental and climatic changes. Dr. Qin is the Former Director of the China Meteorological Administration. He is a glaciologist and the first Chinese ever to cross the South Pole. He was a member of the 1989 International Cross South Pole Expedition and has published numerous ground-breaking articles, using evidence gathered from his Antarctic expeditions.
Esther Duflo (Wednesday, July 21st at 8am PDT): Dr. Duflo is the Abdul Latif Jameel Professor of Poverty Alleviation and Development Economics in the Department of Economics at MIT and a co-founder and co-director of the Abdul Latif Jameel Poverty Action Lab (J-PAL). In her research, she seeks to understand the economic lives of the poor, with the aim to help design and evaluate social policies. She has worked on health, education, financial inclusion, environment and governance. In 2019, she shared the Nobel Prize in Economic Sciences with Abhijit Banerjee and Michael Kremer “for their experimental approach to alleviating global poverty”. In particular, she and co-authors have introduced a new approach to obtaining reliable answers about the best ways to fight global poverty.
Edward Chang (Wednesday, July 21st at 8pm PDT): Dr. Chang is a Professor in the Department of Neurological Surgery at the UCSF Weill Institute for Neurosciences. He is a neurosurgeon and uses machine learning to understand brain functions. His research focuses on the brain mechanisms for speech, movement and human emotion. He co-directs the Center for Neural Engineering and Prostheses, a collaborative enterprise of UCSF and UC Berkeley. The center brings together experts in engineering, neurology and neurosurgery to develop state-of-the-art biomedical technology to restore function for patients with neurological disabilities such as paralysis and speech disorders.
Cecilia Clementi (Thursday, July 22nd at 8am PDT): Dr. Clementi is a Professor of Chemistry, and Chemical and Biomolecular Engineering, and Senior Scientist in the Center for Theoretical Biological Physics at Rice University, and an Einstein Fellow at FU Berlin. She researches strategies to study complex biophysical processes on long timescales, and she is an expert in the simulation of biomolecules using large-scale ML. Her group designs multiscale models, adaptive sampling approaches, and data analysis tools, and uses both data-driven methods and theoretical formulations.

To register for the conference and check out these talks, please visit: https://icml.cc/.

ALT Highlights – An Interview with Joelle Pineau

GautamKamath — Fri, 23 Apr 2021 14:06:29 +0000

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science. John has been kind enough to host the first post in the series. This initiative is organized by the Learning Theory Alliance, and overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

The first post is an interview with Joelle Pineau, by Michal Moshkovitz and Keziah Naggita.

We would like you to meet Dr. Joelle Pineau, an astounding leader in AI, based in Montreal, Canada.

Name: Joelle Pineau

Institutions: Joelle Pineau is a faculty member at Mila and an Associate Professor and William Dawson Scholar at the School of Computer Science at McGill University, where she co-directs the Reasoning and Learning Lab. She is a senior fellow of the Canadian Institute for Advanced Research (CIFAR), a co-managing director of Facebook AI Research, and the Montreal, Canada lab director. Learn more information about Joelle here and her talk here.

Reinforcement Learning (RL)

How and why did you choose to work in reinforcement learning? What are the things that inspired you to choose health as a domain of application for your RL work?

I started working in reinforcement learning at the beginning of my PhD in robotics at CMU. Quite honestly, I was delighted by the elegance of the mathematical formulation. It also had some link to topics I studied previously (in supervised learning & in operations research). It was also useful for decision-making, which was complementary to state tracking & prediction, which was the topic studied by many other members of my lab at the time.

I started working on applications to health-care early in my career as a faculty at McGill. I was curious to explore practical applications, and found some colleagues in health-care who had some interesting decision-making problems with the right characteristics.

How would you recommend a newcomer enter the RL field? For RL researchers interested in safety, is there some literature you can recommend as a starting point?

Get familiar with the basic mathematical formalism & algorithms, try your hand at easy simulation cases. For RL and safety, the literature is very small and quite recent, so it’s easy enough to get started. Work on Constrained MDPs (Altman, 1999) is a good starting point. See also the work on Seldonian RL, by Phil Thomas and colleagues.

In your talk you mentioned applications of RL to different domains. What do you think is the main achievement of RL?

The AlphaGo result was very impressive! Recently, the work on using RL to control the flight of the Loon balloons is also quite impressive.

What are the big open problems in RL?

Efficient exploration continues to be a major challenge. Stability of learning, even when the data is non-stationary (e.g. due to policy change), is also very important to address. In my talk I also highlighted the importance of developing methods for RL with responsible properties (safety, security, transparency, etc.) as a major open problem.

Collaborations

Based on your work in neurostimulation, it appears that people from different fields of expertise were involved.

Yes, this was a close collaboration between researchers in CS (my own lab) and researchers in neuroscience, with expertise in electrophysiology.

What advice would you give researchers in finding interdisciplinary collaborators?

This collaboration was literally started by me picking up the phone and calling a colleague in neuroscience to propose the project. I then wrote a grant proposal and obtained funding to start the project. More generally, these days it’s actually very easy for researchers in machine learning to find interdisciplinary collaborators. Giving talks, offering office hours, speaking to colleagues you meet in random events – I’ve had literally dozens of projects proposed to me in the last few years, from all sorts of disciplines.

What are some of the best ways to foster successful collaborations tackling work cutting across multiple disciplines?

Spend time understanding the problems from the point of view of your collaborator, and commit to solving *that* problem. Don’t walk in with your own hammer (or pre-selected set of techniques), and expect to find a problem to show-off your techniques. Genuine curiosity about the other field is very valuable! Don’t hesitate to read the literature – don’t expect your collaborator to share all the needed knowledge. Co-supervising a student together is also often an effective way of working closely together.

Academia, industry and everything in between

During the talk, you mentioned variance in freedom of research for theoreticians in industry versus academia. Could you elaborate more about this? Are there certain personality traits or characteristics more likely to make someone more successful in academia versus industry?

For certain more theoretical work, it can be a long time until the impact and value of the work is realized. This is perhaps harder to support in industry, which is better suited to appreciate shorter-term impact. Another big difference is that in Academia, professors work closely with students and junior researchers, and should expect to dedicate a good amount of time and energy to training & developing them (even if it means the work might move along a bit slower). In industry, a researcher will most often work with more senior researchers, and the project is likely to move along faster (also because no one is taking or teaching courses).

How do you balance leadership, for example, at FAIR, with student advising like at McGill, research [CIFAR, FAIR, McGill, Mila], and personal life?

It’s useful to have clarity about your priorities. Don’t let other people dictate what these are – you should decide for yourself. And then spend your time according to this. I enjoy my work at FAIR a lot, I also really enjoy spending time with my grad students at McGill/Mila, and of course I really enjoy time with my family & friends. So I try to keep a good balance between all of this. I also try to be clear & transparent with other people about my availability & priorities, so they can plan accordingly.

What do you think distinguishes the mindset of an extraordinary researcher?

To be a strong researcher, it helps to be very curious and to genuinely want to understand and discover new knowledge. The ability to find new connections between ideas and concepts is also useful. For scientific research, you also need discipline and good methodology, and a commitment to deep understanding (rather than “proving” whatever hypothesis you hold). Frankly, I also don’t think we need to further cultivate the myth of the “extraordinary researcher”. Research is primarily a collective institution, where many people contribute, in ways small and big, and it is through this collective work that we achieve big discoveries and breakthroughs!

What is the Right Response to Employer Misbehavior in Research?

John Langford — Mon, 14 Dec 2020 20:28:29 +0000

I enjoyed my conversations with Timnit when she was in the MSR-NYC lab, so her situation has been on my mind throughout NeurIPS.

Piecing together what happened second-hand is always tricky, but Jeff Dean’s account and Timnit’s agree on a basic outline. Timnit and others wrote a paper for FAccT which was approved for submission by the normal internal review process, then later unapproved. Timnit threatened to leave unless various details about this unapproval were clarified. Google then declared her resigned.

The definition of resign makes it clear an employee does it, not an employer. Since that apparently never happened, this is a mischaracterized firing. It also seems quite credible that the unapproval process was highly unusual based on various reactions I’ve seen and my personal expectations of what researchers would typically tolerate.

This frankly looks bad to me and quite a number of other people. Aside from the plain facts, this is also consistent with racism and/or sexism given the roles of those involved. Google itself now faces a substantial rebellion amongst employees.

However, I worry about the consequences of some of these reactions.

Some people suggest not reviewing papers from Google-based researchers. As a personal decision, this is making a program chair’s difficult job harder. As a communal decision, this would devastate the community since a substantial fraction are employed at Google. These people did not make this decision and many actively support Timnit there (at some risk to their job) so a mass-punishment approach seems deeply counterproductive.
Others have suggested that Google should not be a sponsor at major machine learning conferences. Since all of these are run as nonprofits, the lost grants will either be made up by increasing costs for everyone or reducing grants to students and diversity sponsorship. Reduced grants in particular seem deeply counterproductive.
Some have suggested that all industry research in general is bad. Industrial research varies substantially from place to place, perhaps much more so than in academia. As an example, Microsoft Research has no similar internal review process for publications. Overall, the stereotyping inherent in this view makes me uncomfortable and there are some real advantages to working in industry in terms of ability to concentrate on research or effecting real change.

It’s critical to understand that the strength of the research community is incredibly valuable to the community. It’s not hard to imagine a different arrangement where all industrial research is proprietary, with only a few major companies operating competitive internal research teams. This sort of structure exists in some other fields, often to the detriment of anyone other than a major company. Researchers at those companies can’t as easily switch jobs and researchers outside of those companies may lack the context to even contribute to the state of the art. The field itself progresses slower and in a more secretive way due to lack of sharing. Anticommunal acts based on mass ostracization or abandonment could shift our structure from the current relatively happy equilibrium where people from all over can participate, learn, and contribute towards a much worse situation.

This is not to say that there are no consequences. The substantial natural consequences of a significant moral-impacting event will play out regardless of anything else. The marketplace for top researchers is quite competitive, so for many of them uncertainty about the feasibility of publication, the disposition and competence of senior leadership, or constraints on topics tips the balance towards other offers. That may be severe this year, since this all blew up as the recruiting season was launching, and I expect it to last over many years unless some significant action is taken. In this sense, I expect all the competitors may be looking forward to recruiting more than they were previously, and the cost of not resolving the conflict here in a better way may be much, much higher than just about any other course of action. This is not particularly hypothetical: I saw it play out over the years after the Silicon Valley lab was cut, as the brain drain of other great researchers in competitive areas was severe for several years afterwards.

I don’t think a general answer to the starting question is possible, since it will always depend on circumstances. Even this instance is complex with actions that could cause unintuitive adverse impacts on unanticipated parts of our community or damage the community as a whole. I personally hope that the considerable natural consequences here form a substantial deterrent to misbehavior in the long term. Please think this through when considering your actions here.

Edits: tweaked conclusion wording a bit with advice from reshamas.

Experiments with the ICML 2020 Peer-Review Process

stiv — Tue, 01 Dec 2020 16:04:01 +0000

This post is cross-listed on the CMU ML blog.

The International Conference on Machine Learning (ICML) is a flagship machine learning conference that in 2020 received 4,990 submissions and managed a pool of 3,931 reviewers and area chairs. Given that the stakes in the review process are high — the careers of researchers are often significantly affected by the publications in top venues — we decided to scrutinize several components of the peer-review process in a series of experiments. Specifically, in conjunction with the ICML 2020 conference, we performed three experiments that target: resubmission policies, management of reviewer discussions, and reviewer recruiting. In this post, we summarize the results of these studies.

Resubmission Bias

Motivation. Several leading ML and AI conferences have recently started requiring authors to declare previous submission history of their papers. In part, such measures are taken to reduce the load on reviewers by discouraging resubmissions without substantial changes. However, this requirement poses a risk of bias in reviewers’ evaluations.

Research question. Do reviewers get biased when they know that the paper they are reviewing was previously rejected from a similar venue?

Procedure. We organized an auxiliary conference review process with 134 junior reviewers from 5 top US schools and 19 papers from various areas of ML. We assigned participants 1 paper each and asked them to review the paper as if it was submitted to ICML. Unbeknown to participants, we allocated them to a test or control condition uniformly at random:

Control. Participants review the papers as usual.

Test. Before reading the paper, participants are told that the paper they review is a resubmission.

Hypothesis. We expect that if the bias is present, reviewers in the test condition should be harsher than in the control.

Key findings. Reviewers give almost one point lower score (95% Confidence Interval: [0.24, 1.30]) on a 10-point Likert item for the overall evaluation of a paper when they are told that a paper is a resubmission. In terms of narrower review criteria, reviewers tend to underrate “Paper Quality” the most.

Implications. Conference organizers need to evaluate a trade-off between envisaged benefits such as the hypothetical reduction in the number of submissions and the potential unfairness introduced to the process by the resubmission bias. One option to reduce the bias is to postpone the moment in which the resubmission signal is revealed until after the initial reviews are submitted. This finding must also be accounted for when deciding whether the reviews of rejected papers should be publicly available on systems like openreview.net and others.

Details. http://arxiv.org/abs/2011.14646

Herding Effects in Discussions

Motivation. Past research on human decision making shows that group discussion is susceptible to various biases related to social influence. For instance, it is documented that the decision of a group may be biased towards the opinion of the group member who proposes the solution first. We call this effect herding and note that, in peer review, herding (if present) may result in undesirable artifacts in decisions as different area chairs use different strategies to select the discussion initiator.

Research question. Conditioned on a set of reviewers who actively participate in a discussion of a paper, does the final decision of the paper depend on the order in which reviewers join the discussion?

Procedure. We performed a randomized controlled trial on herding in ICML 2020 discussions that involved about 1,500 papers and 2,000 reviewers. In peer review, the discussion takes place after the reviewers submit their initial reviews, so we know prior opinions of reviewers about the papers. With this information, we split a subset of ICML papers into two groups uniformly at random and applied different discussion-management strategies to them:

Positive Group. First ask the most positive reviewer to start the discussion, then later ask the most negative reviewer to contribute to the discussion.

Negative Group. First ask the most negative reviewer to start the discussion, then later ask the most positive reviewer to contribute to the discussion.

Hypothesis. The only difference between the strategies is the order in which reviewers are supposed to join the discussion. Hence, if the herding is absent, the strategies will not impact submissions from the two groups disproportionately. However, if the herding is present, we expect that the difference in the order will introduce a difference in the acceptance rates across the two groups of papers.

Key findings. The analysis of outcomes of approximately 1,500 papers does not reveal a statistically significant difference in acceptance rates between the two groups of papers. Hence, we find no evidence of herding in the discussion phase of peer review.

Implications. Regarding the concern of herding which is found to occur in other applications involving people, discussion in peer review does not seem to be susceptible to this effect and hence no specific measures to counteract herding in peer-review discussions are needed.

Details. https://arxiv.org/abs/2011.15083

Novice Reviewer Recruiting

Motivation. A surge in the number of submissions received by leading ML and AI conferences has challenged the sustainability of the review process by increasing the burden on the pool of qualified reviewers. Leading conferences have been addressing the issue by relaxing the seniority bar for reviewers and inviting very junior researchers with limited or no publication history, but there is mixed evidence regarding the impact of such interventions on the quality of reviews.

Research question. Can very junior reviewers be recruited and guided such that they enlarge the reviewer pool of leading ML and AI conferences without compromising the quality of the process?

Procedure. We implemented a twofold approach towards managing novice reviewers:

Selection. We evaluated reviews written in the aforementioned auxiliary conference review process involving 134 junior reviewers, and invited 52 of these reviewers who produced the strongest reviews to join the reviewer pool of ICML 2020. Most of these 52 “experimental” reviewers come from the population not considered by the conventional way of reviewer recruiting used in ICML 2020.

Mentoring. In the actual conference, we provided these experimental reviewers with a senior researcher as a point of contact who offered additional mentoring.

Hypothesis. If our approach allows us to bring strong reviewers to the pool, we expect experimental reviewers to perform at least as well as reviewers from the main pool on various metrics, including the quality of reviews as rated by area chairs.

Key findings. A combination of the selection and mentoring mechanisms results in reviews of at least comparable and on some metrics even higher-rated quality as compared to the conventional pool of reviews: 30% of reviews written by the experimental reviewers exceeded the expectations of area chairs (compared to only 14% for the main pool).

Implications. The experiment received positive feedback from participants who appreciated the opportunity to become a reviewer in ICML 2020 and from authors of papers used in the auxiliary review process who received a set of useful reviews without submitting to a real conference. Hence, we believe that a promising direction is to replicate the experiment at a larger scale and evaluate the benefits of each component of our approach.

Details. http://arxiv.org/abs/2011.15050

Conclusion

All in all, the experiments we conducted in ICML 2020 reveal some useful and actionable insights about the peer-review process. We hope that some of these ideas will help to design a better peer-review pipeline in future conferences.

We thank ICML area chairs, reviewers, and authors for their tremendous efforts. We would also like to thank the Microsoft Conference Management Toolkit (CMT) team for their continuous support and implementation of features necessary to run these experiments, the authors of papers contributed to the auxiliary review process for their responsiveness, and participants of the resubmission bias experiment for their enthusiasm. Finally, we thank Ed Kennedy and Devendra Chaplot for their help with designing and executing the experiments.

The post is based on the works by Ivan Stelmakh, Nihar B. Shah, Aarti Singh, Hal Daumé III, and Charvi Rastogi.

HOMER: Provable Exploration in Reinforcement Learning

DipendraMisra — Tue, 21 Jul 2020 16:59:11 +0000

Last week at ICML 2020, Mikael Henaff, Akshay Krishnamurthy, John Langford and I had a paper on a new reinforcement learning (RL) algorithm that solves three key problems in RL: (i) global exploration, (ii) decoding latent dynamics, and (iii) optimizing a given reward function. Our ICML poster is here.

The paper is a bit mathematically heavy in nature so this post is an attempt to distill the key findings. We will also be following up soon with a new codebase release (more on it later).

Rich-observation RL landscape

Consider the combination lock problem shown below. The agent starts in the state s_1a or s_1b with equal probability. After taking h-1 actions, the agent will be in either state s_ha, s_hb, or s_hc. The agent can take 10 different actions. The agent observes a high-dimensional observation (focus circle) instead of the underlying state which is latent. There is a big treasure chest that one can get after taking 100 actions. We view the states with subscript “a” or “b” as “good states” and one with subscript “c” as “bad states”. You can reach the treasure chest at the end only if you remain in good states. If you reach any bad state, then you can never make it to the treasure chest.

The environment makes it difficult to reach the big treasure chest in three ways. First, the environmental dynamics are such that if you are in good states, then only 1 out of 10 possible actions will let you reach the two good states at the next time step with equal probability (the good action changes from state to state). Every other action in good states and all actions in bad states put you into bad states at the next time step, from which it is impossible to recover. Second, it misleads myopic agents by giving a small bonus for transitioning from a good state to a bad state (small treasure chest). This means that a locally optimal policy is to transition to one of the bad states as quickly as possible. Third, the agent never directly observes which state it is in. Instead, it receives a high-dimensional, noisy observation from which it must decode the true underlying state.

It is easy to see that if we take actions uniformly at random, then the probability of reaching the big treasure chest at the end is 1/10¹⁰⁰. The number 10¹⁰⁰ is called a googol and is larger than the current estimate of the number of elementary particles in the universe. Furthermore, since transitions are stochastic, one can show that no fixed sequence of actions performs well either.
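To make the arithmetic concrete, here is a minimal sketch. It assumes exactly one good action out of 10 at each of the 100 steps and an agent that picks actions uniformly at random. The function names are hypothetical and the environment is reduced to "did we pick the good action at every step", so this is an illustration of the calculation rather than the environment used in the paper.

```python
import random
from fractions import Fraction


def random_policy_success_probability(num_actions: int, horizon: int) -> Fraction:
    """Exact chance that uniformly random actions hit the single good action at every step."""
    return Fraction(1, num_actions) ** horizon


def simulate_random_policy(num_actions: int, horizon: int, episodes: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (only informative for short horizons)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(episodes):
        alive = True
        for _ in range(horizon):
            # The good action changes from state to state, so draw it fresh each step.
            good_action = rng.randrange(num_actions)
            if rng.randrange(num_actions) != good_action:
                alive = False  # the agent fell into a bad state, which is absorbing
                break
        successes += alive
    return successes / episodes


if __name__ == "__main__":
    print(random_policy_success_probability(10, 100))    # exact value for the full problem
    print(simulate_random_policy(10, 2, episodes=200_000))  # short horizon, roughly 0.01
```

With the full horizon of 100 the simulation essentially never succeeds, which is the point: uniform exploration cannot find the big treasure chest in any feasible number of episodes.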

A key aspect of the rich-observation setting is that the agent receives observations instead of the latent state. The observations are stochastically sampled from an infinitely large space conditioned on the state. However, observations are rich enough to enable decoding the latent state which generates them.

What does provable RL mean?

A provable RL algorithm means that for any given numbers e, d in (0, 1), we can learn an e-optimal policy with probability at least 1-d using a number of episodes that is polynomial in relevant quantities (state size, horizon, action space, 1/e, 1/d, etc.). By e-optimal policy we mean a policy whose value (expected total return) is at most e less than the optimal return.

Thus, a provable RL algorithm is capable of learning a close to optimal policy with high probability (where the word high and close can be made arbitrarily more refined), provided the assumptions it makes are satisfied.
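The definition above can be written compactly as follows. The notation is an assumption made for this sketch: ε and δ stand for the numbers e and d in the text, V(π) is the value (expected total return) of policy π, π̂ is the learned policy, π* is an optimal policy, and |S|, H, |A| are the state-space size, horizon, and number of actions.

```latex
\[
\forall\, \epsilon, \delta \in (0,1): \quad
\Pr\!\left[\, V(\hat{\pi}) \;\ge\; V(\pi^\star) - \epsilon \,\right] \;\ge\; 1 - \delta,
\qquad
\text{number of episodes} \;\le\; \mathrm{poly}\!\left(|\mathcal{S}|,\, H,\, |\mathcal{A}|,\, 1/\epsilon,\, 1/\delta\right).
\]
```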

Why should I care if my algorithm is provable?

There are two main advantages of being able to show your algorithm is provable:

We can only test an algorithm on a finite number of environments (in practice somewhere between 1 and 20). Without guarantees, we don’t know how they will behave in a new environment. This matters especially if failure in a new environment can result in high real-world costs (e.g., in health or financial domains).
If a provable algorithm fails to consistently give the desired result, this can be attributed to failure of at least one of its assumptions. A developer can then look at the assumptions and try to determine which ones are violated, and either intervene to fix them or determine that the algorithm is not appropriate for the problem.

HOMER

Our algorithm addresses what is known as the Block MDP setting. In this setting, a small number of discrete states generates a potentially infinite number of high dimensional observations.

For each time step, HOMER learns a state decoder function, and a set of exploration policies. The state decoder maps high-dimensional observations to a small set of possible latent states, while the exploration policies map observations to actions which will lead the agent to each of the latent states. We describe HOMER below.

For a given time step, we first learn a decoder for mapping observations to a small set of values using contrastive learning. This procedure works as follows: collect a transition by following a randomly sampled exploration policy from the previous time step until that time step, and then taking a single random action. We use this procedure to sample two transitions shown below.

We then flip a coin; if we get heads then we store the transition (x1, a1, x’1), and otherwise we store the imposter transition (x1, a1, x’2). We train a supervised classifier to predict if a given transition (x, a, x’) is real or not.
This classifier has a special structure which allows us to recover a decoder for time step h.
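The data-collection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the paper: `sample_transition` is a hypothetical stand-in for "follow a randomly sampled exploration policy from the previous time step, then take a single random action", and the classifier with its special structure is not shown.

```python
import random
from typing import Callable, List, Tuple

# A transition is (observation x, action a, next observation x').
Transition = Tuple[object, int, object]


def make_contrastive_dataset(
    sample_transition: Callable[[], Transition],  # hypothetical sampler, see lead-in
    num_examples: int,
    seed: int = 0,
) -> List[Tuple[Transition, int]]:
    """Build (transition, label) pairs: label 1 for real, 0 for imposter."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_examples):
        # Sample two independent transitions.
        x1, a1, x1_next = sample_transition()
        _, _, x2_next = sample_transition()
        if rng.random() < 0.5:
            # Heads: keep the real transition (x1, a1, x'1).
            dataset.append(((x1, a1, x1_next), 1))
        else:
            # Tails: swap in the other next observation to get the imposter (x1, a1, x'2).
            dataset.append(((x1, a1, x2_next), 0))
    return dataset
```

A binary classifier trained on this dataset is what the text refers to; any supervised learner that accepts (x, a, x’) triples could be plugged in here.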

Once we have learned the state decoder, we learn an exploration policy for every possible value of the decoder (which we call abstract states, as they are related to the latent state space). This step is standard and can be done using many different approaches such as model-based planning, model-free methods, etc. In the paper we use an existing model-free algorithm called policy search by dynamic programming (PSDP) by Bagnell et al. 2004.

We have now recovered a decoder and a set of exploration policies for this time step. We then repeat this for every time step and learn a decoder and exploration policies for the whole latent state space. Finally, we can easily optimize any given reward function using any provable planner like PSDP or a model-based algorithm. (The algorithm actually recovers the latent state space up to an inherent ambiguity by combining two different decoders, but I’ll leave that out to avoid overloading this post.)

Key findings

HOMER achieves the following three properties:

The contrastive learning procedure gives us the right state decoding (we recover up to some inherent ambiguity but I won’t cover it here).
HOMER can learn a set of exploration policies to reach every latent state.
HOMER can learn a nearly-optimal policy for any given reward function with high probability. Further, this can be done after the exploration part has been performed.

Failure cases of prior RL algorithms

There are many RL algorithms in the literature and many new ones are proposed every month. It is difficult to do justice to this vast literature in a blog post. It is equally difficult to situate HOMER in this vast literature. However, we show that several very commonly used RL algorithms fail to solve the above problem while HOMER succeeds. One of these is the PPO algorithm, a widely used policy gradient algorithm. In spite of its popular use, PPO is not designed for challenging exploration problems and easily fails. Researchers have made efforts to alleviate this with ad-hoc proposals such as using prediction errors, counts based on auto-encoders, etc. The best alternative approach we found is called Random Network Distillation (RND), which measures novelty of a state based on prediction errors for a fixed randomly initialized network.

Below we show how PPO+RND fails to solve the above problem while HOMER succeeds. We simplify the problem by using a grid pattern where rows represent the state (the top two rows represent “good” states and the bottom row represents “bad” states), and columns represent the time step.

https://youtu.be/tjxl4kpd7Uw

We present counter-examples for other algorithms in the paper (see Section 6 here). These counterexamples allow us to find limits of prior work without expensive empirical computation on many domains.

How can I use HOMER?

We will be providing the code soon as part of a new package release called cereb-rl. You can find it here: https://github.com/cereb-rl and join the discussion here: https://gitter.im/cereb-rl

Critical issues in digital contact tracing

John Langford — Sun, 19 Apr 2020 03:00:33 +0000

I spent the last month becoming a connoisseur of digital contact tracing approaches since this seems like something where I might be able to help. Many other people have been thinking along similar lines (great), but I also see several misconceptions that even smart and deeply involved people are making.

For what follows, a key distinction to understand is between proximity and location approaches. In proximity approaches (such as DP3T, TCN, MIT PACT(*), Apple or one of the UW PACT(*) protocols which I am involved in) smartphones use Bluetooth Low Energy and possibly ultrasonics to discover other smartphones nearby. Location approaches (such as MIT Safe Paths or Israel) instead record the absolute location of the device based on GPS, cell tower triangulation, or Wi-Fi signals.

Location traces are both poor quality and intrinsically identifying
Many people associate the ability of a phone to determine where it is with the ability to discover where it is with high precision. This is typically incorrect. Common healthcare guidance for possible contact is “within 2 meters for 10 minutes” while location data is often off by 10-100 meters, with accuracy varying by which location methodology is in use. As an example, approximately everyone in Manhattan may be within 100 meters of someone who later tested positive for COVID-19. Given this inaccuracy, I expect users of a system based on location crossing to simply turn it off due to the large number of false positives.

These location traces, even though they are crude, are also highly identifying. When going about your normal pre-pandemic life, you move from location X to Y to Z. Typically no one else goes from X to Y to Z in the same timeframe (clocks are typically very accurate). If you test positive and make your trace available to help suppress the virus, a store owner with a video camera and a credit card record might de-anonymize you and accuse you of killing someone they care about. Given the stakes here, preserving as much anonymity as possible is critical for convincing people to release the information which is needed to control the virus.

Given this, approaches which upload the location data of users seem likely to have reduced adoption and many false positives. While some governments are choosing to use all location data on an involuntary basis like Israel, the lack of effectiveness compared to proximity based approaches and the draconian compromise of civil liberties are worrisome.

Location traces can be useful in a privacy-preserving way
Understanding the above, people often conclude that location traces are subsumed by alternatives. That’s not true. Location approaches can be made very private by simply never allowing a location trace to leave the personal device. While this might feel contradictory to epidemiological success, it’s actually extremely helpful in at least two ways.

People have a pretty poor memory, so when they test positive and someone calls them up to do a contact tracing interview, having a location trace on their phone can be incredibly useful in jogging their memory. Using the location trace this way allows the manual contact tracing process to be much more complete. It can also be made much faster by allowing infected people to prefill much of a contact interview form before they get a call.
The virus is inherently very localized, so public health authorities often want to quickly talk to people at location X or warn people to stay away from location Y until it is cleaned. This can be strongly enabled by on-device location traces. The phone can download all the public health messages in a region and check automatically which are relevant to the phone’s location trace, surfacing those as needed to the user. This provides more power than crossing location traces. A message of “If you were at store X on April 16th, please email w@y.z” allows people to not respond if they went to store V next door.

Both of these capabilities are a part of the UW PACT protocols I worked on for this reason.
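The second capability, matching downloaded public health messages against a trace that never leaves the phone, can be sketched as follows. This is only an illustration under stated assumptions: the `TracePoint` and `HealthMessage` record layouts, the circular region, and the function names are all hypothetical and are not the message format of the UW PACT protocols.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from typing import List


@dataclass(frozen=True)
class TracePoint:
    """One locally stored location sample (hypothetical layout)."""
    time: datetime
    lat: float
    lon: float


@dataclass(frozen=True)
class HealthMessage:
    """A public message tied to a place and time window (hypothetical layout)."""
    text: str
    lat: float
    lon: float
    radius_m: float
    start: datetime
    end: datetime


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6_371_000.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    h = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


def relevant_messages(trace: List[TracePoint], messages: List[HealthMessage]) -> List[HealthMessage]:
    """Return messages whose region and time window overlap the on-device trace.

    Everything runs locally: the trace is only read here, never uploaded.
    """
    return [
        m for m in messages
        if any(
            m.start <= p.time <= m.end
            and haversine_m(p.lat, p.lon, m.lat, m.lon) <= m.radius_m
            for p in trace
        )
    ]
```

The design choice that matters is the direction of data flow: messages for a whole region come down to the phone, and the comparison against the private trace happens there.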

Proximity-only approaches have an x² problem

When people abandon location-based approaches, it’s in favor of proximity-based approaches. For any proximity protocol approach to work, both the infected person and the contact must be running the protocol implying there are two ways for it to fail to be useful.

To get a sense of what is necessary, consider the reproduction number of the coronavirus. Estimates vary but a reproduction number of 2.5 is reasonable. That is, the virus might infect 2.5 new people per infected person on average in the absence of countermeasures. To keep an infection with a base reproduction number of 2.5 from exponentiating, it is necessary to reduce the reproduction number to 1 which can be done when 60% of contacts are discovered, assuming (optimistically) no testing error and perfect isolation of discovered contacts before they infect anyone else.

To reach 60% you need 77.5% of people to download and run proximity protocols. This is impossible in many places where smartphones are owned by fewer than 77.5% of the population. Even in places where it’s possible it’s difficult to imagine reaching that level of usage without it being a mandatory part of the operating system that you are forced to use. Even then, subpopulations without smartphones are not covered. The square problem gets worse at lower levels of adoption. At 10% adoption (which corresponds to a hugely popular app), only 1% of contacts can be discovered via this mechanism. Despite the smallness, informing 1% of contacts does have real value in about the same sense that if someone loaned you money with a 1%/week interest rate you would call them a loan shark. At the same time, this is only 1/60th of a solution to getting the reproduction number below 1.
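The numbers above follow from two short formulas, sketched here. The sketch assumes, as the text does, that the infected person and the contact each run the protocol independently with the same probability, so the discovered fraction of contacts is the adoption rate squared. The function names are hypothetical.

```python
import math


def required_contact_fraction(reproduction_number: float) -> float:
    """Fraction of contacts that must be discovered to bring the reproduction number to 1."""
    return 1.0 - 1.0 / reproduction_number


def contacts_discovered(adoption: float) -> float:
    """Both parties must run the protocol, so coverage is adoption squared."""
    return adoption ** 2


def required_adoption(reproduction_number: float) -> float:
    """Adoption rate whose square equals the required contact fraction."""
    return math.sqrt(required_contact_fraction(reproduction_number))


if __name__ == "__main__":
    print(required_contact_fraction(2.5))  # 60% of contacts
    print(required_adoption(2.5))          # about 77.5% adoption
    print(contacts_discovered(0.10))       # about 1% of contacts at 10% adoption
    print(contacts_discovered(0.10) / required_contact_fraction(2.5))  # about 1/60
```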

Hence, people advocating for proximity approaches must either hope for pervasive mandatory use (which will still miss subcommunities without smartphones) or accept that proximity approaches are only a part of the picture.

This quadratic structure also implies that the number of successful proximity tracing protocols will be either 0 or 1 in any geographic region. Given that Apple/Google are building a protocol into their OSes, that’s the candidate for the possible 1 in most of the world once it becomes available(**).

This quadratic structure is difficult to avoid. For example, if location traces are crossed with location traces, the same issue comes up. Similarly for proximity tracing, you could imagine recording “wild” Bluetooth beacons and then reporting them to avoid the quadratic structure. This, however, unavoidably reveals contacts publicly, which can then cause the positive person to be revealed publicly.

Interestingly, traditional manual contact tracing does not suffer from the quadratic problem. Hence approaches (discussed above) which augment and benefit from manual contact tracing have a linear value structure, which matters enormously with lower levels of adoption.

What works?
The primary thrust of contact tracing needs to be manual, as that is what has worked in countries (like South Korea) which suppressed large outbreaks. Purely digital approaches don’t seem like a credible solution due to issues discussed above. Hybrid approaches with smartphone-based apps can help by complementing manual contact tracing and perhaps via proximity approaches. Getting there requires high levels of adoption, which implies trust is a critical commodity. In addition to navigating the issues above, projects need to be open source, voluntary, useful, and strongly respect privacy (the ACLU recommendations are good here). This is what the CovidSafe project is aimed at in implementing the UW PACT protocols. Projects not navigating the above issues as well are less credible in my understanding.

An acknowledgement: many people have affected my thinking through this process, particularly those on the UW PACT paper and CovidSafe projects.

(*) I have no idea how the name collision occurred. We started using PACT here, 3 weeks ago, and circulated drafts to many people including a subset of the MIT PACT group before putting it on arxiv.

(**) The Apple protocol is a bit worrisome as development there is not particularly open and I have a concern about the crypto protocol. The Tracing Key on page 5, if acquired via hack or subpoena, allows you to prove the location of a device years after the fact. This is not epidemiologically required and there are other protocols without this weakness. Edit: The new version of their protocol addresses this issue.

What is the most effective policy response to the new coronavirus pandemic?

John Langford — Tue, 17 Mar 2020 18:45:02 +0000

Disclaimer: I am not an epidemiologist, but there is an interesting potentially important pattern in the data that seems worth understanding.

World healthcare authorities appear to be primarily shifting towards Social Distancing. However, there is potential to pursue a different strategy in the medium term that exploits a vulnerability of this disease: the 5-day incubation time is much longer than a 4-hour detection time. This vulnerability is real: it has proved exploitable at scale in South Korea and in China outside of Hubei.

Exploiting this vulnerability requires:

A sufficient capacity of rapid tests must be available. Sufficient here is perhaps 30 times the number of true new cases per day based on South Korea’s testing rate.
The capacity to rapidly trace the contacts of confirmed positive cases. This is both highly labor intensive and absurdly cheap compared to shutting down the economy.
Effective quarantining of positive and suspect cases. This could be in home, with the quarantine extended to the entire family. It could also be done in a hotel (… which are pretty empty these days), or in a hospital.

Where Test/Trace/Quarantine is working, the number of cases/day has declined empirically. Furthermore, this appears to be a radically superior strategy where it can be deployed. I’ll review the evidence, discuss the other strategies and their consequences, and then discuss what can be done.

Evidence for Test/Trace/Quarantine
The Test/Trace/Quarantine (TTQ) strategy works when it effectively catches at least a 1 − 1/R fraction of cases, where R is the reproduction number. R is not precisely known, although discovering 90% of cases seems likely effective and discovering 50% of cases seems likely ineffective based on public data.
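
The threshold above is simple enough to compute directly. The following is a minimal sketch, assuming R denotes the reproduction number; the function name and the sample values of R are illustrative and not taken from any dataset.

```python
def required_detection_fraction(reproduction_number: float) -> float:
    """Fraction of cases that must be caught so that, on average,
    each case leads to fewer than one new uncaught case: 1 - 1/R."""
    if reproduction_number <= 1.0:
        # The outbreak already shrinks on its own; nothing needs catching.
        return 0.0
    return 1.0 - 1.0 / reproduction_number


# Illustrative values of R only (assumptions, not measurements).
for r in (2.0, 3.0, 10.0):
    fraction = required_detection_fraction(r)
    print(f"R = {r:4.1f} -> catch at least {fraction:.0%} of cases")
```

Read this way, catching 50% of cases only suffices if R is at most 2, while catching 90% suffices for any R up to 10.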

How do you know what fraction of cases are detected? A crude measure can be formed by comparing detected cases / mortality across different countries. Anyone who dies from pneumonia these days should be tested for COVID-19 so the number of deaths is a relatively trustworthy statistic. If we suppose the ratio of true cases to mortality is fixed, then the ratio of observed cases to mortality allows us to estimate the fraction of detected cases. For example, if the true ratio between infections and fatalities is 100 while we observe 30, then the detection rate is 30%.
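
As a rough sketch of this estimate: the function name is hypothetical, the default of 100 true cases per death is just the example figure from the paragraph above, and capping the result at 100% is an added assumption for countries whose observed ratio exceeds the assumed true ratio.

```python
def estimated_detection_fraction(observed_cases: float,
                                 observed_deaths: float,
                                 true_cases_per_death: float = 100.0) -> float:
    """Crude estimate of the fraction of true cases that are detected.

    Assumes the ratio of true cases to deaths is fixed at
    `true_cases_per_death` (100 is the example value, not a measurement).
    """
    if observed_deaths <= 0:
        raise ValueError("need at least one recorded death to form the ratio")
    observed_ratio = observed_cases / observed_deaths
    # Assumption: cap at 1.0, since more than 100% detection is meaningless.
    return min(1.0, observed_ratio / true_cases_per_death)


# Example from the text: an observed ratio of 30 against a true ratio of 100.
print(estimated_detection_fraction(observed_cases=3000, observed_deaths=100))
```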

There are many caveats to this analysis (see below). Nevertheless, this ratio seems to provide real information which is useful in thinking about the future. Drawing data from the Johns Hopkins COVID-19 time series, and plotting we see:

The arrows here represent the progression of time by days with time starting at the first recorded death. The X axis here is the ratio between cumulative observed cases and cumulative observed deaths. Countries that are able and willing to test widely have progressions on the right while those that are unable or unwilling to test widely are on the left. Note here that the X axis is on a log scale allowing us to see small variations in the ratio when the ratio is small and large variations in the ratio when the ratio is large.

The Y axis here is the number of cases/day. For a country to engage in effective Test/Trace/Quarantine, it must effectively test, which the X axis is measuring. Intuitively, we expect countries that test effectively to follow up with Trace and Quarantine, and we expect this to result in a reduced number of cases per day. This is exactly what is observed. Note that we again use a log scale for the Y axis due to the enormous differences in numbers.
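
For readers who want to reproduce the two axes, here is a minimal Python sketch of the per-day computation. It is not the code from the repository mentioned below (which is C++); the function name and the sample numbers are made up for illustration.

```python
def graph_points(cumulative_cases, cumulative_deaths):
    """Return one (cases/deaths ratio, new cases that day) pair per day,
    skipping days before the first recorded death."""
    if len(cumulative_cases) != len(cumulative_deaths):
        raise ValueError("series must cover the same days")
    points = []
    for day in range(1, len(cumulative_cases)):
        deaths = cumulative_deaths[day]
        if deaths == 0:
            continue  # the ratio is undefined before the first death
        ratio = cumulative_cases[day] / deaths                          # X axis
        new_cases = cumulative_cases[day] - cumulative_cases[day - 1]   # Y axis
        points.append((ratio, new_cases))
    return points


# Hypothetical cumulative counts, not real data.
cases = [10, 50, 200, 500, 900]
deaths = [0, 0, 2, 5, 10]
for ratio, new_cases in graph_points(cases, deaths):
    print(f"ratio = {ratio:6.1f}, new cases/day = {new_cases}")
```

Since both axes are logarithmic, days with zero (or negative, after data corrections) new cases would need to be dropped or clipped before plotting.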

There are several things you can read from this graph that make sense when you consider the dynamics.

China excluding Hubei and South Korea had outbreaks which did not exceed the hospital capacity since the arrows start moving up and then loop back down around a 1% fatality rate.
The United States has a growing outbreak and a growing testing capacity. Comparing with China-excluding-Hubei and South Korea’s outbreak, only about 1/10 to 1/4 of the cases are likely detected. Can the United States expand capacity fast enough to keep up with the growth of the epidemic?
Looking at Italy, you can see evidence of an overwhelmed healthcare system as the fatality rate escalates. There is also some hope here, since the effects of the Italian lockdown are possibly starting to show in the new daily cases.
Germany is a strange case with an extremely large ratio. It looks like there is evidence that Germany is starting to control their outbreak, which is hopeful and aligned with our expectations.

The creation of this graph is fully automated and it’s easy to graph things for any country in the Johns Hopkins dataset. I created a GitHub repository with the code. Feel free to make fun of me for using C++ as a scripting language.

You can also understand some of the limitations of this graph by thinking through the statistics and generation process.

Mortality is a delayed statistic. Apparently, it’s about a week delayed in the case of COVID-19. Given this, you expect to see the ratio generate loops when an outbreak occurs and then is controlled. South Korea and China-excluding-Hubei show this looping structure, returning to a ratio of near 100.
Mortality is a small statistic, and a small statistic in the denominator can make the ratio unstable. When mortality is relatively low, we expect to see quite a variation. Checking each progression, you see wide ratio variations initially, particularly in the case of the United States.
Mortality may vary from population to population. It’s almost surely dependent on the age distribution and health characteristics of the population and possibly other factors as well. Germany’s ratio is notably large here.
Mortality is not a fixed variable, but rather dependent on the quality of care. A reasonable approximation of this is that every “critical” case dies without intensive care support. Hence, we definitely do not expect this statistic to hold up when/where the healthcare system is overwhelmed, as it is in Italy. This is also the reason why I excluded Hubei from the China data.

Lockdown
The only other strategy known to work is a “lockdown” where nearly everyone stays home nearly all the time, as first used in Hubei. This characterization is simplistic; in practice such a quarantine comes with many other measures as well. This can work very effectively: today the number of new cases in Hubei is in the 10s.

The lockdown approach shuts down the economy fast and hard. Most people can’t work, so they can’t make money, so they can’t buy things, so the people who make things can’t make money, so they go broke, etc… This is strongly reflected in the stock market’s reaction to the escalating pandemic. If the lockdown approach is used for long, most people and companies are destined for bankruptcy. If a lockdown approach costs 50% of GDP, then a Test/Trace/Quarantine approach costing only a few percent of GDP seems incredibly cheap in comparison.

The lockdown approach is also extremely intrusive. It’s akin to collective punishment in that it harms the welfare of everyone, regardless of their disease status. Many people’s daily lives fundamentally depend on moving around, for example people using dialysis.

Despite this, the lockdown approach is being taken up everywhere that cases are overwhelming or threaten to overwhelm hospitals because the alternative (next) is even worse. One advantage that a lockdown approach has is that it can be used now while the Test/Trace/Quarantine approach requires more organizing. It’s the best bad option when the Test/Trace/Quarantine capacity is exceeded or to bridge the time until it becomes available.

If/when/where Test/Trace/Quarantine becomes available, I expect it to be rapidly adopted. This new study (page 11) points out that repeated lockdowns are close to permanent lockdowns in effect.

Herd Immunity
Some countries have considered skipping measures to control the virus on the theory that the population eventually acquires enough people with individual immunity after recovery so the disease dies out. This approach invites severe consequences.

A key issue here is: How bad is the virus? The mortality rate in China excluding Hubei and South Korea is only about 1%. From this, some people appear to erroneously reason that the impact of the virus is “only” having 1% of 50% of the population die, heavily weighted towards older people. This reasoning is fundamentally flawed.

The mortality rate is not a fixed number, but rather dependent on the quality of care. In particular, because most countries have very few intensive care units, an uncontrolled epidemic effectively implies all but a vanishing fraction of sick people only benefit from home-stay quality of care. How many people could die with home-stay quality of care? Essentially everyone who would otherwise require intensive care at a hospital. In China, that meant 6.1% (see page 12). Given this, the sound understanding is that an uncontrolled COVID-19 epidemic generates mortality a factor of 2-3 worse than the 1918 influenza pandemic, whereas modern healthcare, when not overwhelmed, might instead make it half as bad. Note here that the fatality rate in Hubei (4.6% of known cases, which might be 3% of total cases) does not fully express how bad this would be, because the fraction of infected people remained low and there was a surge of healthcare support from the rest of China.
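
The arithmetic behind this comparison can be sketched as follows. The 50% infected fraction is the figure used in the flawed argument above, and 1% and 6.1% are the fatality figures quoted in this section; the function name is hypothetical, and treating every case that needs intensive care as fatal under overwhelmed care is the simplifying assumption stated in the text.

```python
def population_fatality_fraction(infected_fraction: float,
                                 fatality_rate: float) -> float:
    """Fraction of the whole population that dies, given the fraction
    infected and the fatality rate among the infected."""
    return infected_fraction * fatality_rate


INFECTED_FRACTION = 0.5  # figure from the herd immunity argument above

# Fatality rate with functioning healthcare (about 1%).
with_care = population_fatality_fraction(INFECTED_FRACTION, 0.01)
# Fatality rate if every case needing intensive care dies (6.1%).
overwhelmed = population_fatality_fraction(INFECTED_FRACTION, 0.061)

print(f"with functioning healthcare: {with_care:.2%} of the population")
print(f"with overwhelmed healthcare: {overwhelmed:.2%} of the population")
print(f"ratio: {overwhelmed / with_care:.1f}x")
```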

The herd immunity approach also does not cause the disease to die out—instead it continues to linger in the population for a long time. This means that people traveling from such a country will be effectively ostracized by every country (like China or South Korea) which has effectively implemented a Test/Trace/Quarantine approach.

I’ve avoided discussing the ethics here since people making this kind of argument may not care about ethics. For everyone else it’s fair to say that letting part of the population die to keep the economy going is anathema. My overall expectation is that governments pursuing this approach are at serious risk of revolt.

Vaccine

Vaccines are extremely attractive because they are a very low cost way to end the pandemic. They are however uncertain and take time to develop and test, so they are not a viable strategy for the next few months.

What can be done?

Public health authorities are generally talking about Social Distancing. This is plausibly the best general-public message because everyone can do something to help here.

It’s also clear that healthcare workers, vaccine makers, and everyone supporting them have a critical role to play.

But, perhaps there’s a third group that can really help? Perhaps there are people who can help scale up the Test/Trace/Quarantine approach so it can be rapidly adopted? Natural questions here are:

How can testing be scaled up rapidly—more rapidly than the disease? This question is already getting quite a bit of attention, and deservedly so.
How can tracing be scaled up rapidly and efficiently? Hiring many people who are freshly out of work is the most obvious solution. That could make good sense given the situation. However, automated or partially automated approaches have the potential to greatly assist as well. I hesitate to mention cell phone tracking because of the potential for abuse, but can that be avoided while still gaining the potential public health benefits?
How can quarantining be made highly precise and effective? Can you estimate the risk of infection with high precision? What support can safely be put in place to help those who are quarantined? Can we avoid the situation where the government says “you should quarantine” and “people in quarantine can’t vote”?

Some countries started this pandemic set up for relatively quick scale-up of Test/Trace/Quarantine. Others, including the United States, seem to have been unprepared. Nevertheless, I am still holding out hope that the worst case scenarios (high mortality or months-long lockdowns) can be largely avoided as the available evidence suggests that this is certainly possible. Can we manage to get the number of true cases down (via a short lockdown if necessary) to the point where an escalating Test/Trace/Quarantine approach can take over?

Edit: I found myself remaking the graph for myself personally so I made it update hourly and added New York (where I live).

Coronavirus and Machine Learning Conferences

John Langford — Sun, 23 Feb 2020 23:15:47 +0000

I’ve been following the renamed COVID-19 epidemic closely since potential exponentials deserve that kind of attention.

The last few days have convinced me it’s a good idea to start making contingency plans for machine learning conferences like ICML. The plausible options happen to be structurally aligned with calls to enable reduced travel to machine learning conferences, but of course the need is much more immediate.

I’ll discuss relevant observations about COVID-19 and then the impact on machine learning conferences.

COVID-19 observations

COVID-19 is capable of exponentiating with a base estimated at 2.13-3.11 and a doubling time around a week when unchecked.
COVID-19 is far more deadly than the seasonal flu with estimates of a 2-3% fatality rate but also much milder than SARS or MERS. Indeed, part of what makes COVID-19 so significant is the fact that it is mild for many people leading to a lack of diagnosis, more spread, and ultimately more illness and death.
COVID-19 can be controlled at a large scale via draconian travel restrictions. The number of new observed cases per day peaked about 2 weeks after China’s lockdown and has been declining for the last week.
COVID-19 can be controlled at a small scale by careful contact tracing and isolation. There have been hundreds of cases spread across the world over the last month which have not created new uncontrolled outbreaks.
New significant uncontrolled outbreaks in Italy, Iran, and South Korea have been revealed over the last few days. Some details:
1. The 8 COVID-19 deaths in Iran suggest that the few reported cases (as of 2/23) are only the tip of the iceberg.
2. The fact that South Korea and Italy can suddenly discover a large outbreak despite heavy news coverage suggests that it can really happen anywhere.
3. These new outbreaks suggest that in a few days COVID-19 is likely to become a world-problem with a declining China aspect rather than a China-problem with ramifications for the rest of the world.
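
To make the first observation concrete, here is a small sketch of what a doubling time of around a week implies if growth stays unchecked. The function name is hypothetical, and a constant 7-day doubling time is a simplifying assumption rather than a forecast.

```python
def growth_factor(days: float, doubling_time_days: float = 7.0) -> float:
    """Multiplicative growth in case counts after `days` of unchecked
    spread, assuming a constant doubling time."""
    return 2.0 ** (days / doubling_time_days)


for weeks in (1, 4, 8):
    factor = growth_factor(7 * weeks)
    print(f"after {weeks} week(s): cases multiply by about {factor:.0f}")
```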

There remains quite a bit of uncertainty about COVID-19, of course. The plausible bet is that the known control measures remain effective when and where they can be exercised with new ones (like a vaccine) eventually reducing it to a non-problem.

Conferences
The plausible scenario leaves conferences still in a delicate position because they require many things to go right to function. We can easily envision 3 quite different futures here consistent with the plausible case.

Good case: New COVID-19 outbreaks are systematically controlled via proven measures with the overall number of daily cases declining steadily as they are right now. The impact on conferences is marginal with lingering travel restrictions affecting some (<10%) potential attendees.
Poor case: Multiple COVID-19 outbreaks turn into a pandemic (=multi-continent epidemic) in regions unable to effectively exercise either control measure. Outbreaks in other regions occur, but they are effectively controlled. The impact on conferences is significant with many (50%?) avoiding travel due to either restrictions or uncertainty about restrictions.
Bad case: The same as the poor case, except that an outbreak occurs in the area of the conference. This makes the conference nonviable due to travel restrictions alone. It’s notable here that Italy’s new outbreak involves travel lockdowns a few hundred miles/kilometers from Vienna where ICML 2020 is planned.

Even the first outcome could benefit from some planning while gracefully handling the last outcome requires it.

The obvious response to these plausible scenarios is to reduce the dependence of a successful conference on travel. To do this we need to think about what a conference is in terms of the roles that it fulfills. The quick breakdown I see is:

Distilling knowledge. Luckily, our review process is already distributed.
Passing on knowledge.
Meeting people, both old friends and discovering new ones.
Finding a job / employee.

How (and which) of these can be effectively supported remotely?

I’m planning to have discussions over the next few weeks about this to distill out some plans. If you have good ideas, let’s discuss. Unlike most contingency planning, it seems likely that efforts are not wasted no matter what the outcome.