Regretting the dead

Nikos pointed out this New York Times article about poor clinical trial design killing people. For those of us who study learning from exploration information, this is a reminder that low-regret algorithms are particularly important, as regret in clinical trials is measured in patient deaths.

Two obvious improvements on the experimental design are:

  1. With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach has already been explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between a year ago and now, which denies the value of any clinical trial.
  2. An optimal experimental design will smoothly phase from exploration into exploitation as evidence accumulates that a new treatment is effective. This is old tech, for example the EXP3.P algorithm (page 12, a.k.a. 59), although I prefer the generalized and somewhat clearer analysis of EXP4.P.

Done the right way, the clinical trial for a successful treatment would start with some small initial pool (equivalent to “phase 1” in the article) and then simply expand the pool of participants over time as the treatment proved superior to the existing one, until the pool is everyone. And as a bonus, you can even compete with policies over treatments rather than raw treatments (i.e., personalized medicine).
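To make the phased design concrete, here is a toy simulation in the spirit of EXP3-style exponential weighting (a sketch, not the full EXP3.P algorithm or its high-probability analysis); the survival probabilities, step size, and patient count are all invented illustrative numbers:

```python
import math
import random

def adaptive_trial(p_standard=0.5, p_new=0.7, n_patients=2000,
                   gamma=0.05, seed=0):
    """Toy EXP3-style allocation between two treatments.

    Rewards are hypothetical survival indicators (Bernoulli). The
    probability of assigning the new treatment grows as evidence
    accrues, phasing smoothly from exploration to exploitation.
    """
    rng = random.Random(seed)
    weights = [1.0, 1.0]            # arm 0 = standard, arm 1 = new
    alloc_new = []                  # allocation probability of the new arm
    for _ in range(n_patients):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / 2 for w in weights]
        arm = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if rng.random() < (p_standard, p_new)[arm] else 0.0
        # importance-weighted reward estimate keeps the update unbiased
        weights[arm] *= math.exp(gamma * (reward / probs[arm]) / 2)
        m = max(weights)            # renormalize to avoid overflow
        weights = [w / m for w in weights]
        alloc_new.append(probs[1])
    return alloc_new

alloc = adaptive_trial()
print(alloc[0], alloc[-1])  # starts near 50/50; typically ends mostly on the better arm
```

With these made-up numbers, roughly half of the early patients get each arm, and nearly all of the later patients get the truly better treatment, which is the “expanding pool” behavior described above.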

Getting from here to there seems difficult. It’s been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it’s also a failure of imagination within our own field. When we design algorithms, we often don’t think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways. Getting this right here has a substantial moral aspect, potentially saving millions of lives over time through faster and more precise deployment of new treatments.

30 Replies to “Regretting the dead”

  1. I think this is important information to communicate to the larger world. How can we get the message out?

    1. Someone like Michael Littman’s wife might be important here. She probably knows the relevant conferences, and who to contact—an invited talk on the subject at a medical conference might be an effective start.

  2. I don’t think the need for a control group can be dismissed so easily. Clinical trials are often not representative of how a drug would be used in reality: patients are pre-screened and those with complicated diagnoses are often excluded; patients may receive more attention and adherence to the regimen may be more strict; and of course there is the placebo effect.

    1. If you read the article, you’ll notice that patients knew which group they were in here, and hence the placebo effect was not controlled for. But more generally, placebo effects can be substantial, at least for some diseases. This means that the first time something is treated, it’s probably important to have a control group. And for the nth time, the control group is displaced in time to a year earlier.

      Your point about prescreening is correct. It’s essential that this screening be done based upon standardized information collection to do the time displaced control group properly.

      The difference in adherence and attention is something which should probably be addressed by clinical trial design, as it’s best for society to have a clinical trial reflect actual use.

      Overall, I understand there are various challenges in implementation, but I see no fundamental obstacle to overcoming them.

      1. I disagree: the challenges lie not with implementation, but at the fundamental philosophical level of what constitutes causality: that is, does this drug cause the disease to be cured? At the moment, the only widely accepted way of identifying causal effects is through the use of randomisation and interventions in properly controlled trials. There is exciting work being done in this area, but we haven’t yet found a “grand theory of causal inference”.

        I’m not saying that trial design couldn’t be improved: judging purely by the anecdotal evidence in the NYT article, it seems that the effect size may have been large enough that a properly implemented early stopping criterion could have allowed the trial to be finished at an earlier date, and the control patients could then have been switched to the treatment (as was done with the HIV circumcision study a couple of years ago).

        I think the big game-changer will be the advent of electronic medical records: the AI/ML/stats community will have available huge amounts of messy and complicated observational data, which will hopefully allow for studies which are infeasible in the current trial context, such as drug interactions and off-label use. David Madigan is already using this type of data for pharmacovigilance studies.

        It is also worth keeping in mind the flipside: drug companies and consumers have billions of dollars at stake, and the potential risk of approving an ineffective or dangerous drug can have significant financial and human consequences (see the recent Avastin controversy). There are already significant problems with cherry picking of data; the potential for cherry picking of models/algorithms is another factor to keep in mind.

        1. When the world acts according to an IID structure on (state,potential outcomes), it’s valid to exchange explicit randomness in an experimental design for randomness in the world via oblivious deterministic decisions. This is discussed in the first half of the exploration scavenging paper in a machine learning context. In the context of a clinical trial, the IID assumption corresponds to a belief that the nature of cancer and the effects of treatment do not change over time (identicality) and that the outcomes for one patient are independent of the outcome for another patient conditioned on the state (independence). Both seem pretty reasonable for skin cancer. The oblivious act is treating with the existing treatment before the new one becomes available. This IID structure is precisely what allows chemists, biologists, and physicists to compare two experiments done sequentially rather than with explicit randomization, so at least for many people this is accepted practice. For the rest, perhaps there is some education gap or perhaps they simply haven’t thought about how reasonable the IID assumption is here.

          I think everyone is in agreement that better data can make a pretty huge difference here. And, I agree that we should be vigilant about the weaknesses of clinical trials in practice as well, with “do 20 trials and report the best outcome” being the most obvious abuse method. But again, these both seem like challenges that can be overcome.
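          As a toy illustration of the exchange argument (the severity model and all numbers here are invented for the sketch), a Monte Carlo draw of IID (state, potential outcomes) gives the same standard-treatment estimate whether the cohort is last year’s oblivious deterministic assignment or a concurrent randomized control arm:

```python
import random

rng = random.Random(1)

def draw_patient():
    """Hypothetical IID (state, potential outcomes) draw: a severity
    state in [0, 1], with survival probability under each treatment
    depending on it."""
    severity = rng.random()
    p_standard = 0.8 - 0.4 * severity
    p_new = 0.9 - 0.3 * severity
    return p_standard, p_new

def outcome(p):
    return 1 if rng.random() < p else 0

n = 100_000
# "last year": everyone got the standard treatment (an oblivious,
# deterministic decision, not explicit randomization)
historical = [outcome(draw_patient()[0]) for _ in range(n)]
# "this year": a concurrent randomized control arm
concurrent = [outcome(draw_patient()[0]) for _ in range(n)]

# Both estimate the same population quantity, so they agree up to
# sampling noise.
print(sum(historical) / n, sum(concurrent) / n)
```

          If identicality failed, say because the severity distribution shifted between years, the two estimates would diverge; that is exactly the assumption under debate here.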

          1. My point in the first post is that IID is NOT reasonable in this context: there can be significant between trial variation, which in some cases can be larger than the actual effect you are trying to estimate. It should be possible to use techniques such as hierarchical/multilevel models, and with some clever trial design, one might be able to make the control group smaller, but I can’t see how it could be dispensed with altogether. However, this opens up the problem of deciding which prior data should be incorporated into the model (and all the selection issues that I mentioned in the second post).

          2. You appear to believe that the treatment of one patient substantially affects the response of another? I simply disagree.

            All other sources of trial error you’ve mentioned appear at least plausibly controllable, and I think we should control them so as to make the clinical trial system work better.

          3. I doubt Simon is arguing against data points being independent, but that they are not identically distributed from trial to trial. Who knows which kind of different unobserved selection biases are lurking among these different datasets. IIDness is often very unreasonable.

            The problem is way harder than it seems, and there is no shortage of literature on observational studies that attempts to integrate different studies that were not randomized. I just think that better optimization algorithms are very far from the main issue here. What is needed are ways of integrating assumptions linking different observational regimes and the randomized ones: combining useful “default” assumptions and domain knowledge.

            Incidentally, there is some sophisticated literature on planning and reinforcement learning in the medical literature, often under other names. See for example the works of James Robins and Susan Murphy.

          4. Well, that depends on what you mean by “affects the response of another [patient]”: of course there is no direct causal relationship between two patients, but the fact that they were treated in the same hospital, in the same trial, or by the same doctor means they cannot be regarded as independent (draws from the sample space of all possible patients). However, it is usually valid to assume that they are *conditionally* independent, given the confounding variables (hospital/doctor/etc.)

            My point is that these effects ARE controllable, but only with the use of control groups.

          5. Ok, I think I understand—you think the second ‘I’ is violated.

            But, I think there’s a misdefinition of the distribution here. Instead of defining it as the events that a particular doctor in a particular hospital treats, it seems superior to define it as the outcome of a random (hospital,doctor,patient) triple drawn according to the distribution over such cases. This definition works better, as improved outcomes with respect to this definition are exactly what we want. This definition also has the property that it typically doesn’t change very fast, implying the second ‘I’ is reasonably sound.

            So, where we might still differ is in whether or not this distribution can be approximately sampled from effectively. There are certainly organizational difficulties in doing such, but I’m not ready yet to declare them impossible.

            Ricardo, do you have a summary of Robins & Murphy work?

          6. Hi, John

            Much of Robins’ work addresses one of the main points you are making: on how to adjust different datasets so that they behave as if (or close to) what you would get according to a particular policy – without explicitly having data sampled under the policy. One of his classic examples from the 1980s is how to estimate a multiple stage sequence of treatments for AIDS using different doses of AZT and other treatment options, with data that wasn’t necessarily randomized. In other words, how to reduce the problem of evaluating the outcome of a policy for which we have no or little data to the problem in which we use data recorded under a “natural policy” (i.e., observational data). Frameworks like potential outcomes and causal networks are used in this case.

            Susan Murphy also addresses the problem of optimal treatment assignment (i.e., minimizing regret of wrong decisions given to a sequence of treatments). This is pretty much reinforcement learning, and now I realize she also started to publish in some machine learning venues (nice!).

            This paper by Moodie, Richardson and Stephens provides some summary, but for the planning problem only, if I’m not mistaken (where the model is known, but not the optimal policy).

            From the computational perspective, I think these problems tend to be much simpler than the typical ones we find in the machine learning and robotics literature: action and state spaces tend to be much smaller (maybe I’m just ignorant of what physicians would like to achieve but don’t even try). But actually learning the model is very hard in the case of observational studies with little experimental data, which ends up being the most common case. This is a truly exciting problem, but it is quite hard to evaluate precisely because experimental data is relatively scarce and lives are at stake (that shouldn’t stop people from providing interesting theoretical contributions, though).

  3. I had the same reaction to the article. (Note that death is not the only outcome in these sorts of trials, just the most dramatic one.) At IAAI this year, Marty Tenenbaum gave a pitch for (1) collecting the right sort of information to allow for the kinds of approaches John is advocating and (2) engaging AI (and Machine Learning!) people in the search for treatment schemes. I found it very eye-opening and am now trying to get involved in this area.

    I think, more than the communication and education issues, the lack of systematic data collection outside of clinical trials is a huge impediment. Efforts to remedy the data issue should be primary, in my opinion. Then, we can start having the interesting sociological battles. 🙂 For example, the idea of adopting the gold standard of basing treatment on established results from carefully run trials is known as “evidence based medicine”. The idea began being popularized only as recently as the 1990s. It is still met with resistance and skepticism by a (shrinking but vocal) number of physicians. (My spouse is a doctor and has seen these reactions first hand.) Also, the technical issues (for example, dealing with spurious correlations and doing appropriate feature selection so outcomes are generalized properly) are not trivial. But, overall, I’m very excited by the promise of RL and Machine Learning exerting a positive influence on medical care!

  4. One thing I find appalling (and this was alluded to in earlier comments) is that basically everyone knows what the effect of the control drug will be. It’s not as if this is a paired test where you’re controlling for per-person variation. So at the very least, I don’t see why they have to sample 50/50 when there is such an overwhelming prior on the effectiveness of the control drug.
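    To put a number on that intuition (with an invented outcome variance and sample sizes, and the hypothetical assumption that historical standard-treatment outcomes can be pooled into the control estimate):

```python
import math

def se_diff(n, frac_treat, var=0.25, m_historical=10_000):
    """Standard error of the treatment-minus-control estimate for a
    trial of n patients, hypothetically pooling m_historical prior
    standard-treatment outcomes with the concurrent control arm."""
    n_t = n * frac_treat
    n_c = n * (1 - frac_treat)
    return math.sqrt(var / n_t + var / (n_c + m_historical))

# With strong historical information on the control, tilting the
# allocation toward the new treatment shrinks the comparison's error.
print(se_diff(500, 0.5), se_diff(500, 0.9))
```

    Without the historical pool (`m_historical=0`), 50/50 is the optimal split; the stronger the prior information on the control, the further the design should tilt toward the new treatment.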

  5. Nice hyperbole. The trial did not kill anyone, it denied a potential treatment to a patient.

    1. The trial saved fewer lives than it could have. In other words, it wasted lives. Whether or not you consider that killing is somewhat beside the point.

  6. Ricardo Silva
    The problem is way harder than it seems, and there is no shortage of literature on observational studies that attempts to integrate different studies that were not randomized.

    Have you heard of this controversial observational study: The China Study, A Formal Analysis and Response?
    Human lives are at stake too but since it is over the long run it is even more murky.
    Does any of the opposed stances on this make any sense or is it just “ideological” and plain bulls**t on both sides?

    1. I don’t think I’m anywhere close to being informed enough to give an opinion on this, but you’re right that sometimes a lot of the discussion on the outcome of such studies (which I’m not saying is the case here) is a consequence of people defending their pet theories.

      If you are interested in the evaluation of observational studies, a good starting point is looking at cases where there was a randomized follow-up. But even there, other issues might arise, as in the recent case of hormonal therapy in women and the link to heart disease. For instance, see

      Hernán MA, Robins JM. (2008). Authors’ Response, Part I: Observational Studies Analyzed Like Randomized Experiments: An Application to Postmenopausal Hormone Therapy and Coronary Heart Disease. Epidemiology 19(6):766-779.

  7. completely unconvinced.

    I don’t buy this regret idea at all. The issue is that I don’t believe doctors can honestly randomize when they provide a test procedure — it’s inherent in human nature to give the procedure to those who really need it. Add in the placebo effect, and this is a total mess. I don’t see an honest way to keep the human out of the loop unless honest controls are done (particularly for drug treatments and surgical treatments).

    In fact, the error in having a ‘bad’ procedure become accepted is huge. Just look at cases where it is difficult to do a controlled study, say open heart surgery (e.g., bypasses). How can we possibly control for this? No one who isn’t in a severe condition is going to allow their chest to be cut open, and anyone who does is going to experience a gigantic psychological effect (helping recovery) just from having their chest cut open. This is now a standard surgery, which is pretty questionable as to whether it actually helps.

    1. I suspect you need to read and think about it a bit more thoroughly, as step (1) is explicitly about not having doctors randomize.

  8. “where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways”

    I fail to see how a little massaging could transform EXP3.P or EXP4 into a useful algorithm for the type of problem you mention, although I’m familiar with both algorithms (it’s probably just my fault, but I can imagine how difficult it would be for an outsider to model the problem).
    I feel it is not obvious at all, and a resulting algorithm would have to optimize with respect to different performance measures. So, would someone care to sketch how to do this (or write a paper about it)?

    1. One basic issue is that people often are a bit sloppy when proving theorems, not in the sense of correctness, but rather in the sense of ‘tightest possible bound’. There are tricks which allow tighter bounds to be applied inside the core theorems.

      The second basic issue is a matter of definition matching. What’s “reward” in a clinical trial?

      A third issue has to do with time delay effects. This requires tweaking the core algorithm & theorem to deal with latency and one sided reward bounds (i.e. the patient has lived at least 9 months under the treatment but may live longer).

      Some of these issues are subtle enough to require some careful rethinking.
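      For the time-delay issue, here is a minimal sketch of one-sided reward bounds; the 24-month horizon and the capped-survival-time reward definition are assumptions of the sketch, not anything established in the thread:

```python
def reward_bounds(followup_months, died, horizon=24):
    """One-sided reward bounds for a partially observed patient.

    Reward is hypothetically defined as survival time capped at
    `horizon`, rescaled to [0, 1]. A death pins the reward down
    exactly; a patient still alive only yields a lower bound.
    """
    lo = min(followup_months, horizon) / horizon
    if died:
        return lo, lo    # outcome fully observed
    return lo, 1.0       # censored: lived *at least* this long

# A patient alive at 9 of 24 months has reward somewhere in [0.375, 1].
print(reward_bounds(9, died=False))   # -> (0.375, 1.0)
```

      A conservative bandit update can then score the new treatment by its lower bound and the incumbent by its upper bound, preserving one-sided guarantees while outcomes mature.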

  9. Consider this simple thought experiment. Randomized trial for AIDS (or other historically increasing prevalence disease) vaccine A performed in the 50s. Treatment gets AIDS at a rate of 0.1%, control 0.5% (forget side effects for the sake of the argument). Vaccine is approved. New vaccine B tested in 2000 as resistance develops to old vaccine. Treatment gets AIDS at a rate of 1%, control 5%. Wait, having a new control is rejected on ethical grounds. So old control is used. New vaccine is declared ineffective. Pandemic ensues. Does that “deny the effectiveness of clinical trials”?

    1. The AIDS storyline doesn’t work because AIDS wasn’t around in the ’50s, and there definitely wasn’t a vaccine.

      Aside from that, the flaw in this methodology is that you aren’t measuring current practice in 2000. This should be easy—after all it is widely used.

Comments are closed.