We conduct clinical trials to find effective treatments. Much like Thomas Edison and the light bulb, one trial is rarely enough. In difficult indications, we may test hundreds of potential therapies, utilizing thousands of patients and billions of dollars. Yet despite this required investment of decades, we design trials one at a time.
Our core question must be “what is the fastest way to find an effective therapy?” Given that we usually need multiple trials, it is wise to think about the optimal design of trial collections. How many therapies should we investigate at a time? How should we balance thoroughly investigating each compound against the opportunity cost of not exploring others? We do not pretend to solve this problem here, but we illustrate examples that show the benefit of thinking at the collection level. These examples rely on Bayesian thinking, explicit consideration of the time to success across trials, consideration of opportunity costs, and aggressive exploration of many possible therapies through designs such as platform trials. These ideas can cut the time to find effective therapies by 50%.
Some ground rules
We cannot consider this problem in its full generality, so we will assume we have a fixed patient queue available per year and a collection of possible therapies to investigate. A “design” in this context is a strategy for exploring the set of possible therapies. This might be as simple as “pick one, run a standard trial, if it fails pick another possible therapy, run another standard trial…etc”.
We will measure the quality of a process through two quantities:
- The probability of achieving a successful trial by time T (e.g. a trial with results that further approval), as a function of T. As we go through time, any reasonable process will have an increasing chance of finding an effective therapy. We will either want to maximize that probability for a particular T, or simply have that probability go up as quickly as possible with T.
- The probability the process stops on a truly effective therapy. False positive trials are possible. A successful trial is only truly valuable if we stop with a truly effective therapy. Note this is an inherently Bayesian quantity, which requires us to consider the probability investigational therapies are effective.
This problem is most interesting in difficult indications, where relatively few of our potential therapies will be effective. A key insight will be a preference toward investigating many therapies. If we want to find a truly effective therapy as quickly as possible, we should fail quickly and invest the savings forward to new compounds. The rate limiting factor is not the quality of our individual trials; rather, it’s getting an effective therapy under investigation in the first place.
Example 1 – How many therapies to investigate?
I ran a Twitter poll a while back which asked the following question: You have 240 patients available and can investigate as many potential therapies as you want, with each therapy having a 20% chance of having the desired disease effect. How many do you investigate?
The options were
3, each with N=80 and 90.4% power
4, each with N=60 and 80.7% power
5, each with N=48 and 71.5% power
More than 5
The options above are all obtainable numbers (I had a treatment effect in mind and found the power for each sample size; I just didn’t include all the details in 280 characters). In the context of our criteria above, we want to maximize the chance of finding an effective therapy by the time we have enrolled 240 patients.
What is the key tradeoff here? Clearly investigating more therapies allows us a better chance of having an effective therapy under consideration. Every therapy has a 20% chance of being effective. If we investigate 3, there is a 1-(0.80)^3 = 48.8% chance that at least one of our three will be effective. Investigate 5 and that chance increases to 1-(0.80)^5 = 67.2%. Leaving the cure on the shelf while we thoroughly investigate duds doesn’t help patients. However, the reduced power with more therapies increases our chance of missing something effective. Overlooking an effective therapy in a trial also doesn’t help patients. Which is the greater risk?
With 3 therapies and 90.4% power, each of the 3 trials has a (0.20)*(0.904) = 0.1808 chance of investigating an effective therapy and resulting in a successful trial. The probability of getting at least one truly effective success in 3 tries is 1-(1-0.1808)^3 = 0.4502.
With 5 therapies and 71.5% power, each trial has a (0.20)*(0.715) = 0.1430 chance of successfully finding an effective therapy. But we get 5 chances. The probability of getting at least one truly effective success in 5 tries is 1-(1-0.1430)^5 = 0.5377. We have increased our chance of finding a truly effective therapy by 53.77 - 45.02 = 8.75 percentage points.
The risk of leaving the cure on the shelf is greater than the risk of missing a cure in the trial.
By the same logic, we find that investigating 4 therapies has a 50.54% chance of finding a truly effective therapy, making our options
3, each with N=80 and 90.4% power (45.02% chance of finding effective therapy)
4, each with N=60 and 80.7% power (50.54% chance of finding effective therapy)
5, each with N=48 and 71.5% power (53.77% chance of finding effective therapy)
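For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The function names are ours, chosen for illustration, and the trials are treated as independent, as in the hand calculation.

```python
# Example 1: chance of at least one truly effective success for each option.
PRIOR_EFFECTIVE = 0.20  # each therapy has a 20% chance of being effective

# (number of therapies, power per trial), taken from the poll options
OPTIONS = [(3, 0.904), (4, 0.807), (5, 0.715)]


def prob_any_effective(k, prior=PRIOR_EFFECTIVE):
    """Probability that at least one of k therapies is truly effective."""
    return 1 - (1 - prior) ** k


def prob_true_success(k, power, prior=PRIOR_EFFECTIVE):
    """Probability that at least one of k independent trials succeeds
    on a truly effective therapy."""
    per_trial = prior * power  # effective AND detected
    return 1 - (1 - per_trial) ** k


for k, power in OPTIONS:
    print(k, round(prob_any_effective(k), 4), round(prob_true_success(k, power), 4))
```

The last column reproduces the 45.02%, 50.54%, and 53.77% figures listed above.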
A possible objection to investigating multiple therapies is the potential for false positives, our second criterion above. This cost, fortunately, is minor. Here we turn to Bayesian thinking and compute the probability a successful trial is a true positive.
Successful trials could result from a truly effective therapy being successful, or from a null therapy achieving a type I error. To find the probability a successful trial identified a truly effective therapy, we turn to Bayes rule. We assumed 20% of our therapies are effective (Pr(effective) = 0.20). Of those 20%, 90.4% (the power) will result in successful trials. In total (0.20)*(0.904) = 0.1808 of our trials will be successful trials with effective therapies. Additionally, 2.5% (the type I error rate) of the 80% “null” therapies will result in successful trials, or (0.80)*(0.025) = 0.0200 of the total.
If we get a successful trial, it’s either one of the 18.08% with an effective therapy, or one of the 2.00% with a null therapy. The chance of an effective therapy is 0.1808/(0.1808+0.0200), or about 90.0%. Not a guarantee, but our successful trial likely had an effective therapy.
When investigating 5 therapies, the power is reduced to 71.5%. Repeating this calculation the probability a successful trial had a truly effective therapy is 87.7%. This is a very modest drop given the increased chance of investigating and identifying an effective therapy in the extra 2 attempts.
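The Bayes rule calculation is short enough to write as a small function. This is only a sketch; the function name is ours, and the 2.5% type I error rate is the one used in the text.

```python
def positive_predictive_value(prior, power, alpha=0.025):
    """Pr(therapy is effective | trial was successful), by Bayes rule."""
    true_positives = prior * power         # effective and successful
    false_positives = (1 - prior) * alpha  # null but successful (type I error)
    return true_positives / (true_positives + false_positives)


# 3 therapies at 90.4% power versus 5 therapies at 71.5% power
print(positive_predictive_value(0.20, 0.904))
print(positive_predictive_value(0.20, 0.715))
```

The two values come out near 0.90 and 0.877, matching the figures above.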
This design may be improved, for example, by starting with 5 therapies and dropping poorly performing therapies for futility, investing the savings forward to increase the sample sizes for the other therapies and thereby increasing power. This is fundamental to platform trials, but before we turn to that, we consider an example of a “one at a time” process.
Example 2 – To futility or not to futility? That is the question.
We often discuss potential futility rules with clients. This can be a difficult decision in the context of a single trial, but often is a very clear decision in the context of a trial collection.
Start with a standard trial with a dichotomous outcome. Current standard of care has a 20% responder rate, and we are hoping to achieve a 30% responder rate with new therapies. As such, we are planning a 400 patient per arm (N=800 total) study which achieves 90.6% power for the desired effect.
Standard futility rules can significantly save sample size if the therapy is ineffective. A standard rule based on predictive probabilities creates a tradeoff that must be weighed by the sponsor. If the therapy is not effective, we significantly save on sample size. The actual sample size is random, but the expected sample size drops to N=400 total, saving on average 50% of our total. Unfortunately, when our therapy is truly effective, we might incorrectly stop some trials early that might have gone on to be successful, resulting in a loss of power. For our desired effect, power becomes 87.1% (thus we lose 3.5 percentage points of our original 90.6% power).
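As a rough check on that power figure, here is a sketch using the usual normal approximation for comparing two proportions. The choice of test (a pooled-variance z-test at a one-sided 2.5% level) is our assumption, since the exact analysis is not specified here; the function name is also ours.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treatment, n_per_arm, alpha=0.025):
    """Approximate power of a one-sided two-proportion z-test.

    Assumption: pooled standard error under the null, unpooled under
    the alternative, equal allocation to the two arms.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    p_pooled = (p_control + p_treatment) / 2
    se_null = sqrt(2 * p_pooled * (1 - p_pooled) / n_per_arm)
    se_alt = sqrt(
        p_control * (1 - p_control) / n_per_arm
        + p_treatment * (1 - p_treatment) / n_per_arm
    )
    z = ((p_treatment - p_control) - z_alpha * se_null) / se_alt
    return std_normal.cdf(z)


print(power_two_proportions(0.20, 0.30, 400))
```

Under these assumptions the result is roughly 0.906, in line with the 90.6% quoted above. A different test or an exact calculation would shift the value slightly.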
This is often a difficult choice for a sponsor, particularly a biotech with a single compound or an academic needing stability in funding a lab. Even for large pharma companies which may have several alternative compounds to investigate, individual teams are highly invested in their individual compounds.
Viewing this trial as one of many, we can quantify the opportunity cost of futility. Suppose this were an intractable indication, where we might expect only 10% of therapies to reach our desired 30% responder rate, with the rest nulls.
We could sequentially test compounds, with or without using the futility rule. Let’s begin without futility, allowing all trials to go to N=800 and maximizing power at 90.6%. Each of these trials has an 11.31% chance of being successful (10% of therapies are effective with 90.6% power, combined with the 90% of ineffective therapies with 2.5% type I error, or (0.10)*(0.906) + (0.90)*(0.025) = 0.1131). Note again, using Bayes theorem, the positive predictive value (the chance a successful trial truly has an effective therapy) is 0.0906/0.1131 = 80.1%.
If we compute the time to get to a successful trial (the number of trials needed follows a geometric distribution, the single-success case of the negative binomial), we find the probability of having reached a success is 11.31% at N=800 (1 trial), 45.1% after N=4000 (5 trials), and 69.9% after N=8000 (10 trials). The red dots in Figure 1 show the increasing probability of a successful trial over time/patients. Given all trials are N=800, the red dots “jump” at the completion of each discrete trial. We will eventually obtain a successful trial, but it may be a while.
Suppose we add the futility rules. We lower power slightly, but we can cycle through compounds much more quickly. With only 10% of potential therapies effective, getting something effective under investigation is the rate limiting factor. Note that futility actually improves the positive predictive value, which increases to 82.3% (futility proportionally reduces type I error more than power).
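The no-futility numbers can be reproduced in a few lines. This sketch treats successive trials as independent with the same success probability; the names are ours.

```python
PRIOR = 0.10       # 10% of therapies reach the 30% responder rate
POWER = 0.906      # power of the N=800 trial without futility
ALPHA = 0.025      # type I error rate
N_PER_TRIAL = 800  # every trial runs to full size

# Per-trial probability of a successful result (true or false positive)
p_success = PRIOR * POWER + (1 - PRIOR) * ALPHA


def prob_success_by(n_patients):
    """Probability of at least one successful trial once n_patients
    have been enrolled (only completed trials count)."""
    completed_trials = n_patients // N_PER_TRIAL
    return 1 - (1 - p_success) ** completed_trials


for n in (800, 4000, 4800, 8000):
    print(n, round(prob_success_by(n), 3))
```

This gives 0.113, 0.451, 0.513, and 0.699, the values behind the red dots.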
Determining the probability of a successful trial over time is harder with futility because of the random sample sizes in each trial. We simulated 10,000 processes. Each process is a sequence of trials conducted using the futility rules, stopping each process when a successful trial is obtained. Using these simulations, we can determine the probability this process will produce a successful trial by each sample size.
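To give a feel for how such a simulation is organized, here is a deliberately simplified sketch. It does not implement the predictive probability rule itself. Instead it assumes (these are our assumptions, not values from the actual design) a 2% type I error under futility, that every unsuccessful trial stops at one of three hypothetical interim looks, and that successful trials run to the full N=800. Only the 10% prior and the 87.1% power come from the text.

```python
import random

PRIOR = 0.10            # from the text: 10% of therapies are effective
POWER_FUTILITY = 0.871  # from the text: power with the futility rule
MAX_N = 800             # from the text: full trial size

# ASSUMPTIONS for illustration only (not from the text):
ALPHA_FUTILITY = 0.02                  # type I error with futility
FUTILITY_STOP_SIZES = (200, 400, 600)  # hypothetical looks, equally likely


def run_trial(rng):
    """Simulate one trial; return (success, patients_used)."""
    effective = rng.random() < PRIOR
    p_win = POWER_FUTILITY if effective else ALPHA_FUTILITY
    if rng.random() < p_win:
        return True, MAX_N  # successful trials run to full size
    # Simplification: every unsuccessful trial stops early for futility
    return False, rng.choice(FUTILITY_STOP_SIZES)


def patients_to_first_success(rng):
    """Run trials back to back until one succeeds; return patients used."""
    total = 0
    while True:
        success, n_used = run_trial(rng)
        total += n_used
        if success:
            return total


def prob_success_by(n_patients, n_processes=10_000, seed=1):
    """Share of simulated processes that succeed within n_patients."""
    rng = random.Random(seed)
    hits = sum(
        patients_to_first_success(rng) <= n_patients for _ in range(n_processes)
    )
    return hits / n_processes


print(prob_success_by(4800))
```

The output depends entirely on the assumed values, so it should be read as an illustration of the mechanics rather than a reproduction of Figure 1.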
The black dots in Figure 1 show the increased probability of success from utilizing the futility rules. While starting slightly lower at N=800 (the slightly decreased power in the first trial), the improved cycling of compounds quickly kicks in. At N=4800, for example, the no futility process has a 51.3% chance of having reached a trial success. The futility process, in contrast, has a 69.2% chance of having reached a trial success, all with improved positive predictive value.
Again, the value of cycling multiple compounds heavily outweighs any power loss.
Figure 1 – Increasing chance of trial success as time and patients accrue
Example 3 – Platform trials
An adaptive platform trial is a far more “state of the art” implementation of the concepts in Examples 1 and 2. In an adaptive platform trial, multiple compounds are investigated at once (the number of “slots” in the platform trial). We conduct multiple interim analyses, and at each interim evaluate each compound currently in the trial, allowing each to continue, stop for futility, or potentially stop for success (not considered in the examples above). Anytime a compound leaves the trial, either for success or futility, it is replaced by a new investigative compound. Thus, much like Example 2 (futility), we invest our sample size savings forward toward new compounds. This could allow us to cycle more compounds or increase sample size for continuing compounds.
Recall Example 1, and how the power decreased as we moved from 3 compounds (N=80 each) to 5 compounds (N=48 each). Quickly declaring futility and reallocating resources mitigates this problem. Suppose we start with 5 compounds and after 24 patients in each compound we conduct an interim analysis which stops 3 of the 5 compounds. This saves 72 patients, and is quite reasonable to expect given our results in Example 2. We could start more compounds, or we could increase the sample size of the two remaining compounds from 48 to 84, which is higher than what we had investigating 3 compounds and would recover the power. We could investigate 5 compounds and obtain equivalent power to investigating 3 compounds non-adaptively.
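The reallocation arithmetic is simple bookkeeping, sketched below. The function name and its arguments are ours, and it assumes the savings are split evenly among the continuing compounds.

```python
def reallocate(n_compounds, n_per_compound, interim_n, n_stopped):
    """Return (patients saved, new sample size per continuing compound)
    when n_stopped compounds are dropped at the interim."""
    saved = n_stopped * (n_per_compound - interim_n)
    continuing = n_compounds - n_stopped
    new_n = n_per_compound + saved // continuing  # split savings evenly
    return saved, new_n


# 5 compounds planned at N=48 each, interim after 24 patients, 3 dropped
print(reallocate(5, 48, 24, 3))  # (72, 84)
```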
In designing a platform trial, we can select factors such as maximal sample size for each compound, how many compounds are in the trial at once, and the rules for removing compounds from the trial for success or futility. I am unaware of hard and fast rules for optimizing these choices, but during the design process we can simulate multiple choices and optimize any metric we choose, for example the time needed to find a successful therapy as shown above.
Unfortunately, a full discussion of platform trials is far beyond the space of this blog, but my colleagues Ben Saville and Scott Berry have published an exploration in Clinical Trials.
In this paper, Ben and Scott lay out several design processes (one at a time, one at a time with futility, shared control, and a full adaptive platform) and illustrate the time required to obtain a successful trial (with the probability that a successful therapy is truly effective).
The adaptive platform trial cuts the time to a successful trial by 50% compared with a non-adaptive “one at a time” approach. Much of this advantage comes from the concepts shown in Examples 1 and 2, combined with the shared control arm as part of the platform trial.
As a society, we want to cure disease as quickly as possible. Optimally achieving this goal requires us to think about trials as a collection, all contributing toward a common goal. By investigating more therapies, driven by sample size savings achieved from futility or shared control, we can significantly increase the rate at which we find effective therapies.