@KertViele

Highly powered clinical trials are expensive. The sample sizes involved are the sum of the information you likely need to answer the question and an expensive insurance policy against bad luck. This insurance policy is part of the reason we see “significant but meaningless” results. It also wastes resources. We are like a city of homeowners, each paying the full replacement price of our home as an insurance premium. There are alternatives…

Introduction

Suppose I have a novel therapy. I hope my therapy will improve outcomes on an accepted disease scale by 5 points. I go to my friendly neighborhood statistician for a sample size. The statistician asks me the alpha level (2.5% one sided, standard medical trial), my desired power (90%), and my anticipated standard deviation for the scale (20 points). The statistician then tells me I need 337 patients per arm. After profusely thanking the statistician and promising an enormous line in my grant budget for statisticians (ok, sorry, got carried away there…), I start thinking about what different results from my study might mean. Dreams of a publication with p<0.001 go through my head. I ask the statistician “suppose I get my hoped for 5-point improvement. What will be the p-value? What effect do I need to see in my study to get p=0.001?”
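For anyone who wants to check the arithmetic, here is a minimal sketch of the standard two-arm, normal-approximation sample size formula the statistician is implicitly using (the function and names are illustrative, not from any particular package; a real design would use validated software):

```python
import math
from scipy.stats import norm

def n_per_arm(delta, sd, alpha=0.025, power=0.90):
    """Per-arm sample size for a two-arm comparison of means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha)  # one-sided critical value, ~1.96
    z_beta = norm.ppf(power)       # quantile for the desired power, ~1.28
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

print(n_per_arm(delta=5, sd=20))   # 337
```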

Rejection regions and type 1 and type 2 errors (if this is old news to you, just skip to the next section)

Most trialists are familiar with type 1 error, the chance of mistakenly rejecting “no effect” when in fact the therapy has no effect, and type 2 error, the chance of mistakenly NOT rejecting “no effect” when the therapy actually works. Power is just one minus the probability of type 2 error. In a highly powered standard clinical trial, we try to make the chance of both errors small, so we use alpha = Probability(type 1 error) = 2.5% and we typically choose power of 90% (or at least 80%) to limit the chance of type 2 error to 100% - 90% = 10% (or at most 20%) for our hoped for effect.

The force we are fighting against is sampling error. A dud therapy can have random high results in a trial, fooling us into thinking it works when it doesn’t. A good therapy can have random low results. We don’t want to miss a good therapy because it had bad luck in a trial. The sample size of 337 per arm protects against these risks as shown in Figure 1.

Figure 1. Graphical description of our standard clinical trial, N=337 per arm. The red shows the distribution of observed effect when the therapy has true effect 0 (null), and the green shows the distribution of observed effect when the therapy truly has our hoped for 5-point effect. The vertical line shows the threshold for rejecting the null hypothesis of no effect, which simultaneously achieves 2.5% type I error under the red and 90% power under the green.

In Figure 1 the red distribution shows the distribution of observed treatment effects when the therapy doesn’t work (true effect is 0). While the true effect is 0, observed effects anywhere from plus 3 to minus 3 are decently likely to occur. Getting an observed effect of 2, for example, is well within this sampling variability and not conclusive evidence the therapy works. The solid vertical line is the threshold for rejecting the null hypothesis of no effect. The vertical line sits at a treatment effect of 3.02, above 97.5% of the null distribution. There is only a 2.5% chance that a null (no effect) therapy will produce an observed effect above the vertical line. This limits our chance of type 1 error to alpha=0.025. The green distribution shows the range of likely values when the therapy achieves the hoped for 5-point effect. Values anywhere from 2 to 8 are decently likely. Fortunately, if the true effect is 5 points, then we have a 90% chance of getting a value above the vertical line and rejecting the null hypothesis. This is our 90% power. Why did we choose N=337 per arm? The red and green distributions get narrower as the sample size increases. N=337 is the smallest sample size where we can find a cutoff that simultaneously gives us 2.5% type I error and 90% power. Smaller N results in too much overlap.
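Here is a minimal sketch reproducing the numbers behind Figure 1 under the same normal approximation (my own illustration): the standard error of the observed effect, the rejection threshold near 3.02, and the roughly 90% power.

```python
from scipy.stats import norm

n, sd, delta, alpha = 337, 20, 5, 0.025
se = sd * (2 / n) ** 0.5                         # SE of the observed effect, ~1.54
threshold = norm.ppf(1 - alpha) * se             # rejection threshold, ~3.02
power = norm.sf(threshold, loc=delta, scale=se)  # P(observed > threshold | true effect 5), ~0.90
print(round(threshold, 2), round(power, 3))      # 3.02 0.901
```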

I had a poll on Twitter recently that asked “You run a standard trial (two arms, normal situation, Z-score test) that is 90% powered for an effect X. If your data end up with an observed effect equal to X, what is the p-value?”. The p-value for a trial is the probability your observed effect or larger would be observed under the null hypothesis. In terms of Figure 1, the p-value is the proportion of the red distribution that lies above your observed value. If the observed value is 5, you can see virtually none of the red distribution is above 5. The p-value is 0.0006 (one sided, 0.0012 two sided).

A p-value of 0.025 corresponds to the vertical line, which was constructed to have only 2.5% of the red distribution above it. An observed effect at the vertical line of 3.02, far less than our hoped for 5, obtains a p-value of 0.025. And if you dream of a p<0.001 (one sided), you get that for any observed effect above roughly 4.76, still below your hoped for 5.
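A similar sketch answers the p-value questions (again just an illustration under the normal approximation, not any original analysis code):

```python
from scipy.stats import norm

n, sd = 337, 20
se = sd * (2 / n) ** 0.5               # ~1.54

def one_sided_p(observed):
    """P(an effect this large or larger | true effect 0)."""
    return norm.sf(observed / se)

def effect_needed(p):
    """Smallest observed effect with a one-sided p-value below p."""
    return norm.isf(p) * se

print(round(one_sided_p(5), 4))        # 0.0006
print(round(effect_needed(0.025), 2))  # 3.02
print(round(effect_needed(0.001), 2))  # 4.76
```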

Some ramifications…

From the previous section, our threshold for rejecting the null hypothesis is 3.02, only 60.4% of our hoped for 5-point improvement. If we actually achieved an observed effect of 5, the p-value would be 0.0006 (one-sided). The 60.4% relationship applies to most standard clinical trial settings based on means and proportions. As noted by one respondent, it depends on getting the standard deviation right. If your anticipated standard deviation of 20 is off, this relationship changes.

The first ramification of this is the phenomenon of statistically significant but clinically meaningless results. You should always power your trial for a meaningful effect, so I’m assuming 5 points is meaningful. After getting the result from your statistician, you should always ask “what is the smallest treatment effect that results in significance”. Here that is 3.02. If 3.02 is not meaningful, then the trial design has the potential to produce a statistically significant but not clinically significant result. If you are looking for high power, the threshold for rejecting the null hypothesis has to be less than the effect you power for, because you are protecting against bad luck. You should know what it is and think about how it would be interpreted. If the threshold is not meaningful, you should consider alternatives.

I’m going to spend more time on the second ramification.  A p-value of 0.0006 is pretty small. If we observe a value near 5, odds are we achieved significance long before N=337 per arm. N=337 was not chosen because it is always needed, it was chosen because it is sometimes needed. N=337 protects against bad luck when our therapy works, but bad luck doesn’t always happen. If our therapy has a true effect of 5, half the time we will get an observed effect above 5 and a p-value below 0.0006.

A large portion of our N=337 per arm is therefore protecting against a risk that doesn’t always happen. It’s an insurance policy against random lows. For something like home insurance, we are spreading out risk. My home insurance will rebuild my home when certain rare events happen. I wouldn’t pay the full replacement cost of my house as a premium, because those events are rare. Insurance works, and costs less than the full cost of my house, because I share that risk among other homeowners.

The “clinical trial insurance” we buy against bad luck (random lows when the drug works) is typically not shared. Every trial pays “full price”, so to speak. How might we share these risks?

One option is flexible sample sizes. If our observed effect is closer to the hoped for 5, we don’t need to make a “claim” on our insurance policy, and we should be able to stop the trial short of N=337. When the effect is smaller but still valuable, we need closer to N=337. More importantly, this should apply to futility as well. Just as effects near 5 tend to achieve significance much earlier than N=337, observed effects near 0 have limited hope of ever achieving significance and should also be stopped early. Much like we all pay homeowners premiums, with insurance payouts to those whose homes are damaged or destroyed, clinical trials should have a minimum sample size that is always needed and share the risk that bad luck will require a much larger sample size to draw a firm conclusion.

A numerical example.

Apologies for switching examples, but it’s convenient (this one is from a short course I gave). We are designing a trial with a dichotomous outcome (responder or not). We anticipate 30% responders on control and hope for 50% responders on treatment.

If we enroll N=100 per arm (200 total), we achieve 83.3% power. This trial has the same issue as before. If we actually observed 30% control and 50% treatment, the p-value is 0.0016, far less than 0.0250 (it’s not 0.0006 because we have 83.3%, rather than 90%, power). We’ve bought an insurance policy allowing us to reject for random lows in observed treatment effect, and it isn’t always needed.
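A rough normal-approximation check of those two numbers (an illustrative sketch using an unpooled z statistic; the exact 83.3% and 0.0016 depend on the particular test used, so small discrepancies are expected, e.g. this gives power closer to 0.84):

```python
from scipy.stats import norm

n, p_c, p_t = 100, 0.30, 0.50
delta = p_t - p_c
se = (p_c * (1 - p_c) / n + p_t * (1 - p_t) / n) ** 0.5   # unpooled SE, ~0.068

p_value = norm.sf(delta / se)                  # one-sided p if we observe exactly 30% vs 50%, ~0.0016
power = norm.sf(norm.isf(0.025) - delta / se)  # approximate power at alpha = 0.025 one-sided, ~0.84
print(round(p_value, 4), round(power, 3))
```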

Suppose we employ a group sequential design. We will look at the data at N=50, 60, 70, 80, and 90 per arm, in addition to the final 100 per arm. If the p-value at an interim analysis is sufficiently small, we declare success and stop the trial; otherwise we continue to the next analysis. Note we can’t use p<0.025 at each interim analysis because of multiplicities, but there are standard methods for finding p-value thresholds that maintain the overall 2.5% type I error. Table 1 shows those p-value thresholds and the operating characteristics of the trial.

N per arm              50        60        70        80        90        100
p-value required       0.0031    0.0041    0.0063    0.0093    0.0132    0.0183
Pr(win at N per arm)   0.2795    0.1284    0.1144    0.1158    0.1003    0.0790
Pr(lose at N per arm)  0         0         0         0         0         0.1824

Table 1. Operating characteristics for the group sequential design. Power is 81.8% (compared to 83.3% for the fixed N=100 per arm trial) and the expected total sample size is 148.3 (compared to 200 for the fixed trial).

The p-values in the table are small for N=50,60,70 per arm, but they are achievable if our therapy actually works. They don’t correspond to “unrealistic” observed effects but are effects quite consistent with our hoped for effect. The Pr(win at N per arm) row shows the probability that we will stop at each interim. There is almost a 28% chance of stopping at N=50 per arm, with about 10% chance of stopping at each of the other interims. Only about 26% of trials reach the final N=200 sample size. Most of the time we didn’t need our full insurance policy.
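To see where operating characteristics like these come from, here is a small simulation sketch. It is purely illustrative: it takes the interim schedule and p-value thresholds directly from Table 1 rather than deriving them, and it uses a simple unpooled z-test, so its estimates will only roughly match the table.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
looks = [50, 60, 70, 80, 90, 100]                              # N per arm at each analysis
thresholds = [0.0031, 0.0041, 0.0063, 0.0093, 0.0132, 0.0183]  # success boundaries from Table 1
p_c, p_t, n_sims = 0.30, 0.50, 20_000

stop_n = np.zeros(n_sims)
wins = np.zeros(n_sims, dtype=bool)
for i in range(n_sims):
    # simulate full accrual once, then "look" at growing prefixes of the data
    ctrl = rng.random(looks[-1]) < p_c
    trt = rng.random(looks[-1]) < p_t
    stop_n[i] = looks[-1]
    for n, thr in zip(looks, thresholds):
        pc_hat, pt_hat = ctrl[:n].mean(), trt[:n].mean()
        se = (pc_hat * (1 - pc_hat) / n + pt_hat * (1 - pt_hat) / n) ** 0.5
        p_val = norm.sf((pt_hat - pc_hat) / se) if se > 0 else 1.0
        if p_val < thr:                    # stop for success at this analysis
            stop_n[i], wins[i] = n, True
            break

print("power ~", wins.mean())                   # roughly 0.82
print("expected total N ~", 2 * stop_n.mean())  # roughly 150
```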

The power of the trial is reduced to 81.8%, as opposed to the original 83.3%. This can either simply be accepted as a tradeoff between sample size and power, or it can be recovered by increasing the maximal sample size. We will explore the latter and increase the maximal size of the trial to N=110 per arm. This trial has the potential to be larger than the fixed N=100 per arm trial, but most of the time it stops before then. Operating characteristics are shown in Table 2.

N per arm              50        60        70        80        90        100       110
p-value required       0.0023    0.0031    0.0047    0.0069    0.0097    0.0134    0.0180
Pr(win at N per arm)   0.2469    0.1086    0.1359    0.1094    0.0982    0.0858    0.0656
Pr(lose at N per arm)  0         0         0         0         0         0         0.1495

Table 2. Operating characteristics with maximal sample size increased to obtain equivalent power to a fixed trial. 

Our power is now 85.0% (higher than the fixed 83.3% power). The expected sample size is 156.4, still lower than 200. Only 21.5% of trials reach N=110 per arm. The result is a trial that might be larger, but most of the time is smaller. If we use such trials in the long run, on average we have higher power while saving 22% of the sample size whenever the therapy is truly effective.

Early futility is far more important. While early success allows us to save whenever the therapy is truly effective, we should also be saving whenever the therapy is ineffective. This is unfortunately more likely in many therapeutic areas. We will add a rule which says we will stop if the predictive probability of eventual trial success (at maximal sample size) is less than 5%. While futility is aimed at savings when the therapy is ineffective, we need to make sure it doesn’t have a detrimental effect on effective therapies. Table 3 shows the operating characteristics for effective therapies after adding futility.

N per arm              50        60        70        80        90        100       110
p-value required       0.0023    0.0031    0.0047    0.0069    0.0097    0.0134    0.0180
Pr(win at N per arm)   0.221     0.115     0.127     0.129     0.103     0.085     0.052
Pr(lose at N per arm)  0.031     0.015     0.013     0.016     0.016     0.023     0.054

Table 3. Operating characteristics after adding futility.

In Table 3 note the power dropped from 85.0% to 83.3% (identical to the original fixed N=200 design). Futility cost some power because a few trials stop early but would have otherwise gone on to success. The expected sample size is now 149.9. Bottom line…under the hoped for treatment effect, we have equivalent power to the fixed trial, and our expected sample size is 50 patients lower.
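For concreteness, here is a minimal sketch of how a predictive probability of eventual success can be computed with a Beta-Binomial model. It is an illustration only: it assumes flat Beta(1,1) priors on each arm's response rate and uses the final-analysis p-value threshold of 0.0180 from Table 2; the exact rule behind Tables 3 and 4 may differ.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def predictive_prob_success(x_c, x_t, n_now, n_max=110, final_p=0.0180, n_draws=10_000):
    """Predictive probability of success at the final analysis.

    x_c, x_t: responders so far on control / treatment (n_now per arm).
    Assumes independent Beta(1,1) priors on each arm's response rate.
    """
    n_rem = n_max - n_now
    # posterior draws for each response rate, then posterior predictive
    # draws for the remaining patients in each arm
    theta_c = rng.beta(1 + x_c, 1 + n_now - x_c, n_draws)
    theta_t = rng.beta(1 + x_t, 1 + n_now - x_t, n_draws)
    x_c_final = x_c + rng.binomial(n_rem, theta_c)
    x_t_final = x_t + rng.binomial(n_rem, theta_t)
    # z-test at the final analysis for each simulated completed trial
    pc_hat, pt_hat = x_c_final / n_max, x_t_final / n_max
    se = np.sqrt(pc_hat * (1 - pc_hat) / n_max + pt_hat * (1 - pt_hat) / n_max)
    p_final = norm.sf((pt_hat - pc_hat) / np.maximum(se, 1e-12))
    return (p_final < final_p).mean()

# e.g. 15/50 responders on control vs 17/50 on treatment at the first interim;
# the rule stops for futility if this predictive probability falls below 5%
print(predictive_prob_success(x_c=15, x_t=17, n_now=50))
```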

Futility is much more valuable under the null hypothesis. Suppose the therapy has no effect. Table 4 shows the operating characteristics. Ineffective therapies tend to produce small observed treatment effects, which trigger early stopping for futility, typically earlier than stopping for success.

N per arm              50        60        70        80        90        100       110
p-value required       0.0023    0.0031    0.0047    0.0069    0.0097    0.0134    0.0180
Pr(win at N per arm)   0.003     0.002     0.003     0.003     0.003     0.005     0.004
Pr(lose at N per arm)  0.589     0.113     0.085     0.066     0.053     0.038     0.032

Table 4. Operating characteristics of the early success and early futility trial for ineffective therapies. A large proportion of trials are stopped early for futility.

Approximately 80% of trials stop at or before N=70 per arm. The expected sample size is 123.1. Futility is extremely effective at stopping bad treatments. The gains from futility are greater than the gains from early stopping for success.

Want NIH to fund 50% more trials with the same money…?

Unfortunately, most novel treatments don’t work. Suppose we look at a population where 3/10 of late phase therapies work (achieve our desired 50% rate) and 7/10 don’t (no treatment effect, 30% rate in both arms). If we fund fixed trials, we need 200 total patients per trial and each trial achieves 83.3% power. If we use the group sequential design in Tables 3 and 4, we require an average of (0.30*149.9)+(0.70*123.1) = 131 patients per trial (a weighted average of the sample sizes for effective and ineffective therapies, weighted by their prevalence). For the same money, in the long run we can fund 52% more trials (200/131 = 1.52) using this design.

Most of this gain is from futility. At a minimum we need to be more aggressive about stopping trials with poor results. There has traditionally been academic reticence about this because of the major effects of stopping funding. But it’s better for patients, and 52% more grant funding is quite a compensation.