Many thanks to those who attended the first webinar or listened to the recording online (now posted).

There were a number of questions, either during the talk or emailed to us afterward. A couple of the questions touched on controversial topics (especially adaptive randomization in two-arm trials). I’ll certainly try to reflect that controversy in my responses, but always feel free to add to the comments section. Discussion is good!

The questions below cover 1) the relationship between flexible sample sizes and phases of development, 2) the use of adaptive randomization in two-arm trials, 3) more details on how accurately an arm can be dropped, 4) global rank endpoints, and 5) more information on historical borrowing.

 

1) The relationship between flexible sample sizes and phases of development.

Online you can find rough guidelines for the sizes of trials in different phases of development. Additionally, funding mechanisms may place budget caps on different phases, which in turn caps the number of patients available.

Usually flexible sample sizes are implemented in a protocol by indicating a range, such as “the trial will enroll between 250 and 500 subjects, governed by a group sequential design.” The protocol needs to carefully lay out the rules governing the selection of the sample size, usually the timing of the interim analyses and the decisions that can be made at each interim. The sample size justification is typically identical in form to that of a fixed sample size trial, noting power for the design as a whole. Often further details, such as the likelihood of stopping the trial at each interim analysis, are included as an appendix to the protocol.
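To make that concrete, here is a minimal simulation sketch of a “250 to 500 subjects” design with a single interim analysis at 250 subjects. The binary endpoint, the assumed rates, and the efficacy/futility/final thresholds are all illustrative choices on my part (real boundaries would be derived to control type I error), but the output shows the two quantities a protocol appendix typically reports: power for the design as a whole and the expected sample size.

```python
import numpy as np

rng = np.random.default_rng(1)

def z_stat(x_t, x_c, n_per_arm):
    # Pooled two-proportion z-statistic
    p_t, p_c = x_t / n_per_arm, x_c / n_per_arm
    p_bar = (x_t + x_c) / (2 * n_per_arm)
    se = np.sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    return (p_t - p_c) / se if se > 0 else 0.0

def simulate_trial(p_ctrl, p_trt, n_interim=250, n_max=500,
                   z_efficacy=2.54, z_futility=0.0, z_final=1.99):
    # Interim look after n_interim total subjects (1:1 allocation)
    n1 = n_interim // 2
    x_c, x_t = rng.binomial(n1, p_ctrl), rng.binomial(n1, p_trt)
    z = z_stat(x_t, x_c, n1)
    if z >= z_efficacy:
        return True, n_interim           # stop early for efficacy
    if z <= z_futility:
        return False, n_interim          # stop early for futility
    # Otherwise continue to the maximum sample size and test at the end
    n2 = (n_max - n_interim) // 2
    x_c += rng.binomial(n2, p_ctrl)
    x_t += rng.binomial(n2, p_trt)
    return z_stat(x_t, x_c, n1 + n2) >= z_final, n_max

results = [simulate_trial(0.30, 0.42) for _ in range(5000)]
power = np.mean([win for win, _ in results])
avg_n = np.mean([n for _, n in results])
print(f"power ~ {power:.2f}, average sample size ~ {avg_n:.0f}")
```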

If the desired sample size range extends beyond a funding maximum, this obviously creates a problem. If you have funding for N=400 subjects and your desired range is N=250-500, you either need to constrain the range to N=250-400 or find a creative way to fund the trial past N=400. This type of conflict represents one way in which the resources available to industry allow greater innovation. In repeated use, we may find that trials with a range of 250-500 patients average 400 patients each, but we save resources on poorly performing therapies by stopping for futility and invest those resources in going to N=500 for therapies that require the additional sample size. If we were constrained to use exactly N=400 for each trial, we would overinvest in weak therapies and underinvest in promising ones.

 

2) Response adaptive randomization in two-arm trials

Response adaptive randomization refers to changing allocation probabilities based on accumulating data, lowering allocation to poorly performing arms and increasing allocation to arms performing well. Arm dropping is a special case of adaptive randomization in which you reduce the allocation to 0, completely stopping an arm. Done correctly, these methods benefit patients within the trial, as on average they increase the number of patients assigned to the more effective arms, regardless of the number of arms.

The benefit to patients outside the trial (society) is more complex. We want the trial to have a high probability of identifying the best arm for use by future subjects. This can be measured by the power of the trial and, in multiple-arm settings, the likelihood that the best active arm is chosen.

There are vast differences between the two-arm setting and the multiple-arm setting. In two-arm settings, our fundamental question is the comparison between treatment and control. Arm dropping in this context is equivalent to ending the trial, which remains sensible once our question is answered. Suppose, however, one arm is performing better but not sufficiently better to conclusively answer our question. We still need information about both arms, treatment and control. Adaptive randomization increases allocation to, and information about, one arm (the better performing one), but only at the cost of less information about the other arm. The net effect of this reallocation is to reduce the power of the trial, a clear detriment.
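If you want to see this tradeoff for yourself, a simulation along the following lines lets you compare fixed 1:1 allocation with a response adaptive rule in a two-arm trial with a binary endpoint. The particular rule here (allocation proportional to the posterior probability the treatment is better, updated in blocks and clipped away from 0 and 1), the assumed rates, and the final z-test are illustrative choices on my part, not a recommended design.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_trial(adaptive, p_ctrl=0.30, p_trt=0.45, n_total=300, block=30):
    x, n = np.zeros(2), np.zeros(2)        # responders and patients: [control, treatment]
    p_true = np.array([p_ctrl, p_trt])
    alloc = np.array([0.5, 0.5])
    for _ in range(n_total // block):
        arms = rng.choice(2, size=block, p=alloc)
        for a in arms:
            n[a] += 1
            x[a] += rng.random() < p_true[a]
        if adaptive:
            # Allocation proportional to the posterior probability the treatment
            # arm is better (Beta(1,1) priors), clipped to keep both arms open
            draws = rng.beta(1 + x, 1 + n - x, size=(4000, 2))
            p_trt_best = (draws[:, 1] > draws[:, 0]).mean()
            alloc = np.clip(np.array([1 - p_trt_best, p_trt_best]), 0.1, 0.9)
            alloc /= alloc.sum()
    # Final pooled z-test for a difference in response rates
    p_hat, p_bar = x / n, x.sum() / n.sum()
    se = np.sqrt(p_bar * (1 - p_bar) * (1 / n[0] + 1 / n[1]))
    return (p_hat[1] - p_hat[0]) / se > 1.96

for adaptive in (False, True):
    power = np.mean([run_trial(adaptive) for _ in range(2000)])
    print(f"adaptive={adaptive}: empirical power ~ {power:.2f}")
```

Plug in your own scenario to see how shifting allocation away from 1:1 changes the information available for the treatment-control comparison.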

In multiple-arm settings you need not make this tradeoff. If you are interested in identifying the best active arm among several and then comparing that arm to control, lowering allocation to poorly performing active arms allows you to increase allocation to both the best performing arm and control. You thereby gain information on both sides of your desired comparison. This increases the power of a study and increases the chance of choosing the correct best active arm. Note if power is a concern, you want to be careful not to select a variant of adaptive randomization which decreases control allocation over time. Only lower allocation to poorly performing active arms, always maintaining or increasing control allocation.

Both adaptive randomization and arm dropping can be implemented in a variety of ways. When do you start adapting? How aggressively? How often? The optimal way to perform each is currently an open research question. In the multiple-arm setting, each provides a clear advantage over non-adaptive trials.
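As one concrete (and purely illustrative) way to implement the multiple-arm version, the sketch below reweights the active arms at an interim by their posterior probability of being the best active arm, while holding the control allocation fixed. The 25% control share and the flat Beta(1,1) priors are assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(3)

def update_allocation(x, n, control_share=0.25, n_draws=10000):
    """Reweight active arms by P(best active arm); arm 0 is control.

    x, n: responders and patients per arm at the interim.
    The fixed control share and Beta(1,1) priors are illustrative."""
    x, n = np.asarray(x, float), np.asarray(n, float)
    draws = rng.beta(1 + x, 1 + n - x, size=(n_draws, len(x)))
    best = draws[:, 1:].argmax(axis=1)             # best active arm in each posterior draw
    p_best = np.bincount(best, minlength=len(x) - 1) / n_draws
    active_share = (1 - control_share) * p_best / p_best.sum()
    return np.concatenate([[control_share], active_share])

# Example interim data: control plus three active doses
print(update_allocation(x=[12, 10, 16, 22], n=[40, 40, 40, 40]).round(2))
```

In practice these weights are often tempered, for example raised to a power less than one, and combined with minimum allocation floors; that tuning is exactly the “how aggressively” question above.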

 

3) For adaptive designs that drop arms (interventions or doses were given as examples) for “doing badly,” how is “doing badly” evaluated? Is this by looking at early primary outcome data (presumably therefore not fully powered), intermediate or surrogate outcomes, …? If primary outcomes, it would seem that stringent statistical approaches are required, and that either power to drop arms midway through would be low (if little alpha “spent” in interim evaluations) or the overall sample size larger (if substantial alpha “spent”).

An excellent question with lots of parts. “Doing badly” is a designer choice, typically based on either a comparison with control or a comparison to the other arms in the study. You might drop arms whose p-value compared to control is not sufficiently small, or, within a Bayesian paradigm, drop arms that have a limited chance of being superior to control or a limited chance of being the best active arm among the arms remaining in the study. The thresholds for each of these measures may differ and are typically “tuned” as part of the design process to obtain optimal operating characteristics.
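As a sketch of the Bayesian version of these rules, the code below computes, for each active arm at an interim, the posterior probability of beating control and the posterior probability of being the best active arm, and flags arms that fall below a threshold on either. The 10% and 5% cutoffs, the flat priors, and the example counts are placeholders for values that would be tuned by simulation during the design process.

```python
import numpy as np

rng = np.random.default_rng(4)

def drop_flags(x, n, min_p_superior=0.10, min_p_best=0.05, n_draws=20000):
    """Flag active arms whose posterior probability of beating control, or of
    being the best active arm, falls below a threshold. Arm 0 is control."""
    x, n = np.asarray(x, float), np.asarray(n, float)
    draws = rng.beta(1 + x, 1 + n - x, size=(n_draws, len(x)))
    p_superior = (draws[:, 1:] > draws[:, [0]]).mean(axis=0)
    p_best = np.bincount(draws[:, 1:].argmax(axis=1),
                         minlength=len(x) - 1) / n_draws
    for k, (ps, pb) in enumerate(zip(p_superior, p_best), start=1):
        drop = (ps < min_p_superior) or (pb < min_p_best)
        print(f"arm {k}: P(beats control)={ps:.2f}, "
              f"P(best active)={pb:.2f}, drop={drop}")

# Example interim data: control plus three active arms
drop_flags(x=[12, 10, 16, 22], n=[40, 40, 40, 40])
```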

These decisions are usually made on primary outcomes when those outcomes are available. If the primary endpoint is significantly delayed, we would consider an alternative endpoint for adaptation, often an earlier version of the same endpoint. For example, with a 2-year endpoint we may have limited primary outcome data at an interim, but significantly more 6-month or 1-year measures of the same endpoint. Taking care to be sure these earlier measures are sufficiently correlated with the primary endpoint, we might drop arms based on those measures or on biomarkers. If they aren’t predictive of the primary endpoint, then we cannot drop arms effectively.

These earlier comparisons are definitely not fully powered, but they can still be accurate, particularly for futility and arm dropping. Suppose we are running a 200 patient trial and need an 8% gain in a rate to declare efficacy for a novel therapy. Suppose, 120 patients in, we see no effect (0%). We’ve got 80 patients left and 120 patients’ worth of 0%. To reach an 8% effect at the end of the trial, those 80 future patients must show a 20% effect to average out to 8% overall. Given there is no evidence of any effect whatsoever so far, it is very unlikely we will see 20% in the remaining 80 patients. It’s reasonable to stop the trial for futility or, if this is just one arm among many, to drop this arm. Such a situation is common: if our therapy or arm in truth does nothing, this kind of outcome or worse occurs about half the time, so such a futility rule has a decent chance of saving us those 80 patients.
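Writing out the arithmetic from that example:

```python
# How big an effect would the remaining patients need to show for the
# final estimate to reach the 8% target, given 0% observed so far?
n_total, n_observed = 200, 120
target_effect, observed_effect = 0.08, 0.00

n_remaining = n_total - n_observed
required_future = (n_total * target_effect - n_observed * observed_effect) / n_remaining
print(required_future)   # 0.20: the last 80 patients would need a 20% effect
```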

A fully powered trial is aimed at achieving high accuracy under both the null and alternative hypotheses, even in the presence of bad luck. We don’t want to be deceived by a random high when the therapy doesn’t work, nor by a random low when the therapy works well. But bad luck doesn’t always happen. We need the full sample size to account for the full array of possible data sets, but many data sets provide a conclusive answer earlier. We can act on those data when they appear, saving our longest trials for the cases where the early data are more uncertain.

 

4) Global rank endpoints

We received an email question about global rank endpoints. These composites take multiple clinical endpoints and combine them into a single analysis built from pairwise comparisons of patients. As a simple example, suppose two endpoints of interest are mortality and 6-minute walk. Mortality is more important, so we rank these as mortality first and 6-minute walk second. We observe both endpoints for all patients. We then consider every pair of patients. If one died and the other lived, the survivor is the better of the pair. If both patients survived, we move on and compare 6-minute walk times, again determining which patient did better. For every pair of patients, we identify the better performing patient, defined as the patient with the better result on the first non-tied endpoint. You then analyze all the pairs with rank-based statistical tests.
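Here is a small sketch of that pairing logic on made-up data (the counts and the tuple representation are just for illustration); the resulting win/loss/tie totals are what feed the rank-based analysis.

```python
import itertools

def compare(patient_a, patient_b):
    """Return +1/-1/0 using the first non-tied endpoint.

    Each patient is (survived, 6-minute walk change), with endpoints ordered
    by clinical importance: mortality first, walk distance second."""
    for a, b in zip(patient_a, patient_b):
        if a != b:
            return 1 if a > b else -1   # +1: patient_a better on this endpoint
    return 0                            # tied on every endpoint

# Toy data: (survived, change in 6-minute walk distance in meters)
treatment = [(1, 30.0), (1, 12.0), (0, 0.0)]
control = [(1, 5.0), (0, 0.0), (0, 0.0)]

wins = losses = ties = 0
for t, c in itertools.product(treatment, control):
    result = compare(t, c)
    if result == 1:
        wins += 1
    elif result == -1:
        losses += 1
    else:
        ties += 1

print(wins, losses, ties)   # these counts feed the rank-based test
```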

These endpoints are valuable in terms of incorporating multiple endpoints into one analysis, providing more information than either endpoint would provide individually. They represent a way to increase power, particularly if a treatment has modest consistent effects on a variety of endpoints, as these effects are partially combined in the final global rank composite.

They do come with limitations. You have to be clever to incorporate multiple continuous endpoints, as they will rarely be tied and hence you may never get to lower-ranked endpoints. In addition, there are implied weightings between the endpoints. In our example above, you might be implying that a 10% mortality difference with no change in 6-minute walk is equivalent to no difference in mortality and a 15% better chance of having an improved 6-minute walk (these create the same probabilities of a paired comparison being in favor of a given treatment). You should know these implied weightings prior to starting the trial. With a utility approach, in contrast, you would specify these tradeoffs directly as part of the trial design process. Global rank endpoints may also have difficulty when there is no obvious ranking of the endpoints. In Duchenne Muscular Dystrophy, endpoints like 6-minute walk are used while patients are still ambulatory, and alternative endpoints are required to investigate the disease course after patients are unable to walk. If we wanted to enroll patients across the full range of the disease, we would need a way of combining those endpoints but could not declare one endpoint more important than the other. Disease progression models are applicable in this context, where changes in each endpoint are weighted based on the individual patient’s status.

 

5) I am interested in the historical borrowing innovation… Do you need individual control patient data from similar trials? Could you direct me to additional resources where I could learn more?

This is an area of active research. You can use individual patient data or summary data depending on the context. If your endpoint is dichotomous, for example, individual patient data is most valuable, but summary rate data can also be used. Time to event data tends to be much harder to use, as standard summary data is typically not sufficient (meant both informally and in the formal statistical sense of the word) to recover the needed information for historical borrowing.
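As a small example of borrowing from summary rate data, the sketch below folds a historical control rate into a Beta posterior for the current control arm using a fixed discount weight, in the spirit of a power prior. The Beta(1,1) starting prior, the 0.5 discount, and the counts are my assumptions; in practice the amount of borrowing is often determined dynamically by how well the historical and current data agree.

```python
def borrowed_control_posterior(x_cur, n_cur, x_hist, n_hist, a0=0.5):
    """Beta posterior for the control rate with historical summary data
    (x_hist responders out of n_hist) down-weighted by a0 in [0, 1]
    (a0 = 0: ignore history; a0 = 1: pool completely).

    Starts from a flat Beta(1, 1) prior; a0 is fixed here for simplicity."""
    alpha = 1 + a0 * x_hist + x_cur
    beta = 1 + a0 * (n_hist - x_hist) + (n_cur - x_cur)
    return alpha, beta

# Made-up example: current trial 18/60 control responders, historical 70/200
alpha, beta = borrowed_control_posterior(18, 60, 70, 200, a0=0.5)
print(f"posterior mean ~ {alpha / (alpha + beta):.3f}, "
      f"roughly {alpha + beta - 2:.0f} patients' worth of information")
```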

You definitely want similar trials. One of the key ideas here is “drift”, meaning the difference between the observed historical values and the true control rate in your trial. If you have trials from 30 years ago where 60% of patients survived on a therapy, and currently closer to 80% of patients survive with modern improvements to that therapy, then you have a lot of drift and borrowing is not advisable. You want historical trials that are contemporary and have similar inclusion/exclusion criteria. Some of these criteria are found in Pocock (1976) (PUBMED) and this remains an active research area, particularly with the advent of pharmaceutical cooperatives like Transcelerate that share historical data among member companies.

The historical trials should be close but don’t have to be identical. Typically, there is a “sweet spot” of similarity where you gain in both estimation and testing accuracy. This sweet spot must be assessed on a case-by-case basis and is a function of the sample sizes of the current and historical trials and the exact method used for borrowing. Some methods have a broad sweet spot that delivers smaller benefits, while others have a narrow sweet spot with large benefits. Trials with smaller sample sizes, such as rare disease or early phase trials, have a larger sweet spot because of the inherent variability in the current data. If the historical data doesn’t match the current trial, you get biases and poorly performing tests. You can find a detailed review of these issues in the DIA working group review paper on historical borrowing (PUBMED) and in the recent paper by Lim et al. (2018) (PUBMED).

The methodology in this area is generally accepted, with the selection of historical data being the more controversial piece. All sorts of biases can exist in the publication and selection of historical data. If you are going to perform this kind of trial, you should make sure you have someone experienced lay out the benefit/risk tradeoff. The single largest mistake I see in this area is assuming that the historical data will be exactly on point, with no assessment of the potential pitfalls if there is a mismatch.

 

Again, our appreciation for all who attended our first webinar!

– Kert Viele