A Statistician reads JAMA

Jun 30, 2025 | 39 min | Ep. 18

Episode description

Dr. Scott Berry gives a statistician’s review of a randomly chosen trial result published in JAMA, the FAIR-HF2 clinical trial, interrogating the frequentist paradigm and its focus on the binary outcome of the primary hypothesis test. He scrutinizes the Hochberg multiplicity adjustment, challenges the prevailing disregard for accumulated scientific evidence, and highlights the limitations of a black-and-white view of a clinical trial with over 1,000 patients and 6 years of enrollment. He contrasts this with what a potential Bayesian approach, grounded in practical trial interpretation and evidence integration, would look like. The episode argues that current norms in clinical trial analysis, created by dogmatic statistical views, can obscure meaningful findings or even mislead readers about them, and limit the utility of costly, complex studies.

Key Highlights

  • FAIR-HF2 randomized 1,105 patients with heart failure and iron deficiency to intravenous ferric carboxymaltose or placebo across 70 sites, with three pre-specified co-primary analyses.
  • The study relied on the Hochberg procedure to control family-wise error across analyses: (1) time to first cardiovascular death or heart failure hospitalization; (2) total heart failure hospitalizations; (3) time to first event in a highly iron-deficient subgroup.
  • Results showed a favorable hazard ratio (0.79) and a p-value below 0.05 for primary composite 1, but statistical significance was nullified under Hochberg multiplicity criteria as other endpoints failed threshold requirements.
  • Berry challenges the reduction of trial outcomes to discrete “significant” or “not significant” designations—critiquing the scientific and statistical culture that ignores gradient evidence in favor of only black-and-white outcomes.
  • He details the likelihood principle and Bayesian analysis as superior frameworks, quantifying a 98% posterior probability of benefit; he contextualizes findings with prior evidence from the HEART-FID, IRONMAN, and AFFIRM-AHF trials and published meta-analyses—arguing that isolated, negative conclusions defy cumulative data.
  • The discussion extends to the inefficiency of fixed trial designs, the missed value in adaptive methodologies, and the inefficacy of requiring full-scale repeat trials all analyzed in isolation, when evidence already points strongly to a beneficial effect.

Transcript

Judith

Welcome to Berry's In the Interim podcast, where we explore the cutting edge of innovative clinical trial design for the pharmaceutical and medical industries, and so much more. Let's dive in.

Scott Berry

Well, welcome back to In the Interim. I'm Scott Berry, I'm your host. I have an interesting topic for today. I used to write a quarterly column for CHANCE Magazine called A Statistician Reads the Sports Pages, consuming and talking about things that show up in sports and the statistical analysis of them. I did that for about 10 years, actually, and it was very rewarding.

So I'm gonna do that today, and this is A Statistician Reads JAMA. I'm gonna tell you about an experience of opening JAMA and giving a read to a relatively random clinical trial that I read the results of. It sparked a number of things I thought very much worth discussing. By the way, for those of you out there, whether you're driving to work in the morning or the afternoon, you're out for your daily run, or however you consume

In the Interim, I'd love to hear from you. Let me know what kind of topics you would like to talk about, and if there are particular people you'd like on In the Interim. Here at Berry Consultants, we have a company of about 35 scientists, mostly statistical scientists. We work on clinical trial design. We work on implementing adaptive trials. We have software for simulating trials.

We do a wide range of therapeutic areas, and yes, we focus on innovative trial designs, adaptive trials, Bayesian statistics, and so I'd love to hear what kind of topics you'd like to hear about on In the Interim. So here we go. This really happened. I got an email from JAMA, and in there they list different articles and results of different trials.

And unfortunately I don't get to read these as much as I would like, but one showed up and I thought, you know what, I'm just gonna read one of these and see what these trials look like. So this trial is the FAIR-HF2 trial. The primary author is Anker, so Anker et al., and it says published online March 30th, 2025 in JAMA, as an original investigation.

So if at all what I talk about is interesting, yes, please go check out the trial. I'm not involved in the trial. I had nothing to do with it, and I only vaguely know a couple of the authors, so I don't have any involvement in this trial. So what is the question in the trial? It says this right in JAMA: what are the efficacy and safety of intravenous ferric carboxymaltose?

I'm going to refer to this as intravenous iron supplement. The paper refers to it that way. So this is in patients with heart failure and iron deficiency. So is this intravenous iron supplement, for patients with heart failure who have an iron deficiency, effective and safe? That's the question in the trial. Okay, so very interesting. So the design, the settings, and the participants: it's a multicenter trial. It's randomized one-to-one with the iron supplement.

In there are patients with heart failure, defined as having an LVEF, left ventricular ejection fraction, less than or equal to 45%, and having an iron deficiency. If you want to know what the levels of iron deficiency are, I'll let you go to the paper to see that. By the way, they define a particular highly deficient group, and I'll say something about that as well. So everybody's iron deficient, and there's a particular subgroup that is highly deficient.

It enrolled at 70 clinical sites in six European countries, from March 2017 to November of 2023. Median follow-up of patients was 16 months in the trial. Okay, so again, for any of the details, please read the article. So it sounds very interesting. The trial enrolled 1,105 patients, so a rather large trial from that perspective, a large and long trial. The primary endpoint in the trial is a bit interesting, and I'll try to lay this out a little bit.

So the primary endpoint is looking at cardiovascular death and heart failure hospitalization, very common endpoints in heart failure trials. It's going to analyze three different primary endpoints, where that's an endpoint tied to an analysis in the trial. So the first one is the time to first cardiovascular death or heart failure hospitalization.

Again, a very common way to analyze in heart failure trials is a time to event, so they're doing standard time to event analyses for that endpoint. Additionally, they're doing an analysis of the rate of total hospitalization. So this is, a patient could have multiple heart failure hospitalizations, and they're using rather standard analysis techniques of that for count data, negative binomial analysis for count data, depending on the exposure the patients have.

The third one is analyzing the first endpoint that I described, which is time to first cardiovascular death or hospitalization, but it's restricted to that subgroup that I described that are highly deficient. So those are the three analyses that are gonna define the primary set of analyses in the trial. They're gonna control the overall family-wise experimental error rate across all three of those analyses.

So sort of co-primary, if you will, in the sense that any one of those could potentially be successful and they adjust for that. Okay, so they use a Hochberg procedure for analyzing those three, I'll call them endpoints. And I struggle with this a little bit, because I like to think of the endpoint as time to heart failure hospitalization or cardiovascular death, and how you analyze it or the subgroups aren't really the endpoint, but I'll describe it that way.

I think it'll be easier to describe it. So they are analyzing these three analyses, and they've set up a procedure where they will refer to this as statistically significant if any one of the following three things happens. The first is if all three of those are significant. I was going to do it in the one-sided sense, since this is all about superiority, but they describe it in the paper two-sided. So I'll do it two-sided, sorry.

So if all three are less than 0.05, two-sided, then the trial's statistically significant and they've demonstrated superiority. The second opportunity is if two of them are significant at 0.025, so that's half of the original. If two of them meet 0.025, then the trial demonstrates statistical significance. And if any one of them is significant at 1.67%, two-sided, then it demonstrates statistical significance.

So this is sort of three shots on goal, where they're looking at three different analyses. Again: time to cardiovascular death or hospitalization, the number of heart failure hospitalizations, and a subgroup where they analyze time to first cardiovascular death or heart failure hospitalization. Okay, so the SAP is published as part of it. It's a very well written SAP. The design is reasonably standard.

I found no evidence of any adaptations in the trial, so they enrolled 1,105 patients and carried out the primary analysis. I'm sure there was a DSMB and safety was reviewed, things like that, but a very traditional trial, and a good trial. I'm going to talk about the publication of this and the results of this. The authors should be commended on this trial, and the patients involved in the trial, they deserve praise.

So I hope in no way does this come across negative for the people who ran, conducted, and published this trial. But I want to dive into the science of it. I wanna push a little bit on the science of it, and I want to give my reading as a statistician who randomly picked up this article and read it. Okay. So that's the structure and setup. What happened in the trial? So the trial enrolled 1,105 patients.

Again, it was randomized, double blind. In the first analysis, time to cardiovascular death or heart failure hospitalization, they report the number of events per hundred patient-years, just as a way to summarize it in the paper. In the treatment group it is 16.7 and in the placebo group it's 21.9 per hundred patient-years, so 16.7 to 21.9. 141 of the 558 patients on the treatment had an event, and 166 of 547 on placebo.

Reasonably similar sample sizes, so 141 and 166. The hazard ratio in the time-to-event analysis is 0.79. The two-sided P value is 0.04; one-sided would be 0.02 for superiority. The hazard ratio of 0.79 is showing the treatment did better. And notice that that endpoint by itself doesn't meet significance in the setting of the Hochberg procedure. If that would've been the lone primary analysis, it would be statistically significant. Now what happened to the other endpoints? It met 0.05.

So if all three meet 0.05, the trial will be considered significant. The total heart failure hospitalizations analysis had 264 heart failure hospitalizations in the treatment group and 320 in the placebo group. Now that's adding across patients. It matters how many patients have 0, 1, 2, 3, and so on. The relative risk in that analysis is 0.80. Again, a benefit for the treatment, a 20% relative risk reduction in heart failure hospitalizations. The two-sided P value is 0.12.

The third, which analyzed this subgroup of patients that met this high need, and this was the endpoint of cardiovascular death or heart failure hospitalization, time to first event, showed a hazard ratio of 0.79 also. Exactly the same as the primary analysis. The confidence interval's a little wider and the P value's 0.07. So let's think back to the Hochberg procedure. Do all of them meet 0.05? They don't. The first one did, the other two didn't.

Do two of them meet 0.025? No, actually none of them meet 0.025, and none of them met 0.0167. So according to the primary analysis methodology, the controlling of the experiment-wide type one error rate, this trial is not significant. Statistical significance was not shown. Wow. Again, a very interesting result, in that time to cardiovascular death or heart failure hospitalization showed a significant P value.
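The step-up rule just described can be sketched in a few lines. This is a minimal illustration, assuming the three two-sided P values quoted in the episode (0.04, 0.12, 0.07) and an overall two-sided alpha of 0.05; the function name `hochberg` and the endpoint labels are made up for this sketch and are not taken from the trial's SAP.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Indices ordered from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    rejected = [False] * m
    for step, idx in enumerate(order):
        # Largest p is compared to alpha, the next to alpha/2, then alpha/3, ...
        threshold = alpha / (step + 1)
        if p_values[idx] <= threshold:
            # This hypothesis and all with smaller p-values are rejected.
            for j in order[step:]:
                rejected[j] = True
            break
    return rejected


# P values as quoted in the episode (labels are illustrative only).
reported = {
    "time to first CV death or HF hospitalization": 0.04,
    "total HF hospitalizations": 0.12,
    "time to first event, highly deficient subgroup": 0.07,
}
decisions = hochberg(list(reported.values()))
for name, is_rejected in zip(reported, decisions):
    print(f"{name}: {'significant' if is_rejected else 'not significant'}")

# The same 0.04 as a lone primary analysis would be declared significant.
lone_primary = hochberg([0.04])
```

With these inputs no hypothesis is rejected, while the same 0.04 tested alone passes, which is the contrast the episode goes on to discuss.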

So for example, the confidence interval shown goes from 0.63 to 0.99. That's the 95% confidence interval. So what is the conclusion in the trial?

So I'm reading this, I'm looking at the data, I'm looking at the results. The conclusion and relevance for the paper: in patients with heart failure and iron deficiency, iron supplement did not significantly reduce the time to first heart failure hospitalization or cardiovascular death in the overall cohort or in patients with transferrin saturation less than 20%, or reduce the total number of heart failure hospitalizations versus placebo.

And in the little figure they show that talks about the population, it's really very nice, it's the cartoon of the article, the conclusion says that iron supplement was well tolerated, but did not significantly improve outcomes compared with placebo in patients with heart failure and iron deficiency. Okay. So what does a statistician make of this? And by the way, this is just me as the statistician. I'm sure other statisticians have a very different reaction than me.

So I read this and I really, really struggled with this on several points. And so let me sort of dive into the points. The first part about this is just the scientific struggle I have that we report trials as black and white. They're significant or they're not. And in this trial, the only reason it's not significant is because that overall cohort on cardiovascular death and heart failure hospitalization was part of a multiple testing procedure.

So it was incredibly close to being significant, but it wasn't. The conclusion of this trial is that iron supplement doesn't help patients, doesn't change the clinical outcome of patients. A Bayesian analysis, say, of that primary endpoint, assuming a non-informative prior, would say there's a 98% probability that iron supplement benefits on time to cardiovascular death or heart failure hospitalization. So any way you want to read this, there's gray area to this.
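A figure of about 98% can be reproduced with a back-of-the-envelope calculation. The sketch below is an assumption-laden approximation, not the analysis from the paper or the episode: it treats the log hazard ratio as normally distributed, recovers its standard error from the quoted 95% confidence interval (0.63 to 0.99), and uses a flat prior so that the posterior is centered on the estimate. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_benefit_flat_prior(hr, ci_low, ci_high, z_crit=1.959964):
    """Approximate posterior P(true HR < 1) under a flat prior on log HR.

    Assumes a normal likelihood on the log scale and a symmetric 95% CI.
    """
    log_hr = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    # Posterior for log HR is Normal(log_hr, se^2); benefit means log HR < 0.
    return normal_cdf((0.0 - log_hr) / se)


p_benefit = prob_benefit_flat_prior(hr=0.79, ci_low=0.63, ci_high=0.99)
print(f"Approximate posterior probability of benefit: {p_benefit:.3f}")
```

Under these assumptions the posterior probability of benefit is one minus the one-sided P value, which is why a one-sided P of about 0.02 lines up with roughly 98%.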

Every trial typically has gray area to it, but yet when we publish it, it's all or nothing. In this scenario, the conclusion is the same as if the data were identical in the two groups and the hazard ratio was one. Even if it showed harm, the conclusion would be the same, that the treatment doesn't benefit. Really, conclusions say one thing: the treatment benefited or it didn't.

I think it's a gross simplification of a six-year trial of 1,100 patients, but I understand how we got there. And by the way, statisticians share some blame in how we got there. We reinforce that hypothesis testing and type one error of 5%, and you can't say anything if you don't reject a null. And that's the way we analyze trials. In this trial, I bet if you flip three deaths, it meets the Hochberg procedure.

And our conclusion to clinicians reading this article would be that it benefits. It's either it doesn't benefit or it benefits. Now we hope anybody reading this dives into the data, looks at the results, thinks about other trials, thinks about the treatment, and makes a decision based on it. But I have to believe the conclusion from JAMA makes a huge indent into any clinician reading this article.

So the black and whiteness of trials, scientifically, I really struggle with it, and I accept that as a member of the statistical community, we're probably partly to blame for this dogmatic approach to hypothesis testing. It's black and white, and we're gonna come down hard on you if you interpret it any other way than that.

Okay. And now for those statisticians out there, I have nothing against the Hochberg procedure, and I know that we design trials, we do FDA trials, where 5%, or 2.5% for one-sided tests, is the standard. And we live by that, and we do Hochberg procedures. So I don't have anything against that, but I really struggle with the likelihood principle aspect of this. The data in this scenario is exactly the same as another scenario.

The data are identical where this paper says it's statistically significant and this iron supplement benefits patients with heart failure and iron deficiency. That 0.05 trial had the exact same data set, and the likelihood principle says that if we have exactly the same data in two different scenarios, our conclusions should be the same. If you're a Bayesian, the posterior probability is identical for those two trials.

The Bayesian machinery satisfies the likelihood principle by the nature of Bayes' theorem. So I really struggle with this part of it, and I get it from a type one error scenario. Being a Bayesian, I don't think type one error is the be-all, end-all, and it flips this result for clinicians reading it, and I really struggle with that.

Okay. Now, given the rules we play by, and they knew the rules they played by, they wrote them in this article, so the SAP lays out superiority. They knew that they could end up in this situation. So if somebody simulated this trial and showed them this result, and by the way, that's a huge value of simulation, if they saw that result and said, good, that's the result we want, great. But I'd be really surprised if that's the result they wanted. In this trial, there are no adaptations.

Could this trial have been adaptive? Could we have seen that result? Could it have been bigger? Now, I recognize this was a six-year trial, and it may be that, given the funding, it couldn't have been bigger and this is just the way it is. And then the investigators would say, yes, that's the result we want. But I think it's one of those scenarios: had this trial been six months longer, had it enrolled 200 more patients, would it have changed clinical practice?

Does this paper change clinical practice? Should it change clinical practice? Would it have changed clinical practice if the trial were six months bigger, 200 patients bigger? That's a little bit of the struggle here. And so I hate to say it, but is 1,100 patients in six years wasted, in a way, because we do black and white?

Now we shouldn't do black and white, and I'll talk more about what would be shades of gray, but these are the rules that everybody plays by at this point. Journals play by this. We kind of know the rules going in. Could it have been adaptive? The other struggle I have is that this is about science. This is about recommending treatments to patients that could potentially benefit.

And I would think if the truth of this is a hazard ratio of 0.8 on heart failure hospitalization and cardiovascular death, this is a clinically important treatment. So we only analyze data in the trial. We are stuck on that, by the way. I think that's reflective of frequentist approaches. We analyze the data in the trial. We calculate the probability of data as extreme or more extreme than what we saw, assuming the null. That's the P value, 0.04.

But there's science here, and we typically know more. Part of my struggle is what's next. This trial, if you're into giving adjectives, is borderline significant. It's very close to being statistically significant. Do we need another trial that's six years and 1,100 patients, and get the P value of that next trial below 0.05 or below the Hochberg thresholds, or would 200 patients

potentially do it? I said, suppose that trial was bigger by 200 patients or 300 patients, would it change clinical practice? But no, we can't do that. The next trial designed would only analyze that trial. There's something incredibly frustrating about that. Now I want to give you a different potential scenario. Suppose this was a novel treatment. I assume that nobody owns the rights to this. Nobody has patent life on an iron supplement.

And this is all about treating patients. But suppose this was a novel treatment in heart failure, and a company ran exactly this trial, and they don't get significance. And they go to a regulatory agency, they go to EMA, they go to PMDA, they go to the US FDA, and the agency says, you know, we just can't approve it. It's not enough. Based on that trial, does that company need to run another trial of 1,500 patients? Can we say, look, there's information there?

The next trial doesn't need to be as big. We're really close to approving this, but we just can't do it. And there are examples of this. Do we start over? It seems like bad science. Now, the FDA is absolutely doing this. There are scenarios where they use the results of one trial combined with the results of another trial. You can look up Rebyota. It was approved in 2023. Ferring Pharmaceuticals. There are multiple devices that have been approved this way.

There are multiple scenarios that I know of, where we're working with the agency or have designed trials, where it uses the previous results, recognizing that, boy, it's really close. We shouldn't need 1,500 patients after this for approval. So I want you to think about the medical community in that scenario, where we're so focused on single trials. What about combining the results together?

With all of this, my biggest issue as a statistician reading it is I don't believe the conclusions. I think they're wrong. Now, maybe it's just me, but I think this treatment works, and let me give you a little bit of why. So when I read the article, I see that, first of all, the statistics is compelling to me. That's highly likely just based on the trial. If we're stuck on only this trial, a 98% probability for a really clinically important event is really valuable.

In a scenario like this, this isn't about an FDA approval, which has its own regulatory standards, and I know we've gotta go by those, but this is about the next patient that walks in the door. If it were me, I want iron supplement. I think it works. It's highly likely to work. But the other thing I thought is, okay, you know, we do get type one errors. We do get scenarios where we get a hazard ratio like this and a confidence interval, and the treatment doesn't work.

So, as a statistician, I wanna know what other information is out there and what we know about this. Well, there have been previous large trials run. In the article you can't really find much about them. There's a little bit in the JAMA article, so I don't want to say there isn't, but it talks about the reason for the trial and the remaining uncertainty about this.

And so there is a trial, the HEART-FID trial, that looked at time to cardiovascular death or heart failure hospitalization. All three of the trials I'm gonna tell you about that have already been published use that same endpoint, time to first cardiovascular death or heart failure hospitalization. And that trial had a hazard ratio of 0.93 for that endpoint, with a confidence interval that went up to 1.06. So that probably had a P value something like, one-sided, 0.1.

Two-sided, you know, maybe 0.15, and maybe a one-sided 0.075. I didn't go find the article, but I found the summary of that. So a hazard ratio of 0.93. The IRONMAN trial, a great name for a trial with an iron supplement, had a hazard ratio of 0.84 for that same endpoint, where the upper bound of the confidence interval is 1.02.

Borderline significant, but that trial did not demonstrate clinical benefit, with a hazard ratio of 0.84. So 0.93, 0.84, and the AFFIRM-AHF trial had a hazard ratio of 0.80, with a confidence interval that went up to 0.98. They also report heart failure hospitalizations for three of them, I'm sorry, for two of them: 0.80 and 0.74. That's the other primary endpoint in the FAIR-HF2 trial. So walking into this trial, we've got three other trials that demonstrate 0.93, 0.84, 0.80

for hazard ratio for that primary endpoint. All positive, one of them significant, one of them borderline significant, another one with 1.06 for the upper bound, and now this trial: the hazard ratio for that endpoint is 0.79 with an upper bound of 0.99. With that information altogether, as the read of a statistician, this treatment works. We are sitting there with four trials all analyzed separately. There is a meta-analysis published.
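To make the idea of reading the four results together concrete, here is a rough fixed-effect (inverse-variance) pooling sketch. It is not the published meta-analysis and not a calculation from the episode. It uses only the hazard ratios and upper bounds quoted above, and it assumes each interval is a symmetric 95% interval on the log scale, that the trials estimate a common effect, and that between-trial heterogeneity can be ignored. The confidence level of the earlier trials' intervals is not stated in the episode, so the 95% assumption matters.

```python
import math

Z_95 = 1.959964  # assumed: every quoted interval is a two-sided 95% interval

# (hazard ratio, upper confidence bound) as quoted in the episode.
TRIALS = {
    "HEART-FID": (0.93, 1.06),
    "IRONMAN": (0.84, 1.02),
    "AFFIRM-AHF": (0.80, 0.98),
    "FAIR-HF2": (0.79, 0.99),
}


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def pool_fixed_effect(trials):
    """Inverse-variance pooling of log hazard ratios.

    Equivalent to a Bayesian update with a flat prior and normal likelihoods.
    Returns (pooled log HR, pooled standard error).
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, upper in trials.values():
        log_hr = math.log(hr)
        # Standard error recovered from the distance to the upper bound.
        se = (math.log(upper) - log_hr) / Z_95
        weight = 1.0 / se**2
        total_weight += weight
        weighted_sum += weight * log_hr
    return weighted_sum / total_weight, math.sqrt(1.0 / total_weight)


pooled_log_hr, pooled_se = pool_fixed_effect(TRIALS)
pooled_hr = math.exp(pooled_log_hr)
prob_benefit = normal_cdf(-pooled_log_hr / pooled_se)
print(f"Pooled HR: {pooled_hr:.2f}")
print(f"Approximate probability of benefit: {prob_benefit:.4f}")
```

A random-effects model, or the actual published meta-analysis, could give a different answer. The sketch only shows why four estimates pointing the same way carry more weight together than any one of them alone.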

Does it move clinicians? So I don't know the answer to that. Other people can tell me that. But if I randomly pick up this article and I read it, it says this treatment doesn't work. And JAMA is an incredible journal. I know multiple editors for JAMA. They do an incredible job with it. And I know exactly how we got here, and this is not unique. We see this commonly. I just struggle with it.

I don't think the conclusion is right. I think it's actually highly likely the conclusion is wrong. Do people read the conclusions? So that's my struggle. So how could the world be different? You know, okay, Scott, so what would be different here? Can we propose different ways to do this? Well, let's suppose we didn't think of the trial as being this black and white,

where we do a significance test, and if it's significant, we all wave flags and we all celebrate and we publish it and it changes clinical practice, and if it's not significant, it doesn't change clinical practice. What would be different? What if the trial reported the posterior probability that the treatment is superior to the control, and it doesn't put an adjective on it? It doesn't say significant, borderline significant, highly significant, three asterisks on it.

We don't have to do that. 98% is an adjective and it allows somebody to consume the data if they want to only look at that trial. Okay? 98% probability for the primary endpoint of superiority. By the way, that satisfies the likelihood principle in that scenario, and we're overly stuck on type one error that it's significant.

What if that's the report, and the little cartoon in the front of JAMA says this trial demonstrated a 98% probability that it benefits time to first cardiovascular death or heart failure hospitalization? Now, the first thing a frequentist is gonna say is, well, oh yeah, but now you've got a prior for that.

The important part of this is where this sits in the science, and I think the article in the Journal of the American Medical Association shouldn't just report on that single trial. The trial should prospectively define a relatively non-informative prior, so that the data speaks for itself. We know how to do that. It's common that we do that. And here that would give a 98% posterior probability. The trial also prospectively defines a pessimistic prior.

It prospectively defines an optimistic prior. This would be relatively easy to do. It also specifies a prior based on the current summary of scientific information. It uses the meta-analysis that was published, and it says, based on the previous data, and it might even have several of these: based only on trial one, here's what you get, or trial two. So if you don't like trial three and you like trial one, here are several priors. And for somebody that uses those priors,

here's the probability. My guess in this scenario is that this is a 99.9% probability this treatment is beneficial, if you use that summary of that information, updated, which Bayesians do, based on this trial. It allows the reader to judge it on this, and never says it statistically significantly affects the clinical outcome, but says there's a 99.3% probability this benefits, or that this trial demonstrated a 98% probability that it benefits. A pessimistic prior would be 94%.

An optimistic prior is 99.3. I'm making these numbers up, I'm just guessing what they might be. And the prior based on a summary of the other trials and this one together, that's where I think this would be a 99.97% probability of benefit, something like that. As a statistician reading it, I'd be much more comfortable when I read this article that it's providing this advice to clinicians.
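The prior-sensitivity report described here can be sketched with a conjugate normal update on the log hazard ratio. Everything about the priors below is hypothetical: the pessimistic prior is centered on no effect, the optimistic prior on a hazard ratio of 0.8, and both use an arbitrary standard deviation of 0.35 chosen purely for illustration. The resulting numbers are therefore not expected to match the guessed figures in the episode, which the speaker says are made up.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def posterior_prob_benefit(log_hr, se, prior_mean, prior_sd):
    """P(log HR < 0) after a normal-normal conjugate update."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / se**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + data_precision * log_hr) / post_precision
    post_sd = math.sqrt(1.0 / post_precision)
    return normal_cdf(-post_mean / post_sd)


# FAIR-HF2 primary analysis as quoted: HR 0.79, 95% CI 0.63 to 0.99.
LOG_HR = math.log(0.79)
SE = (math.log(0.99) - math.log(0.63)) / (2.0 * 1.959964)

# Hypothetical priors on log HR: (mean, standard deviation).
PRIORS = {
    "flat": (0.0, 10.0),                   # very wide, close to non-informative
    "pessimistic": (0.0, 0.35),            # centered on HR = 1.0
    "optimistic": (math.log(0.80), 0.35),  # centered on HR = 0.8
}

results = {
    name: posterior_prob_benefit(LOG_HR, SE, mean, sd)
    for name, (mean, sd) in PRIORS.items()
}
for name, prob in results.items():
    print(f"{name:>11} prior: P(benefit) = {prob:.3f}")
```

The point of such a table is the ordering and spread across priors, which lets a reader see how much the conclusion depends on what they believed before the trial.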

Whereas right now, when I read that article, I really struggle, and boy, I hope they look at the data, and I hope that they consume all of this information. It's hard in this setting.

So this is my read of a random article whose results sort of stuck with me, and it stuck with me largely as somebody who's spent 25 years in clinical trials, doing publicly funded trials, privately funded sponsor trials, NIH-funded trials, patient-organization-funded trials, comparative effectiveness trials. You know, I struggle with the scientific outcome of this. So we are not in the interim here. We are at the end of the trial.

Thinking of it, maybe this could have been in the interim, and the trial could have been adaptive in the world we live in. But also, thinking about the world we live in, could we do things differently as we're moving forward and we get more and more results? Could this look different? So I am Scott Berry, this is In the Interim, and until the next interim, thanks.

Transcript source: Provided by creator in RSS feed