You are listening to The Operative Word, a podcast brought to you by the Journal of the American College of Surgeons. I'm Dr Tom Varghese, and throughout the series, Dr Lillian Erdahl and I will speak with recently published authors about the motivation behind their latest research and the clinical implications it has for the practicing surgeon. The opinions expressed in this podcast are those of the participants, and not necessarily those of the American College of Surgeons.
Hello, loyal listeners, welcome to another episode of The Operative Word, the podcast of the Journal of the American College of Surgeons. My name is Tom Varghese, and I'm the host of this particular episode, and I am unbelievably honored and thrilled to be joined by two pioneers in the world of surgical research: first, Dr Haytham Kaafarani, who is a professor of surgery at Harvard, as well as Dr Vahe Panossian, who is one of his research associates.
I will turn things over to them to formally introduce themselves, as well as tell us about any disclosures for this episode. Dr Kaafarani, let's start with you. Go ahead. Well, Dr Varghese, first, I'm thrilled. I'm glad. I'm honored to be with you. You're a leader. You're somebody I look up to. Briefly, I'm a professor of surgery at Harvard Medical School. I'm the trauma medical director and the hospital director of safety and quality at Mass General Hospital.
And I'm a trauma surgeon by background. Very excited to be here. I don't have any specific disclosures related to today's talk, but I usually declare three things. I do contribute to UpToDate, and I get an honorarium from that, but it is unrelated to this work. I served in a national leadership role at the Joint Commission, and this work is unrelated to my previous role there. And I do have research grants; none of them are related to this work.
Thanks, Dr Kaafarani. Dr Panossian, go ahead and introduce yourself. Thank you, Dr Varghese. I’m Vahe. I'm a postdoc research fellow at MGH trauma. I have no disclosures. I'm very excited to be here today. Beautiful. Well, thank you both for joining us. The article we're going to be discussing today was published in the October issue of JACS.
And the article specifically is the “Validation of Artificial Intelligence-Based POTTER Calculator in Emergency General Surgery Patients Undergoing Laparotomy: A Prospective, Bi-Institutional Study.” Dr Panossian is the first author of this article, and Dr Kaafarani is the senior author. And they did this on behalf of the POTTER Validation Group. Dr Kaafarani, let's start with you. Tell us, what the heck is POTTER? Let's start there first, and probably talk about AI,
before we deep dive into this article. Go ahead, Dr Kaafarani. Yeah, absolutely, happy to.
You know, I'm just going to start by prefacing: I chose the name POTTER, which is an abbreviation of what it stands for and the methodology we used in AI to do this, but it came from 2017, when my daughter was obsessed with Harry Potter and it was always on my mind. I didn't realize it would completely backfire on me, because it's an application that anybody can download for free on Android or iPhone platforms, and I get zero money out of it.
The problem is, if you search POTTER, it will never show up. There are so many other POTTER applications there, so you've got to put “POTTER calculator” for it to show up. But what it is, in a nutshell: it started as an idea back in 2016, 2017, the very early days of AI. I was working on risk prediction, and I was using the classical methods to do that.
And I was at a dinner with a friend who is a professor of AI at MIT, and we were discussing this, and he says, "I've got methods that will outperform whatever you do in a heartbeat," and it became a challenge. And then it became a project collaboration between Harvard and MIT. And what it is, in a nutshell: for POTTER we used a national database, actually the National Surgical Quality Improvement Program, with data from the entire country.
And we trained an artificial intelligence method called optimal classification trees, similar to how decision trees are made, but think about AI: it's a continuous reiteration with branching points. And we asked the question, with the data from the entire country for emergency surgery: how can we capture, in a nonlinear and interactive fashion, using AI, all the different relationships between variables to try to predict outcomes ahead of time, before the patients have surgery?
In other words, the concept behind it is really, really interesting, and I had to learn it myself. The concept is that the presence or absence of certain variables impacts how much another variable impacts outcome. Let me give a very, very simple example. It's a simplification, Tom, for sure. The question is as follows.
If I have one patient that I'm about to do a colectomy on, and I say the risk of complications is x if they have hypertension, then all the additional risk this patient carries compared with the average person without hypertension is just due to the hypertension. Right? Now imagine that same patient has liver cirrhosis and hypertension.
You can imagine that the relative contribution of hypertension to the complications becomes minuscule, because liver cirrhosis takes over a lot of the risk. Now, this is two variables. Imagine the same concept blown up to multiple variables at the same time, and you can see it becomes very difficult for the human mind to keep track of all of these interactions between variables. So that's what AI does.
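The hypertension and cirrhosis example can be made concrete with a tiny hand-built tree. This is only a minimal sketch: the risk values are invented for illustration and are not POTTER outputs, and `toy_risk` and `added_risk_from_hypertension` are hypothetical names.

```python
# Toy illustration of an interaction between two variables.
# All risk values are made up; they do not come from POTTER or NSQIP.


def toy_risk(cirrhosis: bool, hypertension: bool) -> float:
    """Return a made-up complication risk from a two-level decision tree."""
    if cirrhosis:
        # Cirrhosis dominates: hypertension adds very little on this branch.
        return 0.42 if hypertension else 0.40
    # Without cirrhosis, hypertension accounts for all of the added risk.
    return 0.15 if hypertension else 0.05


def added_risk_from_hypertension(cirrhosis: bool) -> float:
    """How much hypertension changes the risk, given cirrhosis status."""
    return toy_risk(cirrhosis, True) - toy_risk(cirrhosis, False)


if __name__ == "__main__":
    print(round(added_risk_from_hypertension(cirrhosis=False), 2))
    print(round(added_risk_from_hypertension(cirrhosis=True), 2))
```

In this toy tree, the same variable contributes a different amount depending on which branch the patient is on, which is the kind of interaction a single fixed coefficient per variable would not capture.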
So we did this project and we created this calculator that you can download on your phone. It asks questions, and based on your answer, it takes a different branching point to a different variable and just keeps going until it predicts the risk of mortality, the risk of total complications, and individual complications for patients. So that's what POTTER is in a nutshell.
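The question-by-question branching can be sketched as a walk down a yes/no tree. This is a hedged illustration, not POTTER's actual tree: the questions, the `Node` structure, and the leaf counts (`deaths`, `total`) are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One node of a hypothetical yes/no tree (not POTTER's real tree)."""
    question: Optional[str] = None  # None marks a leaf
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    deaths: int = 0                 # leaf only: made-up count of deaths
    total: int = 0                  # leaf only: made-up count of patients


def predict(node: Node, answers: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Walk the tree, recording each question and answer along the way."""
    path: List[str] = []
    while node.question is not None:
        answer = answers[node.question]
        path.append(f"{node.question} -> {'yes' if answer else 'no'}")
        node = node.yes if answer else node.no
    return node.deaths / node.total, path


# Invented tree: the next question depends on the previous answer.
TREE = Node(
    question="intubated",
    yes=Node(question="septic_shock",
             yes=Node(deaths=30, total=50),
             no=Node(deaths=10, total=50)),
    no=Node(question="age_over_80",
            yes=Node(deaths=8, total=100),
            no=Node(deaths=2, total=200)),
)

if __name__ == "__main__":
    risk, path = predict(TREE, {"intubated": True, "septic_shock": False})
    print(risk, path)
```

Only the questions on the path actually taken are ever asked, and the returned path shows which answers led to the final estimate.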
Yeah. No. So from the inspiration from your daughter and Harry Potter to how it gets found in a search, that's fantastic, Haytham. It's great that you gave us that perspective. But yeah, no, this is fascinating. And I really applaud you guys for being pioneers in this space. I mean, obviously now that ChatGPT and other AI devices are all around us, it's become more a part of our normal day-to-day vocabulary.
But the fact that you stumbled upon this years ahead of all of us, I applaud you for that. Let's deep dive into the article. I mean, people have heard about the why. So this essentially was another validation study that you were doing with a different population. Dr Panossian, walk us through the study itself.
So from the article, obviously, the study design was that patients undergoing an emergency exploratory laparotomy for non-trauma indications at two medical centers between June 2020 and March 2022 were included. Talk to us about how you were able to structure the analysis, and then what you found from the study. Right.
So the main reason we did this study, first of all, was that we had some retrospective validations of POTTER, but what remained was actually having a prospective data collection and validating it in real life. So, as you said, we included patients who underwent emergency ex lap for non-traumatic indications at two centers here in Boston. And we included around 360 patients. And we gathered all the variables that POTTER needs to come up with a prediction.
And we also collected the outcomes that the patients had in real life, which were 30-day mortality, septic shock, pneumonia, prolonged mechanical ventilation, and bleeding requiring transfusion. And we put all of the data in POTTER and checked what POTTER's prediction of those outcomes was. And then we used the C-statistic, or the area under the curve, to calculate POTTER's performance in predicting the real-life outcomes.
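For readers less familiar with the C-statistic, here is a minimal sketch of how it can be computed from predicted risks and observed outcomes. It uses the standard pairwise definition (the share of event and non-event pairs in which the patient with the event had the higher predicted risk, with ties counted as half). The function name `c_statistic` and the six example patients are made up for illustration and are not study data.

```python
from typing import Sequence


def c_statistic(predicted: Sequence[float], outcome: Sequence[int]) -> float:
    """Fraction of (event, non-event) pairs where the event patient had
    the higher predicted risk; ties count as half."""
    events = [p for p, y in zip(predicted, outcome) if y == 1]
    non_events = [p for p, y in zip(predicted, outcome) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


if __name__ == "__main__":
    # Made-up predictions and 30-day outcomes for six hypothetical patients.
    risks = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]
    died = [0, 0, 1, 0, 1, 1]
    print(c_statistic(risks, died))
```

A value of 0.5 corresponds to predictions no better than chance at ranking patients, and 1.0 to perfect ranking.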
And what we found was that it confirmed our biases: POTTER performed excellently. Mortality had a C-statistic of 0.90, and for the other outcomes it ranged between 0.80 and 0.90. That's incredible. Talk to us about the logistics. I mean, I want to take a step back, because obviously, when you're doing something like this as a prospective study, you have to do a workflow integration that doesn't interfere with your day-to-day work.
Like, talk to us about how you set the study up to be able to run this. I'm fascinated by this because I'm sure a lot of our listeners are also thinking, oh, AI is out there, how the heck do we do this? So talk to us, Dr Panossian, about how you went about the workflow.
So, to ensure that we had quality data collection in a prospective manner, we included all the team members in the trauma research program, and we screened day to day for who was going into emergency ex lap, identified them, checked their preoperative variables and their labs, collected all of those, and followed them for 30 days to capture their in-hospital outcomes. So essentially you were shadowing the teams at all times.
I mean, you also have to have a surveillance network out there to perform a study like this. Correct? Very much. Yes. I'm going to say, you know, Vahe and the rest of the team members were joining the sign up trials for the trauma teams in two hospitals for two years. Wow. I mean, I think the positive about this, and this is where I was drawn to the study, and, you know, as a disclosure, of course, Haytham and I have been colleagues for many, many years.
And so whenever a study comes from this group, immediately my eyes are drawn to it. But the fascinating aspect, I thought, is that I love this concept of, you know, there's a term in economics called field experiments. And what that is, is real-world studies, real-world applications. Obviously, it's cumbersome to set up and everything, but this is, to me, a great example of essentially a surgical field experiment. Wouldn't you say?
I would say, I mean, we had enough data to be confident in the performance of POTTER from the national database. But then we were faced, Tom, with two questions when this was presented in other forums, whether big surgical meetings or smaller forums. The first question was, how does that compare to the surgeon's gestalt of predicting risk?
And the second question was, well, this is nice, but, you know, you did it off of a database, blah blah blah. How does it perform in real life, when the rubber hits the road? So over the last three years, these are the two areas we focused on. We're happy to talk about the gestalt one; I know there's also a lot of interest in that. But the question is: if we actually do the prediction upfront, not in a retrospective way,
and follow them for 30 days to see what happens, can we find the same things? And we did. Yeah. So it's a field experiment, as you said. Yeah. Yeah. I mean, like I said, to me that was where my mind was immediately drawn: this is a phenomenal opportunity, because, you know, all of us have been attacked for using large retrospective databases. And of course, if you get enough variables, you're potentially able to statistically prove anything you want.
But the thing I really love about this study is that prospective nature. Dr Panossian, a couple of questions for you. I'm just going to read a sentence from the paper, and I want you to react to it. The paper said, “the primary advantage of using a prospective design in this study was to validate the model in a real-world setting with data the model had never encountered before.
This prospective approach also ensures the model's robustness against,” and you called this, “‘data drift,’ which occurs when a model depends on variables with properties that may change over time.” Can you reflect on that? I mean, I think it's probably a term that a lot of people aren't familiar with, “data drift,” but talk to us about why it's critically necessary to do exactly what your group did by doing this prospective study. Go ahead. Right.
So the first reason has to do with the patients that we collected for this study and the actual design of POTTER. It was essential to test POTTER objectively on data that it had never seen before. It had to come up with predictions in real life, and we compared its performance with what it was designed on initially, the ACS NSQIP database. And since the POTTER algorithm was designed, we have made some iterations, updating it with the ACS NSQIP data.
But those data variables could change over time, as could the patient population, and in a regional manner: each region has different patient characteristics. So really we were interested in knowing how POTTER would perform in an academic center, a referral center, where at baseline the patients come in sicker and more critical. And I'm also curious about Dr Kaafarani's thoughts on this, and the data drift that happens. Yes, absolutely.
I mean, the data drift definition, if you want, Tom, in a very statistical, purist way, is when the statistical properties of data are changing over time, which can make the models, if you want, less accurate. It can occur in AI and machine learning, but it also happens in the classical methods we use. But the idea is the following.
So back in the first study, the initial project, unlike this paper, where we have 361 prospective patients, we actually had about 3 to 4 million patients. Out of them, about 500,000 were emergency surgery patients. But what we did, we divided that dataset into what we call the derivation and the validation sets. Right? So that was the main validation, meaning we trained it on about 80% of the data.
And then we said, okay, now you have the algorithms, you trained on it. Can you predict the other 20%? But there's an inherent bias in that, meaning, well, this is how the dataset itself was created. So if we trained on it, it is going to predict very well on it too, right? Yeah. Now, what Vahe was trying to say is that, with time and with changes in the nature of the dataset, we said: now we're going to create a completely different dataset that has nothing to do with NSQIP.
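The 80/20 derivation and validation split can be sketched in a few lines. This is a minimal illustration under assumptions: the function name `split_derivation_validation`, the fixed seed, and the integer stand-ins for patient records are all hypothetical, and the episode does not describe how the original split was actually implemented.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_derivation_validation(
    records: Sequence[T], derivation_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the records and split it into two parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * derivation_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Stand-in for database rows; real records would carry patient variables.
    database = list(range(1000))
    derivation, validation = split_derivation_validation(database)
    print(len(derivation), len(validation))
```

Both parts here come from the same source, so they share however that source was collected. A separately collected cohort would never pass through this function; it would only be scored by the already-trained model.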
And we're going to see if POTTER still performs to its promise. And that's what it did, even though the number of patients was much smaller; it took us two years in two centers to get 361 patients. But it proved that on a dataset it's never been trained on, it's still performing well. Yeah. You're right. And I think that that's the fascinating aspect.
I think as people are trying to wrap their minds around AI applications, you know, that's the unbelievable potential. Rather than doing it the traditional way, you know, a retrospective analysis where we do a regression analysis and then we have just a static risk calculator, I think, and correct me if I'm wrong, what the AI application does to take it to the next level is that it's almost adaptive. That is, whatever real-world information is coming in,
the risk assessment is performing robustly. Correct? You got it exactly right. It's adaptive. It's reiterative. And POTTER has one more additional characteristic which is really important, Tom, if you'll allow me to point that out. Yeah. Go for it. It's adaptive, meaning it's continuously learning from its mistakes. That's the adaptive part: the more mistakes it sees and knows about, the more it will keep correcting.
That's one. Two, it's reiterative, which is really important. Meaning, you can download the app and see that when you answer a question, it really draws on the entire database that it was trained on to give you the next question. So if you answer a question, let's say, is the patient in the hospital, is the patient intubated or not intubated, the yes or no dictates what the next set of trees is going to be.
And then the next question is, does the patient have a prior history of something? If the answer is yes, it reiterates; it just reanalyzes this again. And the third characteristic, which is really important, and I do want to take the opportunity to point it out because not all AI is this way, is that POTTER is transparent, and that is a very, very big deal. Why?
Well, because a lot of the AI methods are what we call black box, meaning, almost like a religion, you just say, I'm going to give you the data, you do your magic and you give me the output, and you just need to believe in it, right? POTTER does not do that. With POTTER, because of the optimal classification trees method, you can follow the logic of the AI. Why is this important?
Because one of the things that a lot of people are pointing out, day in, day out, is that AI has the risk and the potential of bias. If your database is biased and has disparities, let's say in the care that we provide in the country to minorities, whether African-Americans, whether Hispanics, you name it, then if you train on a biased database, you can incorporate that bias into your algorithm without ever knowing. Correct.
This is why I'm a big advocate of transparent AI, because you don't want this to happen: you will pretty much consolidate bias into the algorithms that we are using to tell people how to deliver care. So transparency is really an important characteristic. POTTER does that. You can follow its logic. You can see why it's doing what it does. It tells you: the reason I'm doing this is because these are the characteristics I took into consideration, one after the other.
And that's the number of patients who died out of the total number who had the same characteristics. And this is why I'm giving you this prediction, so you can follow its logic. And I think that's really, really important, so we don't create problems down the line. Amazing, amazing, amazing perspective. Well, towards the tail end of this interview, let's focus on two things. The first is I want to read two sentences from your paper,
and then really talk about what's next. So from the introduction of your manuscript, you started out by saying, “Postoperative complications occur in 15% of the 19 million surgeries performed yearly in the United States, resulting in significant morbidity and mortality and additional health care costs exceeding $31 billion per year.” I mean, just a massive amount in terms of impact. “Predicting postoperative outcomes is critical for appropriate counseling,”
I mean, that's the reason why we did this study, “as well as resource allocation and benchmarking of quality of care.” And the question I really have is about that benchmarking, because the conclusion of this manuscript was really eye-opening, when you said that this is the first prospective study showing that the AI-powered surgical risk calculator POTTER accurately predicts postoperative outcomes.
But really, it's more about, you know, this smartphone-based application: how do we make it widely available? Right? So the question really becomes, what's next for your research group? Are you planning on doing other prospective studies? Or, maybe another question for Dr Kaafarani: what are the health care policy implications of something like this? I'll start with Dr Panossian first, and then we'll go to Dr Kaafarani.
Go ahead Vahe. From my end, I think, the next two steps for POTTER would be updating it with new data from the ACS NSQIP, making it more robust, with the more recent data. And also, having a comparison of non-operative patients. So the very nature of ACS NSQIP is all patients had surgery, right? So when we're interacting with POTTER, we're interacting with a patient sample who actually had the surgery. So we cannot really decide should I do the ex lap or not.
If the patient asks, what are my chances of dying if I don't do the ex lap? Well, I don't know. POTTER cannot answer that, at least in its current version. But I think it would be interesting to validate it in a patient population that didn't have surgery, who were managed non-operatively, and see how it performs. That's incredible. Dr Kaafarani? Yeah.
I mean, Vahe is very insightful on this. That's actually one of the downsides of POTTER: it's only been trained on patients we operated on. So when people try to use it to say, should we operate or should we not operate, it really was not designed in that fashion. So my advice usually is, only use POTTER if you have already decided to operate. Because, for example, the highest prediction of mortality in POTTER is 73%.
And people sometimes try to use it for patients who clearly we should operate on because they're going to die either way. And they're like, what, 73%? This is a 95% mortality rate. Well, that's because POTTER was not trained on patients we never operated on because they were too high risk. So that's one. But to go back to your question, where do I see this in the big picture of health policy? I have a small, very tiny dream, and it's very convoluted.
What I'd like us to address is the one problem you're dealing with on a daily basis in your own leadership roles, Tom, which is: you bring data to your teams and you say, well, your risk of infection is much higher than your compatriots' or the national average, can you tell me what's going on? And everybody's response is, well, my patients are different, right? My patients are high-risk. And you know what? They're not completely wrong.
I mean, we do know the surgeons who take the high-risk patients. And it's not fair if we hold them to the same standards, because it's almost a disincentive for them to take care of the sicker patients, who probably need the surgery the most. So risk adjustment is always the problematic component
every time we try to measure quality at a very big level. And we do a pretty good job with a lot of the methods we have, whether it's Vizient, through administrative databases, which I don't think is as good, or NSQIP, for example, which has a very robust risk adjustment model. But they're almost the equivalent of the Priuses among cars. Can we get a Ferrari to do better risk adjustment? I think that's what AI can do.
We started dabbling, from a research point of view, in the question: can you use AI to benchmark, to better risk adjust, where you are actually taking into consideration these interactions between all the variables that are not visible to the human eye? And we've published a paper showing that you can conceptually do that. There's no question; we proved the concept. But I think we're still in the very early days of using AI and optimal classification trees for benchmarking. That's what I'd like it to be used for.
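One common way to express risk-adjusted performance is an observed-to-expected ratio, where the expected count is the sum of each patient's predicted risk. Using that ratio here is an assumption of this sketch, not the benchmarking method described in the episode. The function name `observed_to_expected` and all numbers are invented for illustration.

```python
from typing import Sequence


def observed_to_expected(predicted_risks: Sequence[float],
                         outcomes: Sequence[int]) -> float:
    """Observed events divided by the sum of predicted risks."""
    expected = sum(predicted_risks)
    if expected == 0:
        raise ValueError("expected event count is zero")
    return sum(outcomes) / expected


if __name__ == "__main__":
    # Made-up numbers: two surgeons with the same raw complication rate (2 of 4).
    low_risk_panel = observed_to_expected([0.1, 0.1, 0.2, 0.1], [1, 0, 1, 0])
    high_risk_panel = observed_to_expected([0.6, 0.7, 0.5, 0.7], [1, 0, 1, 0])
    print(round(low_risk_panel, 2), round(high_risk_panel, 2))
```

In this toy case the two surgeons have identical raw rates, but the one operating on sicker patients has a ratio below 1 while the other is well above it, so the quality of the comparison depends entirely on how good the predicted risks are.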
I'd like us to use AI so that we can better compare apples to apples when we're benchmarking quality of care across hospitals, across systems, and across individuals. Well, I can think of no better way to wrap up this episode. I mean, it's really a call to action for all of us, with opportunities for all of us as surgeon leaders to engage. And, optimistically, I think the future is very, very exciting. Dr Kaafarani, Dr Panossian, thank you for joining us today on The Operative Word,
the podcast of the Journal of the American College of Surgeons. Thank you for listening to the Journal of the American College of Surgeons Operative Word Podcast. If you enjoyed today's episode, spread the word on social media by using the hashtag #JACSOperativeWord. Subscribe to The Operative Word wherever podcasts are available, or listen on the American College of Surgeons website at FACS.org/podcast.
