¶ Intro / Opening
Apply now for the Morris Fishbein Fellowship in Medical Editing. This is a unique one-year fellowship offered by JAMA to introduce physicians to all facets of editing and publishing a major medical journal. The application deadline is January 5th, 2026. For more information, including how to apply...
I'm Yulin Hswen, Associate Editor of JAMA and JAMA+ AI, and you are listening to JAMA+ AI Conversations.
My guests today are Dr. Jonathan Chen, Associate Professor of Medicine at Stanford University, and Dr. Ethan Goh, Executive Director of Arise Research Network at Stanford University.
¶ Evolving AI Models and Diagnostic Impact
It is great to have you both again. It has been a little over a year since your very popular article in JAMA Network Open, Large Language Models and their Influence on Diagnostic Reasoning. That's where we really saw an AI chatbot outperform physicians, right? And so I want to hear from both of you, your thoughts after this year and any updates you've had in your research area since this very big and popular article. So I'll start with you, Jonathan.
Sure, great. Those were surprising findings back then, but they just raise more questions, essentially. So we've had follow-up studies looking at management questions. Diagnosis is a classic test question, but most of what you struggle with in practice is usually not diagnosis. Sometimes it is. It's usually like, what do you do? The patient had a side effect. Should you keep going? Should you keep them in the hospital? How would you counsel them? And there we've also found humans and computers can do well together, but still often not as good as, or no better than, the computer by itself.
And a broader theme, and there's a lot more we can talk about, is that some of those studies were done with GPT-4. For those paying attention, that's two-year-old ancient technology. What are you talking about here? Is it GPT-5, multiple reasoning models? We're in this bizarre world where the pace of technology is moving way faster than the peer review cycle. So we have multiple follow-up studies in preprint, just stuck in peer review for months because that's how long the process takes. And for example, we've tried multiple reasoning models as well. These are not chatbots that just fill in the blank, autocomplete on steroids. They kind of talk to themselves in a chain of thought. And it turns out it looks like they can do even better on some of these things, which really raises a lot of complex questions about the right role for humans and computers in the coming ages of medicine.
You talk about these multiple reasoning models like everyone knows about them, but can you tell our audience what they actually are and what their uses are in the clinical arena?
From a user point of view, you probably wouldn't notice a difference, whether you're on Claude, Gemini, or ChatGPT. It will just start to default to those. You just ask your question. And now, GPT-5 actually isn't that much more powerful. It is, but mostly what it does is, depending on what you ask, automatically triage your question. Maybe you just needed a quick autocomplete answer because you want a fast answer, or maybe this is a more nuanced medical reasoning dilemma that's not so obvious, where you can't just look up the answer, and it can go into reasoning mode. I'm way oversimplifying, but basically it will talk to itself in the background, which is just like your team, right? You go back and you discuss a differential before you go back to the patient, and then you're trying to look smart by the time you get back to them. And so you notice how sometimes you ask these things and it now takes 30 seconds for it to get back to you. It's because it's talking to itself for a while. Sometimes it's looking up additional sources. And it just turns out that with reinforcement learning, where you give it targets to aim for, it seems like it often does substantially better at complex reasoning tasks.
¶ Addressing AI Harms and Physician Training
And Ethan, last year in our conversation, you mentioned how GPT-4 and these kinds of models have already leapfrogged research and obviously diagnostics. And so what do you think about these new types of models that are now talking to themselves? How do you find it, being an engineer yourself?
Yeah, very exciting. I think the broader trend, as John mentioned, is that these models are getting better and better and saturating one benchmark after another.
And I think some of the obvious research directions that it opens up are, as you said, everyone, when we published those results, was like, hey, guys, don't leave us hanging. AI alone outperformed AI plus doctors. But surely we know that's not going to be how it will be. And already some results have been showing that if you do different things, such as training doctors to use AI better, or, for example, being more intentional about workflow design, then you can have better outcomes. But I think more broadly, one of the studies that we're really excited about, if you read all the news media about the lawsuits that OpenAI is facing, is about the fact that no one can really answer how much harm is actually being produced by these outputs, or by doctors or patients using the AI output. Some of these companies are claiming that up to 100 million patients are being affected by doctors using AI-like tools. So that's astounding, not to mention all the patients using it. One other thing worth noting that I've been really fascinated by is that as these models get better and better, paradoxically, when they produce harm, it's more likely to have an impact. Why is that? Because doctors feel like, oh, everything, click yes, accept, review, accept, minimal edits, right? And so when it produces a harmful output, the doctor might find it a lot more challenging to pick that up. So like everyone else, we've been basically developing our own AI doctor-like systems. And we were like, hang on, this obviously is going to be quite good, right?
But what about the question, in trying to implement and operationalize, can we actually articulate how often harmful outputs are going to occur and how severely harmful those outputs are going to be? Because that is ultimately going to drive, I think, some of these outcomes. And we wanted to be able to articulate that. So some of these studies are things that we're actively thinking about as well.
And so you mentioned training doctors. How do you then train doctors, the ones that have been doing this work for decades, and then the new doctors that are coming in? How do you train them to use these AI tools safely and effectively?
For better or worse, I've been named the first Stanford Director for Medical Education and Artificial Intelligence. So I guess it actually is my job to figure this out. It's really complicated because there are all different tensions at play. As Ethan mentioned briefly, we have another active study, again, stuck in preprint peer review because it takes forever. It was a very bright medical student, Celine Everett, who led this one, From Tools to Teammates. In our prior study, we had this really bizarre result where the computer seems like it's better than the doctor at reasoning. That's crazy.
What actually happened is a third of the doctors at the time had never touched a chatbot AI system in their life. Another third had maybe used it once or twice. They clearly did not know what it was. They didn't even understand. They were treating it like Google. They were just searching "differential for eye pain." Did you know you can copy and paste the entire story in there and ask follow-up questions? Maybe now that seems more normal. Three years ago, that would have seemed bizarre, right? You would not have thought you could interact with a computer that way. So now in this updated version, we train the doctors. It teaches them in real time how they can use it. And then they do better. That combination is more effective.
¶ Understanding AI Bias and Mitigation Strategies
If that's the case, who should go first, the AI or the human? Do you have AI write your first draft for you and then you patch it up? Or do you write the first draft and have AI grammar-check it for you? It turns out both can be good, and both demonstrate sycophancy and anchoring bias. Whoever goes second tends to just agree with the first. And so realize, if you want an objective answer, don't tell AI what you're thinking. Otherwise, it's just going to agree with you because it wants you to like it.
So I learned that one of the biggest reasons why the chatbots are so agreeable is because, in the testing that they did, users would rather have the chatbot say two plus two is five, you're amazing, you're a genius and you invented a new equation, you're so wonderful. And it's an issue because it feels great for someone to say you're correct. So it's this reinforcement of wrong answers either way. So I understand why AI says yes, because of the feedback that people like it. Why do we trust AI so much?
I think there are two parts to it. One, we need to remember that these things are trained to be super engaging because the metrics for these companies are like, great, if people are using our product more, that helps our business, right? So that's the first part. The second part, I think, is a broader commentary. If things are generally correct, like when you go to a doctor, you expect this expert to always be saying correct-sounding things. And who else would you have to turn to, bearing in mind that most people don't have a doctor in the family? So in that instance, ChatGPT might sound like it's giving them the best advice. And when things are 95, 97, 99% correct, are they going to miss the 1% of things that are wrong? So I feel it's a lot of that, but that's my take. I don't know if Jonathan has a take on that as well.
Actually, the specific term is automation bias, right? When something is right 90-plus percent of the time, there's a supervision fallacy. We're here to double-check the AI's work, right? When it's right that often, you eventually just stop paying attention. It doesn't pay off to double-check when it's right so often. And this isn't AI-specific. This is a real issue with all sorts of processing technology. You follow the same guideline protocol. I'll give a concrete anecdote. In our children's hospital, we take care of a lot of babies that are born. Our OB-GYN residents are taking good care of them. And then one day, Epic went down, right? And people had to go back to pen and paper to put in orders. And a new baby was born. Yay. So go take care of the new baby orders. And the resident was like, wait, what are new baby orders? I think there are some vaccines. I think there's a vitamin K shot. They don't remember anymore because the computer always just took care of it. And that is a sensible thing to do. But you can see what happens when we become dependent on technology for something. It enables us to do a lot, but technology availability can also cause issues when we're so used to trusting it.
So then how do you stop that automation bias? What are the checkpoints that you need? Again, as you were saying, we do it in everything. It's not just AI. How do you prevent that?
So we have to be very intentional about that whole workflow design, right? How is the prompt suggested to the doctor, to the patient? What are the things that call out to them? We can use design to signal which things are worth paying more attention to.
Every human has limited cognitive attention. There are only so many things we can focus our attention on. And we have to be very intentional about that. So one example, right? If we know that citations are known to be hallucinated more often than not in this instance, could we, in the product, call out, hey, doctor, please check the citation, because they're known to be wrong? There are those little things that we can do, and that's very much unexplored territory. So that's my view about how I think a lot of these could be somewhat mitigated.
Sure. I actually think what we're doing right now is a little bit of education or outreach. I may be overly optimistic about what it can do, but you can also start with disclaimers. One of our colleagues, Roxana Daneshjou, recently pointed out that more and more of these AIs don't even bother with the disclaimer. Hey, don't use this for medical advice. They just don't even bother saying that, because they know nobody's paying attention anyway. So then you need some calibration and feedback. Hey, everybody, these are cool things, but realize sometimes it goes off. That's very powerful. You can use it. As long as you understand how they can go wrong and get that feedback once in a while that it's off, it can be okay. That's balanced by, this is sycophancy, right? These things are not designed to be right. They're designed to be engaging. They're designed for you to keep coming back. And as you said, if they give you a wrong answer, but that makes you happy, the incentives are all warped over here. And I would like people like those listening to this podcast right now, the healthcare ones, who have the right values, to drive where this should go. Tech companies should not be driving how healthcare is driven.
Yes.
What are you teaching your students? What is the curriculum of the next year? Considering you just said that ChatGPT 4 is old news, we're going to get ChatGPT 10 in maybe three weeks. We don't even know. What do you do? What do you do in this environment?
That really did happen. Our first paper was with ChatGPT 3, and then it was in review, and then three weeks later, the editors said, sorry, GPT-4 came out. Your study's obsolete. Redo the whole thing.
Oh, my goodness.
It's quite a dilemma. We have it online. It's publicly available. We had the first AI and med-ed symposium at Stanford back in June, and we're trying to make all our materials broadly available. And it is tough. We have this adoption curve. There are people probably listening here who are early adopters, right? You've probably already used some of these tools. Do you also know what NotebookLM is? Have you tried a conversation with a chatbot? Have you used OpenEvidence? I bet a lot of people listening to this podcast have. If you haven't, you should try these things. And at the very basic level, we're trying to get everybody on a level playing field. At least log in and use these. Otherwise, your peers are going to race way ahead of you, and that's very inequitable as well. But know what a confabulation, a hallucination, is. Know that even OpenEvidence is so powerful, and yet it will give you different answers if you ask the same question, and that's problematic and dangerous. And we have to calibrate ourselves to that.
Most people haven't bothered to figure out a lot of this stuff. I talked to a community of two to three hundred physicians and gave a grand rounds. And I talked about how this thing will just hallucinate. It'll confabulate references. And some were shocked.
They were shocked. That is like two-year-old news. That should not be news anymore. But you realize a lot of the community doesn't even know some of these basics, and they are going to hurt themselves. And they're going to hurt others because of these very subtle, very easily missed caveats, when the thing is so seductive. It looks so compelling that we have to empower people to know the difference.
I do like that word, seductive. It very much is, especially, again, because it speaks so eloquently, but it's very agreeable. Right. So it seems as though it's authoritative, but it really is this kind of cycle of patting you on the back. But having a very authoritative figure pat you on the back is very different. So I think that's really interesting.
Ethan, I want to ask you a follow-up question. I know that Jonathan had said that like a third of medical students had not used any of these tools. Why do you think that is the case? Who are the people that are not using this at all, and why? What's going on, especially generationally?
I think that some of it is just that medical students are busy. I think there's so much to do with the curriculum. And what is another new tool for us to learn? So they're just trying to catch their breath, right? I think that's at one end of the spectrum. And then obviously what I've heard anecdotally is that some more senior physicians already have their own workflow, right? UpToDate. It's like our parents' generation, right, going from paper maps to, what's this app? What's Google Maps? I think it depends on the demographic as well. So that's my two cents.
¶ The Future: Agentic AI and Safety Research
So then what are you two doing in the next year? I'm hoping that we maybe have this conversation every year. And now I want to know, what will you be doing in a year? What do you think is going to be the biggest thing next year when we have this conversation?
If chatbots three years ago were like, what is this thing, it's barely usable, and now, whoa, it's like taking over the world, then I think agents, agentic AI, is the next thing. It's actually really not usable right now, but you can see where things are going in the next few years. We're building the prototypes. We published an article, MedAgentBench. I don't want advice. I don't want you to talk to me. Chatbots say things. Agents do things.
Go check the patient's potassium and just replace it as needed. I don't want you to send me a report. I want you to just do it. Just take care of it, basically. That is clearly a possible direction. The systems don't work well enough that you would rely on them for medical care yet.
But we're showing it's become possible. We can evaluate it. We're now starting to build out the prototypes. And it's not a technology problem. It's like a governance, safety, reliability problem. We have to figure these things out. And we're starting to assemble those pieces now. And Ethan can talk more about even nearer-term work on benchmarking.
So 2026, I think, is really exciting. For me, at least, there are two big parts, right? One is on the Arise side, one is on the study side. So Arise, as you're aware, is a research network: Harvard and Stanford labs, John, Adam Rodman. What we've seen a lot is that a lot of people don't really understand the research implications. They want to understand how to use these latest tools, as John mentioned. So we're actually launching a new course at Stanford that is going to try to share some of these implications and concepts to help health systems adopt these technologies.
And we are also looking to put out a bit of a research roundup, right? I think a lot of times people don't read the research much. So we're going to try to help translate a lot of that, whether it's a report, whether it's a demo. So more of that to come in 2026.
And then on the study side, as John mentioned, David Wu, a really amazing fellow working with our group, is leading a study to help us quantify how harmful these chatbots could be, because we believe that's really one of the most urgent questions right now.
Again, because if you are the model company or the solution provider, everyone is just shouting, right? Look at how good we are. Look at how amazing we are at the USMLE. Look at how much better we are than doctors. Someone needs to, at some point, say hang on, right? When things go wrong, how badly can they go wrong? And we firmly believe that the whole field is under-indexed on that right now in the race to see how amazing some of these clearly powerful models are.
So I think broadly, those are the two trends. And then also on the study side, I think there's a lot more work around AI plus physicians, AI plus doctors, broadly. We do believe that we will be better, and we want to be a lot better. We should be a lot better. But it's about how to drive those changes, as John mentioned: training, intentional workflow design, UI/UX, product design. We're working with cognitive psychologists as well, Laura Zwaan, to really figure out some of these things on that AI plus doctor side of things. So lots of exciting things to come.
I'm really happy that you're doing this work, especially in the area of looking to see what harms there are, because there's always the publication bias, right? Everyone wants the shiny, this-worked-really-well result. It doesn't showcase the dangerous side, because it's scary, but it's also maybe not as sexy as saying, here's this new product that works. And the tech companies are very big at promoting that, and so we need research that is showing the kind of underbelly. And so I'm really happy that you two are taking it on. Thank you again for this wonderful conversation, and we'll be in touch and hopefully we'll check in next year.
This is great. Thank you.
This episode was produced by Shelley Steffens at the JAMA Network. You can find a link to the manuscript in the episode's show notes. To follow this and other JAMA Network podcasts, please visit us online at jamanetworkaudio.com or search for JAMA Network wherever you get your podcasts. Thanks for listening. This content is protected by copyright by the American Medical Association with All Rights Reserved, including those for text and data mining, AI training, and similar technologies.
