Welcome to Tech Stuff. This is the Inside View. I'm Oz Woloshyn, here with Karah Preiss.
Hello. So, Oz, I'm very curious to know more about the story you've brought me this week, since it's a topic we've discussed a lot on this podcast.
Yes, so today I've got a story about AI in healthcare, specifically AI and diagnosis. I spoke with Dr. Matthew Lungren, who is the Chief Scientific Officer for Microsoft Health and Life Sciences, about a blog post that Microsoft recently published with the title The Path to Medical Superintelligence.
Do I want to know what medical superintelligence is? It's more big than just regular intelligence. But I actually heard about this study. It was everywhere, and if I remember correctly, it was that the AI was better at diagnosing than doctors.
Right, yeah, that's right. In fact, four times better. There was a headline in Time magazine which really says it all: Microsoft's AI Is Better Than Doctors at Diagnosing Disease. Special shout-out here to Elliot Fishman, who's our old friend. He's a professor of radiology at Johns Hopkins, and he runs this fascinating email group that discusses new developments in AI. Matthew Lungren and I are both members of this group, and Matthew is also one of the authors of the study.
What kind of doctor is Dr. Lungren?
Like our friend Elliot Fishman, he's a radiologist by training and has a public health background. He was hired at Stanford, where he started using machine learning to analyze large data sets. Here's Matthew.
Eventually my lab grew into a very large AI center at Stanford, which bridged the computer science department and the medical school and kind of saw the translation of the newest techniques into healthcare applications accelerate. Taking that work further, I went to Microsoft on sabbatical at Microsoft Research and realized that a very similar opportunity was there in big tech, if you could start to connect the latest technology to problems in healthcare. And so that's how I came to be here, and that's kind of what I still do all day.
And Matthew is also one of the authors of the Microsoft study.
I believe that the human expert plus these expert systems together will ultimately deliver better care.
No matter what profession you're in, there's always a gray-haired person that has, you know, in some sense, seen it all and kind of compressed that into their brain and their pattern matching in a way that is just faster than folks that don't have as much experience. And that's true anywhere, but certainly in medicine, right? I think that with the assistance or ability of AI to now sort of connect dots in ways that maybe can achieve that wisdom or that experience and bring that to the surface, it's kind of an unprecedented time.
It's an exceptional performance: four times better than human doctors. One of the things I found most interesting about the study was that it wasn't just one single AI model doing a diagnosis. It was a whole team of AI models that were able to talk to each other in order to come up with hypotheses, order tests, and ultimately come up with a diagnosis.
So multiple AI models seems a little bit unfair.
Yes, and in fact we talked about this. The doctors in the study were not allowed to call specialists to help them with their diagnosis, but the AIs were allowed to talk to each other. So doctors are not going to be made obsolete anytime soon.
Well good, because I have a physical coming up and I don't need four AI models being like, well, this girl got real big this year.
Now, as you and I already know, people are already using AI regularly to diagnose themselves. In fact, I think more than ten percent of the overall ChatGPT traffic is around medical stuff. This is not always music to the ears of doctors, so it was interesting to look at an example where this is actually an AI built for doctors and to work with doctors, rather than patient facing.
And the other interesting thing for me, which we talk about with Lungren, and which we'll get to, is how this idea of multiple AIs talking to each other can simulate the experience of the best hospital systems in the US for people who otherwise might not have access to these panels of experts.
I can't wait to hear what you learned from him.
Well, here's the rest of my conversation with Dr. Matthew Lungren. So you're a trained doctor, and I want to start with the basics, which is diagnosis. I'm not sure when the last time you made a diagnosis on a patient was, but I'd love to hear from you as a doctor: what is the process of diagnosis?
Yeah, I mean, it depends quite a bit on the specialty. But as most people know, the classic image of a physician, right, is to speak with the patient, kind of do a Sherlock Holmes kind of thing. Everyone's seen the shows like House, and things are kind of sensationalized in the approach. But really there's a lot of unknowns that you have to tease out, right? You have to interview the patient, you have to obviously interpret labs and other information, and you have to start to narrow things down and order appropriate tests. Try not to chase too many of what we call the zebras, but keep those in mind in case you're dealing with one, and...
The zebra would be the classic House episode.
Right, yeah, right. Well, every House episode is a zebra, which actually has some relationship to the study we're going to talk about today. But in general, it's more common to have an uncommon presentation of a common disease than a common presentation of an uncommon disease, if that makes sense.
Right, right, right. And this kind of relationship between AI and doctors has been going on for a few years. I remember reading a great piece in The New Yorker about how one of the challenges for AI was that the best doctors can't actually tell you in words why they're good at making diagnoses.
That's right. It's interesting. I think humans have many cognitive biases that are well understood, and I think, you know, it's about keeping those in check while also trying to leverage the information in front of you, and not be affected by the case you just saw, or something you just heard at a conference, or an error that you experienced years ago that's still impacting the way that you think about diagnoses. And I think those biases have been well published and discussed ad nauseam in healthcare, but we're kind of dealing with this new human plus AI dance.
That's fascinating. Yeah. I mean, I actually slipped and fell down a few stairs at the weekend and bashed my head slightly on one of the stairs, and then didn't feel very well, and I was like, I wonder if I could be concussed. So I did a selfie and sent it to ChatGPT, and it said my eyes look fine. So actually, if I'd been more worried, I would have gone to the doctor. But there's kind of a dark side to that as well.
Yeah, I mean, I think it sounds like you did okay. But I would say that there's the old saying in healthcare, particularly during the rise of the Internet, right, which is kind of the other similar technological advancement that impacted healthcare. We used to say to our patients, you know, your Google search does not replace our medical degree, right? And that wasn't meant to be condescending, but it was just sort of like we had to pull them back from the abyss of going down a rabbit hole where every ache and pain was immediately terminal cancer, right, that kind of thing. But today it's different. It sort of references the experience you just mentioned, and that's happening everywhere. In fact, at the recent OpenAI launch of GPT-5, they spent fifteen minutes talking with a patient who went through a very difficult battle with cancer and worked with the model herself, and was able to have very complex medical jargon explained to her in plain English. It was able to help her with questions to ask the physician. And as someone who still practices and sees patients today, I have to say my patients are better informed than maybe ever, and it's kind of changing the bar on this classic information asymmetry problem, where the patient has to kind of keep up with the technical speak and all the information that we spend decades learning.
It feels like there's almost a more level playing field. So I can have this conversation with my patient almost at a peer level, right, and then we can go through the care journey together. I'm extremely excited about that prospect.
Taking a couple of steps back, I mean, you mentioned you've been in and around this since twenty twelve, twenty thirteen. Why do people want to use AI in medicine?
Well, it's an incredibly challenging discipline, and it has only become more so, maybe in the last ten or fifteen years. One of the things that is going on is that medical information is doubling roughly every ninety days. That trend has been going on for a really long time.
And what is that? Publication of papers?
Publication of papers, new therapies, new guidelines, all these things keep stacking up, right? And so just because you've been through medical school and training, right, we have lots of systems in place to help us continue our education. But really the reaction to that has been to subspecialize, and in some cases sub-subspecialize. So to give you an example, I am a diagnostic radiologist, so that's the bigger specialty, and then I specialize in interventional radiology, which is image-guided procedures, basically, and then I am further specialized in the pediatric version of that. So that's like a Russian nesting doll of specialties. And you see that throughout healthcare. That is partly due to the complexity of care that's required for some patients, but it's also due to the information tidal wave and being able to hold all that in a human mind, right, with all of our limitations. And so AI, I think, at least the work that we've been doing here, is starting to provide a counter-narrative to needing to be sub-subspecialized in order to be able to manage information and take really good care of your patients across a wide variety of complex diagnoses. And I think that's really where the excitement is right now: can I use this system to augment my ability to care for patients?
And why isn't AI more ubiquitous in medicine? What has been the integration challenge up until now?
Well, there's a whole podcast just on that, Oz, I would say. But the short version is that we have been an incredibly skeptical discipline, skeptical of new technology and at the same time extraordinarily risk averse, for good reason, right? We require significant evidence to change the way we practice. As you know, clinical trials take years and years, and some still fail, actually many fail, and we accept that as the system that keeps our patients safe and keeps us on the cutting edge. I think in terms of just the technical mechanics of adoption, we have a very rigid system in the software world, too, and that is changing. Again, what's so exciting about this is that any physician can pull out their cell phone and interact with this cutting-edge AI without needing to have, you know, three- or four-year-long cycles of integration with software, right? And it's just the early days, but those are the trends that we're seeing.
Just to take a step back, I guess the classic model of measuring AI performance versus doctor performance was to present a hard problem or a hard diagnostic conundrum, ask for an answer, and measure answer versus answer. How is that different from what you've done?
Yeah, well, it's even less precise than that. Up until now, at least for large language models, when people talked about them having medical capabilities, they were actually using medical examination questions. So there's a question stem and then there's a multiple choice answer. That's not medicine, but it is how we, you know, qualify our humans, right, human physicians, to be granted a medical license. So we kind of used that for a long time as a surrogate or a bellwether. But it wasn't...
Could it pass a test to be a doctor, rather than could it actually be effective at acting as a doctor?
That's interesting, right? And we were able to show very early on with GPT-4 that these models outperform physicians on these multiple choice tests. But there's all kinds of caveats there. Is that really medicine? Has it seen some of that data in its training? Assuredly, yes, right? And is that useful? I think those questions came up. Now, in practice, it's estimated that ten to twenty percent of AI interactions with these common chatbots like GPT are around a medical use case. So we know that someone is getting value out of that somewhere, right, and we see it with our own eyes. So how do we bridge the gap to something slightly more realistic, in terms of not giving you all the information up front, just like we would in real healthcare? One of the principal thoughts around the study was, is there a way to take advantage of the incredible capabilities that these models have in medical diagnosis and knowledge, but also push it a bit further and not have it kind of just be a question-answering machine? And so we thought, can we kind of have several versions of the model act as different humans, this is that idea of an agent, and give them jobs? One job is to look at the economics of the tests that you're trying to order. One is to question your next decision point. So the information isn't just in and out with one model, it's actually in and out through a system of models. And we showed that no matter what model you use, whether it's Google's model, whether it's OpenAI's model, whether it's an open source model, it improves that diagnostic capability on these extraordinarily challenging diagnostic tests.
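As a rough illustration of the "system of models" idea described above, here is a minimal Python sketch in which each role is a plain function rather than a real model call. Everything in it is an assumption made for illustration: the test names, the prices, the budget, and the helper names `cost_agent`, `challenger_agent`, and `orchestrate` are hypothetical and are not taken from the Microsoft study.

```python
from dataclasses import dataclass, field

# Hypothetical prices in dollars, for illustration only.
TEST_COSTS = {
    "complete blood count": 20.0,
    "coagulation panel": 60.0,
    "bone marrow biopsy": 1500.0,
}


@dataclass
class Case:
    """Shared state that every agent reads before weighing in."""
    budget: float
    ordered: list = field(default_factory=list)
    spent: float = 0.0


def cost_agent(case, test):
    """Economics role: approve a test only if it fits the remaining budget."""
    return case.spent + TEST_COSTS[test] <= case.budget


def challenger_agent(case, test):
    """Devil's advocate role: object if a cheaper test has not been tried yet."""
    cheaper_untried = [
        t for t in TEST_COSTS
        if t not in case.ordered and TEST_COSTS[t] < TEST_COSTS[test]
    ]
    return not cheaper_untried


def orchestrate(case, max_steps=5):
    """Each step, order the first candidate test that every agent approves."""
    for _ in range(max_steps):
        candidates = [t for t in TEST_COSTS if t not in case.ordered]
        approved = [
            t for t in candidates
            if cost_agent(case, t) and challenger_agent(case, t)
        ]
        if not approved:
            break  # No test survives the panel, so stop ordering.
        test = approved[0]
        case.ordered.append(test)
        case.spent += TEST_COSTS[test]
    return case


result = orchestrate(Case(budget=100.0))
print(result.ordered, result.spent)
```

In a real system each role would presumably be a separate prompt to a language model; the point of the sketch is only that a decision passes through several checks before anything is ordered.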
So you had ten co-authors on this study, and, you know, as we talked about, when it was released it took the world by storm. So, I mean, how did you go about designing the study, what was the hypothesis, and what have you found?
So this was a cross-Microsoft collaboration, but Harsha Nori, who is the lead on this, really wanted to say, you know, we have a lot of evidence that these models perform well on these standardized tests, and then we see the real world situation, where that's not how people present. They don't show up with, hey, these are all my tests, these are all my problems, and these are the four choices of what I may have, right? And then taking what are essentially some of the most difficult questions out of the New England Journal and structuring them in a way that requires a model to ask for more information or order tests, just like a physician would. The hypothesis was that that would be interesting in and of itself, but then what if we also put humans through that same system? In other words, here's the first step: headache. Okay, what do you do next? Well, do I need to ask more questions? Do I need to order a test, et cetera, et cetera.
One of the really brilliant outcomes here was that having that system of agents, as opposed to just the single model, allowed us to have a more realistic understanding of the capabilities. In other words, if I wanted to know the answer, and I'm a chatbot, my answer could be, let's order every single test that there is, and that would probably get you the right answer. Is that feasible? No, right?
Yeah.
So forcing it to think about resources, the cost of the care, actually found a very interesting, what we would call the Pareto frontier of capability under constrained resources. So they were actually getting to incredible diagnoses very, very accurately, but also cost efficiently, and that was really one of the biggest takeaways from this work.
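For readers unfamiliar with the term, a Pareto frontier here is the set of diagnostic strategies that no other strategy beats on both cost and accuracy at once. The Python sketch below shows one way to compute it. The strategy names and numbers are made up for illustration and are not results from the study, and `pareto_frontier` is a hypothetical helper name.

```python
def pareto_frontier(strategies):
    """Return the names of strategies not dominated on (cost, accuracy).

    A strategy is dominated if some other strategy costs no more and is at
    least as accurate, and is strictly better on at least one of the two.
    """
    frontier = []
    for name, cost, accuracy in strategies:
        dominated = any(
            other_cost <= cost
            and other_acc >= accuracy
            and (other_cost < cost or other_acc > accuracy)
            for _, other_cost, other_acc in strategies
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Made-up (name, cost in dollars, diagnostic accuracy) triples.
strategies = [
    ("order every test", 8000.0, 0.90),
    ("cheap tests first", 2000.0, 0.85),
    ("jump to advanced tests", 4000.0, 0.80),
    ("ask questions only", 300.0, 0.40),
]

print(pareto_frontier(strategies))
```

With these invented numbers, "jump to advanced tests" drops off the frontier because "cheap tests first" is both cheaper and more accurate, which mirrors the kind of trade-off described in the conversation.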
Just to make it more concrete for our listeners, can you kind of set up one of these cases as though it were an episode of House, dare I say, and then what the human doctors did and what the AI agents did, and then how you compare that performance?
Let's just say it was someone that had easy bleeding that was unexpected. They were brushing their teeth and they started bleeding, and it was kind of unusual, and they noticed that they were getting a lot of bruising, and there's just a certain battery of tests. I think that was pretty comparable on both sides in terms of what they ordered. But continuing on...
Both sides being what the AI ordered and what the human doctors ordered?
Human and AI, pretty much, right. So for the first few steps, I think there was a lot of similarity, which is expected. Where we started to see early divergence was because of that agent setup. Humans did kind of jump to more advanced tests more quickly, more expensive tests, and that was interesting because the models were able to kind of get to the next step with a battery of less expensive tests. So we thought that was kind of interesting, starting to see some divergence. And then, to be fair to the humans, they're still kind of handcuffed. In other words, they're just getting text feedback as they're interacting with the system, whereas when I'm with a patient, I'm seeing them, I'm able to kind of take some cues, I'm able to examine them. So there were some limitations there. But nevertheless, once it got to the stage where you had a differential diagnosis, so a list of likely things, more often than not the model was ranking them in a much more data-driven order that ultimately led to the correct diagnosis much more quickly. Whereas, you know, as you would with humans, with these limitations, you're kind of going down some rabbit holes, you're maybe not ordering them in the best order, and so you're kind of going down other paths that end up increasing the time or expense or potentially leading to the wrong diagnosis.
After the break: how the multi-agent system, the diagnostic orchestrator, actually works. Stay with us.
I put the study through ChatGPT, and it described the diagnostic orchestrator as like a virtual team of five doctors, each with a different role. One lists possible illnesses, one chooses the best tests, one plays devil's advocate, one watches the budget, and one checks the quality of everything. The team talks it out step by step, then decides what to do next. Is that a fair summary?
That's exactly right. And you can have infinite numbers of those agents. I think these five were just kind of scratching the surface of what's possible. I will say just quickly that I was incredibly happy to see that the curmudgeon agent, we called it, or the devil's advocate agent, was helpful, because you get into these groupthink situations, and it's kind of fun to watch a model argue with other models about some of the decisions being made and questioning the steps. So where the models fall short today is outside of the text domain. And what I mean by that is models are incredibly good at understanding medical concepts as they're communicated in text form, but when you get into the images and genomics and waveforms and all the other types of ways that we take care of our patients, the models are vastly underperforming humans. A good example of that is if I needed to look at a chest X-ray in one of these diagnostic steps and the model had to interpret the chest X-ray, so it couldn't read the report, it actually had to look at the image, it would fall short and fail nine times out of ten. So we know that that's a significant gap. But on the other hand, in most healthcare, right, eighty percent of physician or patient interactions with their healthcare systems involve some kind of other information, like an ECG or a biopsy path slide, right, or an MRI, for example. So I'm hoping to see agents that have those competencies included in the mix, so we can start to really get to a place where the diagnostic environment meets how we're testing the systems.
There was a study last year which I was fascinated by, which is that AI diagnosis in this study was better than human plus AI. In other words, you would assume, or you would hope, that a doctor using AI would be better than just an AI diagnosis alone. But in fact, the human plus AI model was worse than the pure AI model. And one of the conclusions from this was maybe that the doctors didn't want to listen to what AI was telling them. But I mean, did you see that study, and did it give you pause?
For more than a decade we've been kind of dealing with this unexpected result. This goes, again, all the way back to the earliest days of applying at least some of the powerful deep learning systems in healthcare. We have consistently seen that, in other words, in whatever setup, the AI, if you just leave it alone, typically does better than the human plus the AI or the human alone. Now, is that an indictment of the human ability? Or is that more of a, have we set this up in a way that doesn't favor the real world, or have we not figured out the ideal human-computer interaction, or what tasks we should be offloading to the system versus the tasks that we should be collaborating with the system on? I think that's really where the exploration is that I'm interested in, because I still hold out hope, with sort of some sense of self-preservation, that there is a future where the two are better. It's just how to offload what job, and in what sort of system that ultimately becomes. Maybe it's five agents, maybe it's ten, maybe it's a thousand. You know, we don't know the answer yet. We're just barely scratching the surface. But in three years' time, I expect this to be fairly common, that clinicians of all types will be working alongside and even consulting with some of these systems for the care of their patients.
And what is the adoption rate today? I mean, what would need to happen for this, you know, paper that you've written and the system that you developed to be widely deployed in US or global healthcare?
In a very practical sense, there is a lot of regulation around this, and regulation requires very rigorous study and evidence and real world deployment, all the things that you would expect, right, if your care team is using some of these things to take care of you and your health problems. Generating that evidence, working with policy makers, trying to figure out exactly what evidence would get to the point where we can say definitively, this is at standard of care or beyond, and it should be used, and here's how you use it. Those are very mechanical, but they're very important. It may also require a change in how we approach the regulation of medical software, because these kinds of systems are challenging the traditional software that we have used for decades in healthcare. Right, they're very different. They're non-deterministic, they have moments of brilliance and moments of, you know, stupidity, I should say. Right, you've seen these kinds of things. And so how do we actually design a system where it's safe, effective, and actually improving outcomes? That's ultimately the evidence we have to generate.
Yeah, I mean, beyond stupid mistakes, how do you see the risks here? I mean, we're seeing this research around, you know, the problems of cognitive offloading with AI, some suggestions that if you use AI too much you become dumber and deskill yourself. I mean, is there a risk of deskilling doctors? Like, what are some of the maybe intangible but nonetheless medium-term risks that we should be considering here?
What they refer to as skill atrophy is real, and we've seen this in various other disciplines too. I think it will also require a shift in how we think about and perform our knowledge work jobs. And one way this has been sort of looked at is via the idea of metacognition. So rather than you having to be the central source of decision making, are there things that you can manage? So imagine you managing a team of these agents. You have a goal, but you're offloading some of the cognitive tasks to those agents. Those are some of the early discussions around it. But I fundamentally believe that everyone that's in a knowledge work industry or role will have to rethink how that role evolves in the future. And this is kind of that first step, at least for us in the healthcare space, which is: do you need to memorize all these facts, or do you just need to be able to have the right judgment and know where the models are good and not good, and be able to fill those gaps and manage it like you would a team? Like any manager would know, Management 101: this person's good at this, they have some weaknesses here, right, I'm going to assign this task to them versus this. You know, that's that metacognition world that I think we're rapidly heading towards, and healthcare is going to have to figure out a way to do that as well.
Yeah, I mean, the other question around deployment is conflict of interest. Right, so the previous research I've seen is all around AI versus human doctors. But this element you've added is cost as well. It's not just outperforming, it's outperforming at lower cost in terms of tests. That's a really interesting element, but it adds the potential for major conflicts of interest on both sides.
Right.
So for example, I'm British, I grew up with the NHS, and one of the consistent themes of the NHS was death panels. Are there bureaucrats deciding when people should die? What is the appropriate level of care to give to people to prevent them from dying, given that it's a drain on a public budget? That's in the UK. Here in the US we have this for-profit healthcare model, where there is an incentive such that, if you're insured, you sometimes worry that your physician or healthcare system is pushing you through numerous medical procedures because ultimately it's a profit center and you may not actually need them. So how do you begin to grapple with those problems when you think about a system like this?
These are problems that have existed, as you know, even before AI, and I think that the responsibility for those of us kind of generating the evidence and the capabilities, and kind of displaying the rationale behind how these things work or don't work and where they work, is a different conversation entirely from both the economics and the cultural, societal aspects of just how we deliver care. I think it wouldn't be a controversial statement to say that, at least from the US perspective, our healthcare system is not ideal. Right, and that's true whether you're in a capitated system, a fee-for-service system, or a government-based system like the VA. I have to hope that the better angels prevail here. But I agree with you and share your concerns that in the wrong hands, or with the various misalignments that happen in these systems at all different levels, we could end up causing some disruption in a way that we aren't hoping to see in the end.
There was an interesting blog post that you wrote around cancer care, and it really struck me because, I mean, I don't want to put words into your mouth, but as I read it, if you are one of the lucky few who gets to go to one of the great cancer centers like MD Anderson when you're sick with cancer, and have access to these cross-field panels of experts who, as you mentioned earlier, are sub- or sub-sub- or even sub-sub-subspecialized, you have a measurably better outcome, whereas in fact most people in the US, and certainly almost all people, practically speaking, globally, don't have access to these cancer centers. Talk about that, and about this idea of the multi-agentic AI and how it sort of reflects or refracts what we've been talking about with the diagnosis piece.
Yeah, this was a very important... so, yeah, thanks for bringing this up. I think a lot of people don't know some of the inner workings of healthcare, where some of the really big bottlenecks are in terms of getting the best possible outcome, and one of them is in cancer care, as you're pointing out, where some of the leading centers, particularly in larger cities, have the ability to bring specialists from all different disciplines together to discuss the patient's care, and that's called the tumor board, or multidisciplinary tumor board. The reason that not everyone can do that is not just because they maybe don't have that specialist in house, but also because of the massive amount of prep time it takes to gather all the information. It's not just the patient's data that you have to gather. You have to gather what clinical trials are new and available, and is this patient eligible for them, what does the latest literature say, and someone has to go through all that information, prepare that, and then present it to a group. And what we found from ASCO, which is the large society in cancer care in the US, was that it takes between two and a half and three and a half hours of preparation time per patient, and some centers run thousands of these tumor boards a year, and those are the ones that have the most resources and certainly the most access. The idea of AI, fundamentally, for me, and the reason I'm in this field, is that I want to democratize that experience for everyone, increase the access. So no matter where you live or no matter what you do for a living, you should have that same level of precision when it comes to your healthcare. And so this was that first step to that. I do think this is going to continue to evolve. Back to our conversation around managing a team of experts: as your primary physician, could I call on a team of expert agents to help walk through some of the things that we might not be considering in the fifteen minutes we have together once every six months, or whatever that looks like? I'm very hopeful that, given the right circumstances and the way the technology is progressing, we're going to get to a place, I think, in a perfect world at least, where the access for every patient is equivalent to those who may have access to the best resources.
I mean, you mentioned that twenty percent of all AI searches, or up to twenty percent of AI searches, are around medicine, which is fascinating. I didn't know that. But there are of course other people who don't want AI in the healthcare setting, or who are worried about their human doctor or their primary care physician being replaced by an unfeeling machine. What do you say to them?
It's interesting. I think going all the way back to the earliest days of search, that stat was still about the same. Up to twenty percent of Internet searches were healthcare related. And we're seeing two interesting trends. One, from The Economist, showed that of the searches that are going on today in the typical search engines, the one that's going down the fastest is healthcare. Isn't that interesting? Because where are people going, then? Well, they're probably going to the models. So I actually push back on that. I think that most people want to be educated about their medical condition, and they want to feel safe and free to ask questions about their own healthcare in an essentially infinitely patient, knowledgeable sort of oracle environment. And again, we're not there yet, so I don't want to make that claim. But even me, I put my data into these models and ask questions about it, and I walk away sometimes learning something, or at least knowing what I should be asking my physicians. So again, would I rather do that than any healthcare? Not me personally. I do want to have that relationship with my physicians, but I also want to walk in much more knowledgeable, so I feel like we're on a peer level when we're speaking about my care decisions.
Matt, thank you.
Thank you so much.
That's it for this week for Tech Stuff. I'm Karah Preiss.
And I'm Oz Woloshyn. This episode was produced by Eliza Dennis, Tyler Hill and Melissa Slaughter. It was executive produced by me, Karah Preiss, and Kate Osborne for Kaleidoscope and Katrina Norvell for iHeart Podcasts. The engineer is Behit Fraser, and Jack Insley mixed this episode. Kyle Murdoch wrote our theme song. Please do rate, review and reach out to us at techstuffpodcast@gmail.com. We love hearing from you.
