What Values are in AI? A Conversation with Dr. Zak Kohane

Dec 17, 2025 · 1 hr 18 min · Ep. 37

Episode description

For Dr. Zak Kohane, this year’s advances in AI weren’t abstract. They were personal, practical, and deeply tied to care. After decades studying clinical data and diagnostic uncertainty, he finds himself building his own EHR, reviewing his child’s imaging with AI, and re-thinking the balance between incidental and missed findings. Across each story is the same insight: clinicians and machines make mistakes for different reasons — and understanding those differences is essential for safe deployment.

In this episode, Zak also highlights where AI is spreading fastest, and why: reimbursement. While dermatology and radiology aren’t broadly using AI for interpretation, revenue-cycle optimization is advancing rapidly. Meanwhile, ambient documentation has exploded — not because it increases accuracy or throughput, but because it improves clinician satisfaction in strained systems.

Yet the most profound theme, he argues, is values. Models already show implicit preferences: some conservative, some aggressive. And unlike human clinicians, no regulatory framework examines how those preferences form. Zak calls for a new form of oversight that centers patients, recognizes bias, and bridges clinical expertise with technical transparency.

Transcript

And so there's a mission here that the Human Values Project has: A, is to document the, you know, both the human and the AI side of it and the differences. What are the things that inform, and what can we do actually to influence or align these models to be more like what we want 'em to be in the right situation.

What has also been interesting to me is that there is no authority we can appeal to. In some sense it calls for a super, super powered, uh, Consumer Reports for health, but that of course does not exist. So, it's an open question with the Human Values Project. Once we do the work, who's going to actually be the advocate for good behavior, for decision making that as patients and doctors and as a society we would, could stand behind? Hi, and welcome to another episode of NEJM AI Grand Rounds.

We are delighted to bring you our episode today with Zak Kohane. This is our annual installment of our year in review, as many of our listeners will know, Zak is our editor-in-chief at NEJM AI. He's also the Department Chair of the Department of Biomedical Informatics at Harvard Medical School. And as with all of our conversations with Zak, if you know him, you're guaranteed to have sort of a fun and funny interaction with him. He was funny and as always, he had some curve balls for me and Andy.

He sort of threw some questions at us as opposed to letting us just ask him questions on the interview. And we discussed everything from AI for peer review to, uh, some of the big papers that we published and that we saw published in other journals over the last year. So, Andy, I thought this was a lot of fun, as it always is with Zak. I mean, whenever this lands on my calendar every year, I look forward to it so much. It's always a highlight. True to form.

Even though Zak was on West Coast time and we recorded this at noon, he poured himself a glass of scotch to make sure that the, to make sure the ambiance, as Zak would say in his French Swiss accent, uh, was set correctly. Um, again, always a lot of fun. I think these tend to be some of our more popular episodes. What I really liked from this episode was the preview of an experiment that you guys had done at NEJM AI of getting AI to review papers. And those reviews have since come out.

I think they've been favorably received. And I, again, I think kudos to you and Zak for pushing the envelope on how to update what I think was a process that was started in the Middle Ages for peer review, or 1600s, or long, long ago. I think you guys have done an admirable job at updating that. And the discussion around that was really, really fun for me. The NEJM AI Grand Rounds podcast is brought to you by Microsoft, Viz.ai, Lyric, and Elevance Health. We thank them for their support.

And with that, we bring our conversation with Zak Kohane on NEJM AI Grand Rounds. Alright Zak, so I think we've asked you the question that we ask everyone when we get started. But maybe we can focus on 2025 and we're gonna dig into the papers that we published that we liked, that were some of the themes that we saw both in NEJM AI and in the general literature. But maybe before that, uh, could you tell us about the updates to the training procedure for Zak Kohane's neural network in 2025?

What data and experiences related to AI have most informed the updates to your outlook over the past year? Well, a surprisingly large amount. I would've guessed not that much, but in fact a lot.

So, for one thing, although I've been playing with AI augmented programming for a couple years now, this year it really came into its own and it's become so good using, uh, Codex or Cursor that when I want something to get done, it's actually pretty close to easier for me to actually code it than to find an app that I can coerce into working my way. So, for example, I built a model trained on my Gmail to answer my emails. I don't use it to answer my emails.

I use it to suggest how I should answer my emails. So, I have a few more things I can think of while it's thinking. I have a to-do list and that I host on a private website. And it's the to-do list I always wanted and nothing worked quite that way.

And I wrote a personal electronic health record that takes all the scans of all my medical information that does not make it into an electronic health record, and took all the SMART on FHIR extract through Apple Health, and created an integrated electronic health record that has labs, procedures, visits, and now is fully available for critique and, uh, review by an AI. These are things that would've taken a year perhaps for a team. It took me, the longest one was the EHR.

It was maybe six hours of work and that's telling me something about the future of interpretation of health care data and how fast it is possible to move in the space. So that's one bit of update. Can I ask you a question about that, Zak? So. Yeah. Vibe coding has been the term of the year and these startups that enable vibe coding like Cursor is now raising at like a $30 billion valuation on $1 billion annualized revenue. Do you see that impacting medicine?

I think you have always been the exception when it comes to, uh, like hacker ethos in medicine. Do you see a potential for that to change though now that you can essentially no-code your way to something sophisticated like you just described? Well, it's not quite no-code. I was thinking about, you know, vibe coding. So, here's the things that made that EHR work. I knew about LOINC codes. I knew that it had to bring together similar lab types.

I knew that it was confusing procedures with imaging studies. So, there's a bunch of domain knowledge that I had which made the vibe coding work. It's not that much knowledge, but I think there is a bar to entry that you actually have to understand how things are constructed and how they work in order to be able to effectively vibe code. So, I don't think it changes things revolutionarily on the perhaps patient side, but I do think it creates a lower barrier of entry of new players.

But I'll counter that with the other part of my, uh, learnings of my neural network, which is here we are in 2025. That would be seven years, Andy, after you and I wrote that editorial about the retinal imaging study. And still, by and large, dermatology, radiology, and ophthalmology are not using these models at scale to interpret medical data. And what does that mean?

We could go long and I'm not going to go long, but I'll contrast that with what was predictable, and I had predicted, among others, which is that the revenue cycle part of it has really expanded, and the use of, in plain English, AI to upcode, to maximally code clinical encounters, and to then adjudicate reimbursement has definitely increased and has been successful from the perspective of purchasers. Astonishingly, in ways that I would love for you and Raj to speculate about, the providers are actually outpacing the payers in successful use of AI to change their side of the reimbursement.

Quick guesses why that might be? Raj, do you have any guesses? My immediate response is that if you're asking specifically about the sort of reimbursement side, I think that's kind of a separate issue. But let's say just more generally first use of AI, right? Use of AI in clinical encounters by providers versus use of AI tools sanctioned by institutions or sanctioned by insurers or payers. Right? And I think there, I absolutely agree.

I think if you go to the Harvard hospitals, right, and you walk around the floors. And you go to the sort of resident's room where they're all working and typing notes and talking to each other. There's a good chance you'll see ChatGPT open, or one of the other AI models, or OpenEvidence, or one of the sort of cousins of these either general large language models or medical specific models. I think the reason is they're, they're convenient and they're adding value, right?

They are convenient, they're easy to use, they're free, they're seen by the providers as helping them find the information that they need to find more painlessly than their existing workflows to help their patients. And so I think the tools exist. I think they're being used at sort of massive scale.

And what's fascinating to me, we've talked about this a lot, Zak, is that most of the use of AI that's actually changing care is not through the sort of top down, sanctioned by the institution, even vetted by the institution. It's happening on their iPhones, their laptops, outside of the sort of official blessing of the health care institution itself. That's my first reaction. The tools exist. I think doctors want to use them. Patients want to use them.

They don't have enough time with their physicians. All those reasons create sort of massive demand and appetite for extra expertise that is made very conveniently available through these models. I think the reimbursement side of it is, like, sort of entirely separate optimization function and set of considerations. I'm gonna defer that one to Andy. Uh, if, uh, Andy wants a volunteer to answer there. Oh, I was, I was just gonna say I agree with that.

There's always been this impedance mismatch in health care about who buys the software and who uses it. And normally at the organization level, the CTO or CMIO or whoever is, is buying Epic or buying enterprise scale software, but they never end up using it. And so AI is now like enabling direct to customer B2C versus B2B to providers.

And clearly the providers are getting value in a way that might not show up in a reimbursement bottom line, and they're willing to pay the 20 bucks a month or something to get that value in their practice. That's right. And, but what's really quite remarkable is in our training as doctors, as members of the medical community, we have to take all this training about ethics, about conflict of interest, about safety.

And then, when it becomes convenient as it is to use these tools, almost a necessity, then all these tools that, that Raj mentioned, ChatGPT, OpenEvidence, all these things are being used. Let's be clear, without direct approval of any CIO at the hospitals, maybe they've agreed to let open the ports, but that's about it there. And there's no contractual relationships. And what's the business model here in addition to the 20 bucks being paid a month? It's what?

It's a really amazing thing that you really have to be a deep thinker to think of, uh, what's it called? Advertisement. And advertisement is driving a company called OpenEvidence, $6 billion valuation. Definitely providing a value to providers. There's no contracts with the hospitals about it. And yet, for the stuff that hospitals might feel responsible, could conceivably think of themselves responsible for, like the quality of radiology, the quality of the dermatology reads.

Are there any contracts happening there? No. And where are they investing? Actually, in addition to the revenue cycle part, the big surprise for me has been how the hospitals have been willing to pay very large amounts of money for ambient dictation. And this is something that they don't get reimbursed for by the insurance companies. So, it's really out of the pocket of the hospitals that they've actually decided there. And I'm not exactly sure what that means,

'cause on the face of it—. Isn't, isn't that just upcoding by a different name though? Doesn't that just let them capture and transmute an interaction into more billable artifacts? Is that not the straightforward first order reading? Could be that. That's actually, uh, I hadn't thought of that.

My first order was, for me, surprisingly naive, maybe they actually care about the wellbeing of their doctors and the, because it's not clear from any of the studies that, including the ones that we've published, that it increases productivity, um, number of patients you can see, or accuracy. But it does increase satisfaction. People feel good about it. But you know what? You're probably right.

So, there's one more thing that, um, has influenced my neural network and I immediately copped to recency bias, and it just happened last week. One of my children is a college kid and he was complaining of back pain. And it was pretty bad. And he's usually not a whiner. And so, I was trying to give him some help. And then he described it as being sort of in a point in his back and I said, well, maybe it's a kidney stone. I mean, I've, I've had a kidney stone.

But he said it got better when he lay down, which is what you get for back pain. And I asked ChatGPT, do kidney stones get better when you lie on your back? I didn't think so, but, and then it said no. Um, 30 minutes after I told him that he started peeing blood. And so he, I sent him to the emergency room. He got a CT scan and they found a kidney stone. Great. I really, uh, applaud, uh, the doctors there. And then the doctor said something which I also applaud.

They looked back at a CT he had had for another reason, the pre, the prior year, and the doctor saw on that prior CT of the abdomen for another reason. The kidney stone was there a year ago. And so on the one hand, I really applaud doctors who look back at previous imaging. Not all doctors do that. That's already, I think one sigma, uh, better doctor. But it made me think, as you know, I've been very concerned about incidental findings.

On the other hand, maybe I should be more, um, concerned about missed findings. Mm-hmm. And I was talking to a friend of mine who's the head of research at a large HMO here in the Bay Area. And she was telling me, based on this history, from now on, whenever she gets an imaging study, she's gonna ask, actually ask for it as an image and run it through a large language model. Say what things do you notice?

So. Zak, so it's, first of all, I'm glad your, your son's okay and is getting, you got that attended to. I have to ask about this because you've been so vocal. Not only appropriately, but it's been a necessity to have your voice in the field around the risk of false positives with incidental findings. And I think you first made this argument, although it applies everywhere within medicine, you first made this argument very cogently and loudly in the context of genomics, right?

And so, I think this is maybe either 2006 or 2007, you published this piece, which you, I think, are alluding to here. Most of our, many of our listeners will know about it called "The incidentalome," where you pointed out that even if genomic tests have nominally good sensitivity and specificity, if you run a lot of them, you're gonna get a lot of false positives. And we have a genome.

And so, there's a lot of measurements and you can get a lot of false positives even with nominally good performance characteristics. And so, AI models, right? Scans in patients' hands. So, many things in my head are like, oh, this is great. We want patient empowerment. We want patient-controlled data, activated patients.
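[Editor's note: a minimal sketch of the incidentalome arithmetic described above. It is not from the conversation or the paper. It assumes every test is independent and shares a single specificity, which are simplifying assumptions, and the 99% specificity and the test counts are illustrative numbers only. Under those assumptions, the chance that a healthy person gets at least one false positive across n tests is 1 - specificity^n.]

```python
def prob_any_false_positive(specificity: float, n_tests: int) -> float:
    """Chance of at least one false positive in a person with no disease.

    Assumes n_tests independent tests that all share the same specificity
    (a simplification chosen for illustration).
    """
    return 1.0 - specificity ** n_tests


def expected_false_positives(specificity: float, n_tests: int) -> float:
    """Expected number of false positives under the same assumptions."""
    return n_tests * (1.0 - specificity)


if __name__ == "__main__":
    # Illustrative values only: 99% specificity, increasing panel sizes.
    for n in (1, 10, 100, 1000):
        p = prob_any_false_positive(0.99, n)
        e = expected_false_positives(0.99, n)
        print(f"{n:>5} tests: P(any false positive) = {p:.3f}, expected = {e:.1f}")
```

[The point of the sketch is only that a per-test error rate that sounds small compounds quickly once many measurements are taken on the same person.]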

I'm thinking about all these stories that we featured at our conference every year, and that, these sort of activated mothers of kids who they've really changed the trajectory of their family's lives and their children's lives by just being so aware and so, such a force. And then having data, right? Not just being sort of on the receiving end of advice from the medical system, but themselves controlling the data. But then the other part of my brain is like this sort of cognitive dissonance.

It's like, are these not wonderful tools for just incidentalome on steroids where you're gonna find lots of worrisome things? Patient-controlled models, they run it every day. They run it five different ways. They run 12 different prompts. And you're in false positive theater for these, these same scans. And to say, even with CTs for example, we know that the models are, like, very adept and fluent with text, but we know they tend to be a little bit worse.

And I don't know if this is gonna be true in a year or two years from now, but currently they tend to be a little bit worse at image interpretation. So, how do you reconcile these sort of two, these two competing forces? Well, first of all, I think it's important to be humble always where possible. And it just shows you how when bad events hit, you start looking at the other side of the coin. Right. Right. The false negative part of it.

And I'm about to argue why this is not the case. But I will immediately wanna recognize, and that's why, for example, you, it's very persuasive when someone talks about, for example, oh, I did this genetic test, and it found out something that I wouldn't have found out otherwise. But you know, every time you do that, you forget about all the times where you found something that was not true, and all sorts of potentially very dangerous things were done to that patient. Right.

And you're absolutely right, with AI, there are so many ways to slice and dice those data to create a combinatorial number of false positives that, uh, we should be worried. In this instance, so, first of all, let me cop to that, but second of all, I just want to point out that in this case, the prior is actually higher. Yeah. What do I mean by that? When you have a healthy person, you're doing a whole genome scan of them.

It's not that there's no complaint and therefore a lot of rare things are by definition rare. But if you actually have a complaint that could be causally related, plausible. And that led to a CT scan being performed, right? That's right. It's sort of—. Then you're in a different set of priors. Right. And in this case, so again, the retrospectoscope is one of the most effective tools in medicine, but it was an abdominal complaint.

And it may have been that his abdominal complaint a year ago was related to the kidney stone. Yeah. But regardless, the fact that there was something in that region that caused a CT scan, I think raises the priors to the point where a prior-aware AI could reasonably say, hey, have you thought about this finding? Yeah, no, that, that makes a ton of sense.
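[Editor's note: a minimal sketch of why the prior matters in the argument above, using Bayes' rule for positive predictive value. It is not from the conversation. The sensitivity, specificity, and both prior probabilities are made-up illustrative numbers, not measurements from any study or model.]

```python
def positive_predictive_value(sensitivity: float, specificity: float, prior: float) -> float:
    """P(condition present | finding flagged), by Bayes' rule.

    prior is the probability the condition is present before the test.
    """
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical reader with 90% sensitivity and 95% specificity.
    # Screening a healthy person with no complaint (assumed prior 0.1%)
    # versus a patient whose symptoms prompted the scan (assumed prior 20%).
    for label, prior in (("no complaint", 0.001), ("symptomatic", 0.20)):
        ppv = positive_predictive_value(0.90, 0.95, prior)
        print(f"{label:>12}: prior = {prior:.3f}, PPV = {ppv:.3f}")
```

[With the same hypothetical reader, a flagged finding is far more likely to be real when the scan was ordered for a related complaint than when it was run on someone with no symptoms, which is the distinction being drawn between this case and genome-wide screening.]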

So, I think just at risk of sort of overdoing the analogy, the reason that this is not the sort of genomic incidentalome to you is that this is not sort of imaging being performed on everyone for just, you know, to inform you about your health generally. At the age of 20 or 25, this is a — Exactly. — sort of targeted, enriched case where the prior probability of something going on is much higher, unlike the way in which I think genomics has often been positioned for clinical benefit or pub, like, public health benefit.

That makes total sense. It also occurs to me there's another sort of cognitive dissonance. We talked about it last year, so we don't have to, I think, redo this whole conversation, but I'm going to just maybe draw a comparison for you to either reject or support. You wrote this piece a few years ago, on how you took care of your mom remotely by using these sort of wearable measurements, so, or digital scales, right?

So, digital health biomarkers that you were able to monitor remotely and then intervene when there was a sort of aberrant trend for her weight that led you to help titrate her dose of, I think, Lasix, right, for managing heart failure. And it's so clear that that makes sense. It's also an enriched population. It's also sort of a very specific use case, and that's a contrast. You imagine even piping that into an AI tool, it probably would be pretty helpful.

That's even, you know, the sort of big contrast with the general use of wearables in sort of young, healthy populations, which are measuring themselves and making sort of a wellness argument that they're gonna prevent disease by tracking. And we have limited RCT data at best that this does anything, right. For sustained health improvement. So. That's a great, it's a great business model though. It's a great business model.

But do, do you think it's sort of, there's a similar sort of analogy here, which is the use of AI, it's very focused, pointed at the specific clinical scenario to avoid the bad things around both false positives and false negatives. And in fact, it's the central challenge of pediatrics. 'cause in pediatrics, most of our patients are healthy. Right. And so, your main job as a pediatrician is to reassure parents. Yeah.

This thing that they're seeing is actually not death defying. Unfortunately, at a low level, there are these death defying things that are happening. And so, I think it's the hardest being, I never was a general pediatrician, except briefly as a resident, it's the hardest job. 'Cause you have to be on the one hand reassuring, cutting out the false positives and still be on the lookout for these crazy, horrible, rare events.

Yeah. And it's, it's extra hard 'cause you're talking to their loved ones most of the time. Right? You're talking to the parents who, they're bringing their own sort of set of goals and worries and concerns and whatnot. I remember, it's just taking me back to my first kid and worrying about every abnormal ounce of going up or down or, anyway, it's then you get better over time. Okay. Andy, is it okay time to sort of move into the, to the main work here?

So, we've done this for the past couple years, Zak, and we want to, we've really enjoyed sort of reminiscing, reflecting on the past year of work that we have been proud to publish at NEJM AI. And then also thinking about papers that sort of, we wish we published at NEJM AI that we wish were submitted to us and that we would've really enjoyed seeing.

And then in the same sort of vein, thinking about the trends that we've spotted over the past year and using that as a prediction for what we might anticipate is gonna be a big theme for work in 2026. So, maybe we could just start with NEJM AI. And so, you are our fearless leader, our editor-in-chief.

I think most folks know this, but it's worth saying that as editors like me and Andy, we manage the papers, we handle them through review, but you make the ultimate decision about whether a paper is ultimately accepted in and published in NEJM AI. And you see everything right on the way in, and I think also before it ultimately gets accepted.

And so, maybe you could start with just maybe a few examples of papers that we've published or themes in some of the work that we've published in 2025 that's really captured your interest or, or your attention over the past year. Well, Raj, um, as you know, I like to be listening to you and to Andy more than I'd like to listen to myself, despite scurrilous comments. Well, we, we also like to, we also like to listen to you more than we like to listen to ourselves.

So. You, Zak, you have a voice and a face made for radio. We like to hear you. Is it? Here we go. Oh. Just to be typical, as they say in Dungeons & Dragons, chaotic good. Raj, I'm gonna talk about two papers that will be published in 2025. There's the same one that would've published. Yeah, go ahead. I love it. And, but we haven't quite published yet.

So, I was at a conference about half a year ago where there were editors-in-chief of, let's see, Lancet Digital Health, Nature Medicine, JAMA AI, AI. And we were talking about, when is there, when are there gonna be, someone asked, when are there gonna be AI reviews? And just 'cause I don't like being pious, I said, within the next year, and or within the next two years, I think I said, so Raj, can you tell us what we did about this and what happened? Uh, absolutely. Yeah, absolutely.

So, the first thing we did is we had a conversation that was really, I think, spirited and wonderful amongst our entire editorial board. And so, we had an editorial, like the larger editorial board meeting in the summer, shortly after that conference. And Zak, you presented a set of different approaches to carefully but near-term implementing AI in our peer review processes or in our other workflows.

And I think it surprised us both that the option that drew consensus from the editorial board was to use AI to fast track or accelerate the decision to publish for manuscripts directly in the peer review process. What the feedback we also got was that it's very critical to have extensive human involvement.

And we wanted to really be, I think, thoughtful and cautious about the first few papers that we did this with and how we developed our process so that we really were publishing things that were very high quality, that were up to our standards and that we would be proud of while also using AI to speed up peer review. And so, I think we're proud to say, and you know, it's, they haven't been published right now at this very moment while we're recording the conversation.

But I think when this comes out in a few weeks both papers will have been published. We had two randomized controlled trials submitted to us that were the first, and I think are very substantial, well conducted RCTs to evaluate ambient AI dictation software in real clinical workflows.

And so, they evaluate three different commercially available products that are being widely used in the United States from different manufacturers and studied professional fulfillment, documentation time, work outside of work, burnout, and a few other metrics through this mix of both primary outcomes and secondary measures as well. We're gonna publish both of these RCTs very soon.

We're also publishing an editorial that lays out both our rationale and our process for running these fast track AI integrated peer review processes. And just to spell out and sort of short form what we do, we first have one of our deputy editors or associate editors write a review independent of AI output, uh, full complete review of the original article, and then the same editor feeds the paper into, with the author's permission.

So, first of all, we secure the author's permission to have AI participate in the peer review process and with the author's permission, we then use two leading large language models to generate reviews that are then used alongside the editor written review. They're presented at our weekly editorial meetings, debated vigorously amongst the editors. And then we promised the authors a decision within seven days from submission. And so, both RCTs in ambient AI went through that process.

Both were accepted conditional on meritorious revisions, and both teams of authors responded to both the human and the AI reviews point by point. In the supplement of our editorial, we include the responses to the human and AI reviews, both the human and AI reviews, as well. We really encourage listeners and readers of the articles to judge for themselves what the quality is of the AI and the human reviews. You have sort of a direct contrast there.

And then our statistical editor also applied his own process, which was much more iterative, where he engaged back and forth, had a whole conversation with GPT-5 to generate a statistical review. And we included the full transcript of that conversation, which was honestly, I think, very eye-opening just to see a window into his thought process and this kind of iterative model as opposed to sort of one and done of putting something in and getting one output from one of these AI models.

And then there's also a really nice reflection from Karandeep Singh as a content editorial about these two papers. So, that's what we did. These are the first two RCTs. Oh, remarkable. And it was seven days. From seven days submission, seven days from submission to decision. And I recommend everybody to look at it and to compare the humble Manrai with thinking compared to GPT with thinking. Yep. So, that's that. Yes. So, I think that's important and it, the implications of this, I think will also be beyond that.

Now, because we were just informed by right before you got on Raj, that, uh, Andy had been out, out for his company, 10 days on the road. You know, I haven't seen him for a while, so I, we better make sure he, what? Ask him a question. Which paper did you like? Yeah, so I've been living in a different world for the last year. I'll give you a couple things that I saw in NEJM AI that I thought were indicative of larger, interesting macro trends.

And then I also wanted to mention some stuff that happened in the larger AI world that has influenced medicine through Raj's work and through other people's work. Lemme give you my quick snapshot. So, former guest of the podcast, Anil Palepu, uh, one of my former PhD students, I think published a case report with you guys, uh, at NEJM AI on using large language models for breast cancer care. And again, the team does great work. It was another, like, great example of— It's AMIE, right?

—high quality AMIE. AMIE, Google. Exactly, yeah, yeah. Of the, of the AMIE system, the dialogue system. And then there was an accompanying editorial that was published alongside that called "Humanity's next medical exam: preparing to evaluate superhuman systems."

And I think this ties into some of the larger macro trends and I wanted to discuss, like, yeah, Zak, I was reflecting on what I showed up to do with you as a postdoc 10 years ago, and it was essentially like, let's get an AI model to pass step one. That was like the North Star. People told us we were crazy. People told us that a machine could never do that and now it's like been trivialized to a degree that would be hard to anticipate 10 years ago.

And I think that, so "Humanity's next medical exam" is pointing to this other AI benchmark called "Humanity's last exam," which is a collection of very, very difficult questions that AI still can't answer. I think they get like 50% of these questions right. But what does the next phase of medical benchmarking look like? I think what NEJM AI does better than anyone else is benchmarking through clinical trials.

So, using the traditional evaluation levers that have been with medicine for a long time, but what outside of those are big, expensive things to do. Raj's work has shown that on hard pathology cases they're already superhuman on the CPCs from NEJM and Dr. CaBot and all that. But I'm kind of wondering what medicine is going to do to come up with sort of the next hill to climb? What is the next thing that AI has to do in medicine that would convince either one of you — Okay.

— that progress is not slowing down. By the way before we go further, I agreed with it's remarkable to me that, wow, after all these years, I'm still aligned with Andy. Um, the AMIE the oncology decision making was also on the top of my list, and the, but the reason I put it there was because it was a thoughtful, negative result.

And that, you know, the AMIE team has done so many great things, but in this case, they, there were parts of the performance that they were not satisfied with, and they published it. And I was very, you know, pleased with that because that's just a good reflection of the truth. At the same time you mentioned Dr. CaBot, so I just have to mention another paper that I liked is good because it's basically blaming Raj. The article called "Raging against the machine: the human side of the story."

And in this, this guy says, my wife will tell you that I fall asleep with enviable ease. The night before Harvard Medical Grand Rounds in September of 2023, I was a restless mess. What if I got the answer wrong? What if GPT beats me into the diagnosis? Why would I agree to this in such a public forum? The thoughts ran amok through my head. And this colleague of Raj and his other friends have been multiply mauled by Raj and his AI CPC bots. I don't know what that tells us about the future.

'Cause I have my own issues about the generalizability of these cases, but it's certainly getting a lot of attention. Yeah. So, I agree with all of that, Zak. And I think Raj's paper on Dr. CaBot has, has been, I think, provocative, if I'm correct. You were at a party with Jared Leto and Halle Berry as a result of writing this paper. That's what I heard. Do you, uh, can you give us a little...? You know, the trash talk? Something something. The trash talk. Something like that?

Yeah. Yeah. The trash talk about Jared was just really unseemly. Oh, come on, come on. It was nothing but positive. Nothing but positive. Yeah, there, there's been some interesting, interesting things that have happened after, after that came out. And I, I have to say also I think it was, there's a physician who wrote the article in the New Yorker that described the demo that, that we're talking about with Dr. CaBot.

And that physician I think did a wonderful, Dhruv Khullar, did a wonderful job of, I think getting into both simultaneously the doctor's and the patient's mind around the implications of these tools. What I have to say also is what I've learned, one of the things I've learned from this is that the medium really matters.

And showing a doctor an AI system that is not listing a diagnosis but is talking through a diagnosis and exposing its reasoning over five to 10 minutes is, like, categorically different for most physicians in the way they perceive it than even a full written version, which is already categorically different from just a list of final diagnoses.

And so, I think we are moving much more to that, sort of, sticking your neck out there for the AI systems, not getting the right multiple choice, but really letting yourself get embarrassed if you're the creator of one of these systems. And we have been embarrassed already. I should say that, too. We've been embarrassed by CaBot messing up, but we've learned a lot. Every time we're embarrassed, we learn a lot, and then we learn about

the sort of guardrails that we need to put on it. Where the edge cases are, how people will use it in ways that we don't anticipate at all. And I think there's a little bit of that build in public, that energy that I think we should be absorbing a little bit more in medicine that we're usually reluctant. You know, we think we should never do that. Like Silicon Valley's completely different. You know, they can break things fast.

But I think a little bit of the build in public energy or demoing and showing things off really lets you understand where the warts are, where the problems are, and really improve. So Dhruv, even in the article, identified problems with CaBot that have already led to, sort of, my grad student Thomas working on the next generation of it.

So, in the spirit of other articles that we liked, may not have been the best article, but because of the reception of the article, it surprised me so much and then I learned other things in that regard. I think it's a bellwether. This is an article that we published about a randomized trial of generative AI chatbot for mental health. Because of the history of mental health and privacy and how in medical informatics we've been so concerned about this.

Um, and because of a history which I don't wanna get back to of the ELIZA chatbot, I was sure that this would get a really critical flaming response and instead it was a mild response and a lot of positive response about use for mental health. And then people, friends of mine, have come up to me and said, you know, I've used, I don't wanna be giving advertisement, this large language model, and it did a better job than any of my, uh, therapists.

And so, this is not what I expected, but it tells me that there is gonna be a strong thread that will just get stronger in mental health management. That is so poor throughout the world, frankly, just because we don't have enough skilled therapists. And that the relationship, whether we like it or not, for a growing fraction of a population, not just the young ones, but certainly more the younger ones than the older ones, who are very comfortable using a chatbot as a personal therapist.

And I think, go ahead Andy. I was gonna say, like, to your point, like, what was super surprising to me was when GPT-5 came out and they sunsetted GPT-4o, many people reacted as if they had lost a loved one. That they had developed such a relationship with GPT-4o, it was like a death in the family when that, and to the point where they brought it back by popular demand. And so, like, if that doesn't tell you where this is going, then I'm not sure what does.

People are gonna be using these for therapists for intimate partner reasons, for a whole category of relationships that we would have assumed were deeply human, um, will be subsumed by, uh, the models going forward. Yeah. And in the spirit of "compared with what?", I think it tells you just how inaccessible competent health care is and how this may be categorically different in a way that a lot of people find comfortable. Andy did, did you want to highlight anything else?

Well, I was just gonna do one out of distribution paper. Yes, that's right. Thank you. I think, yes. So, uh, when we were doing this last year, um, a model called o1 had just been released by OpenAI. So, this was one of the first reasoning models that had ever been released. It's hard to believe that we're only a year into that.

And then only five or six weeks later, maybe it was actually a couple months, a Chinese group called DeepSeek released the DeepSeek-R1 paper, which essentially fully replicated all of those capabilities in open source. And so, there were lots of things that that paper catalyzed, reasoning models becoming the focus of AI for the next several years. And that's fed into your and Raj's work about how those models work in medicine but also catalyzed the open-source model community.

And now almost always there is an open-weight, open-source model that's as good as the Frontier Labs. And that's despite the fact that these, uh, researchers from China were operating under a trade embargo, uh, GPU deficit, that they were able to do this kind of outmanned and outgunned. And so, I think that really updated my understanding of how deep the gravity well is for some of the Frontier Labs.

I think it's actually shallower than I would have appreciated before, that they are still almost black hole-like in how, how concentrated their mass is. But actually it made me think that they might be slightly, um, less sticky than I would have thought before that.

Raj, one of the biggest objections in the past about these large language models for use, frankly in academia, I've heard this from deans or in health care, is because you don't know where these data are ending up in your query stream. Let's ignore the fact that the same question can be asked of search engines and, therefore, because of that concern, there was a slowdown, uh, resistance to adoption of these models.

Do you think in three years that consideration will meaningfully influence the adoption of these open-weight, open-source models or not? I think the Overton window and the sort of societal attitude and institutional attitude towards that risk is gonna be, I already feel that it's much more muted this year than it was last year. And in three years I think it's gonna be much, much more muted.

And I think the reason is it's gonna sort of follow the same trajectory as the cloud stuff in my mind, which is, uh oh, we need to have our own servers. We don't know where data's gonna go to Amazon or to Google. And then eventually, I think a mix of both sort of enterprise guarantees and, also, a little bit of self-reflection about some of the challenges of running those open-weight models. Because if you have Meta's model running locally, but your hospital's firewall is not secure.

I mean, exactly. The data's already out there. There's massive hacks already. Like, I think there's that reappraisal, it just takes a little longer. And I think, uh, some institutions, and those tend to be academic and health care related, but I, my sense is that in three years, it's not gonna be something we talk nearly as much about as we do today.

Yeah. My, my sense is the, uh, C-suite, particularly the CIO Suite, is gonna say, uh, you're offering me contractual guarantees and I don't have to worry about it, then you're gonna run it for me. And continuous uptime. And I don't need to have like a, you know, 40 staff like doing distributed inference for some Llama model locally and GPUs and all that is gonna feed into it.

That being said, I do think, uh, there's an educational role and obligation and mission that we all have through NEJM AI and through this podcast. And I think part of it is just making folks aware that there are data controls on these models. And it's amazing to me when I talk to people who are using these models that they're not aware that sometimes by default, like your data is being used to train or update the models. And it's a setting that you can change, which is I think something that we

should all be aware of and we should say more. But also, that there's simple things you can do to eliminate some of the biggest risks, like giving a snapshot of the results that you actually like, a subset of the results that you actually need a second opinion on. As opposed to, it sounds like wild that I have to say this, but as opposed to the whole thing with your medical record number, and your address, and your name, and your social security number, if it's your W2, right?

Like, you don't need, just copy it, copy the subset that you need an opinion on. You don't need to put your social security in there, right? But I think there's, uh, there's, there's things like this that I think we have an opportunity to, as, as educators, I think, to also make that message a little bit more clearly understood by, by folks who are using these models in medical contexts. Raj, I wanna give you a chance to, uh, list an article that we published that you like.

Yeah. My immediate answers were gonna be the two that we already talked about, which were the upcoming RCTs around, uh, ambient AI. And then my next answer was gonna be the, uh, randomized trial of the generative AI chatbot for mental health. So, we are wildly aligned, Zak, so far, just 'cause I think, I think you're right. It's a, I think, what was the word you used? Bellwether. Right? This is a, I think it's a signal, an indicator of where this is gonna go.

And also, the, again, the, the dissonance between what institutions are saying and how people are actually using these tools. But then I have a third one on this list, so the first two are completely aligned with you. And then the third one is, and this is also another RCT that we published. This is a randomized trial of using AI for helping interpret spirometry in primary care. So, amongst primary care doctors, sort of getting access to more specialized interpretation using AI tools.

And I thought it was an interesting study. Well conducted, showed I think modest but positive effects on the use of AI for enhancing the interpretation of spirometry measures amongst primary care doctors. I think it addresses also a very important theme, or primary care clinicians more broadly.

I think it addresses a very important theme which I suspect is gonna be a big focus for folks in the next couple of years, which is using AI as sort of a superpower for either your PCP quarterback or your clinician, your nurse, yourself, right? But essentially a way to, as a superpower to extend the expertise and the access to specialist expertise for primary care doctors.

If I can highlight a paper that we didn't publish, that's a bit of a contrast to this one that I wish we had published because I thought it was an interesting paper. It's, uh, something that was published as an original article in Lancet Gastroenterology and Hepatology. The title of this paper is "Endoscopist de-skilling risk after exposure to artificial intelligence in colonoscopy: a multi-center observational study."

And I thought this one was really interesting because they looked at the effects on the capabilities, and of course this is, you know, it's not a randomized study, right? It's an observational one.

And so, with all those caveats that are there, I think it's still sort of first-in-breed of a rigorous kind of multicenter observational study that addresses this very interesting question, which is what is the effect of the introduction of AI on the skills of our human experts like our endoscopists at doing something very important, which is in this instance, detecting adenomas.

So, they, the metric of interest here is the ADR, the Adenoma Detection Rate, pre- and post- the introduction of AI assistance into colonoscopy reads. And they found a reduction, a significant reduction. You know, they tried to adjust against everything they could adjust against, and I think at the very least, it's very compelling and a bit nerve wracking, but I think very important finding that points to the potential changes to human skill as a function of AI use.

And we need only think about the sort of GPS in our car, right? Or on our phones, or some of the other things that we now kind of take for granted. And as long as they're always available. It's okay, right? That we are sort of de-skilling in a way. But if they're not always available, or if they're brittle, or if they have problems, or if they send us astray, it is quite worrisome that our expertise can atrophy even amongst our sort of experts in a medical context.

And so, I thought this was provocative. It was interesting. It addressed a really important, interesting question around sort of human-AI collaboration over not just a short term, but the slightly longer term. And I wanna see more studies like this about sort of the skills of our humans, uh, pre- and post- use of AI. And I'm sure we're gonna have randomized versions of this going forward, too. But I would also like to have a public debate about which de-skilling is okay.

For example, um, we probably are better at mental arithmetic than people who always use calculators, for example. And you know, I grew up in Switzerland. Half of my class went to England and the rest of us went to the United States mostly. And those of us in medicine compared to each other, I remember snotty comments from my friends who went to England about how, you know, Americans didn't know how to do a physical exam, which was true. And how we over depended on imaging, which was also true.

But now in, in retrospect, that's probably the right thing to do. And you know, being able to actually look into an abdomen as opposed to inferring the miasma that you feel through your fingertips seems like the right move. Which by the way, allows me to advertise another paper that I liked, which we published, called "Fetal anomaly ultrasound scan: a randomized controlled trial," where we found using ultrasound, AI-augmented ultrasound, we can actually get pretty good performance.

And this is one of my pet peeves that we're still teaching the physical exam in 2025 without ultrasound use routinely by medical students. And so, AI-augmented ultrasound, in my view, should be used today so that if someone has a pain, you can just look in their abdomen. And so, you know, we have to decide which de-skilling is injurious to patients and health care and which is not. Anybody else want to volunteer some papers that they found they wished we had published?

Or should I mention some? Uh, I do. But as a segue to the next section, so Zak, you have written about something called the Human Values Project. Yes. Yes. And how that intersects with AI. Could you tell us what it is and what the agenda for the Human Values Project is? So, I believe we touched on this slightly in the last podcast.

I don't wanna go too deeply on it, but because of my involvement in this project, which I'll detail shortly, I've been reminded again and again how in fact values of doctors vary considerably. And Raj and I will publish, sadly, in another journal, an article where I recite the following anecdote, which is, I was in training, and I was at a cocktail party in Cambridge.

Where I guess the socioeconomic status was such that it was not surprising that a dentist was talking to a therapist about their boats, that they were going to sail to Nantucket. But the dentist was noting that the only fly in the ointment was it was gonna cost a lot of money to prep their boat, but they had scheduled enough procedures, so they thought they would be able to afford it. And I was discomforted because that was my dentist, and he had just recommended to me a very extensive procedure.

And so, you always wonder whose values are being highlighted. But to make a long story short, here's what's very apparent, and I'll illustrate it. A. Human beings vary in their decision-making priorities and preferences. But guess what? So do AI models, and we don't know it. We just don't know what the models, because these models have been aligned for some really catastrophic things like letting yourself kill yourself or telling you how to kill yourself or how to create a pathogen.

Their alignment has happened, but on detailed clinical decision making, it's not been specifically aligned, but it turns out they are consistently different. And moreover, both human beings and these AIs are systematically different from our normative models of what decision making should be. And then the question is, how do you actually do this alignment for different situations and different preferences by different populations?

To make it very vivid, one of our grad students in HST, Payal Chandak, did a study where she provided various scenarios, including one of a woman who had been admitted for anorexia: very low weight, electrolytes which were low, which were very worrisome, and all sorts of reasons to be extremely worried about this woman. And she was refusing to eat. And the question was, should you wait another day to see if she eats, or put in a feeding tube?

And not surprisingly, or perhaps surprisingly, the human audience that she asked this of was split. Some were really worried enough that they overcame concerns around autonomy, put in the feeding tube, and the others were saying, this is really an assault on autonomy. Let's wait another day. And what's fascinating is these frontier models, both open-source and commercial, differed systematically between each other and cited reasons.

And the reasons they cited, I'm being incorrect, we didn't ask them what their reason was, 'cause just like human beings, large language models, reasons for why they, uh, make decisions, it's not the same as the things that actually cause the decisions. So only by changing the scenarios can we actually figure out what are the variables that are actually influencing them.

And it turns out that the external factors, the co-factors, the contextual factors that influence AI are not the same ones as influencing the human beings. And I, again, I won't name the name, but one of the models, one of the models that you've used was much more aggressive than all the other models. Like it, no matter what scenario you gave, it was aggressive. In treatment? In treatment? What? For treatment and treatment recommendations.

Yeah. Yeah. The kind of resident that we'd say, oh yeah, he believes he'll always heal with steel, as we used to say. Mm-hmm. And, um, and, but that's just an unknown bias. And what's fascinating to me is these are concerns that are completely unaddressed by all the regulatory frameworks. And so, there's a mission here that the Human Values Project has: A. Is to document the, you know, both the human and the AI side of it and the differences.

What are the things that inform and what can we do actually to influence or align these models to be more like what we want 'em to be in the right situation. What has also been interesting to me is that there is no authority we can appeal to. I've talked to very large medical audiences. They don't believe that professional societies, medical professional societies are the right place for this to do that. They have no faith that they can actually deliver in this real regard.

The FDA is, in my opinion, overwhelmed. And the whole notion of, especially for generative, specifically for generative AI, which has such broad capabilities, and which would have to be localized to the local characteristics of multiple hospitals. Again, that seems beyond the capabilities of the FDA at the moment. In some sense, it calls for a super superpowered consumer reports for health, but that, of course, does not exist. So, it's an open question with the Human Values Project.

Once we do the work, who's going to actually be the advocate for good behavior for decision making that as patients and doctors and as a society, we would, could stand behind? Yeah, I mean, super fascinating. And it's interesting how values are kind of like an emergent property here because like you said, like almost surely, they have not been aligned to a human preference set in the tasks that you are evaluating them on. And so, these are like value interpolation exercises. Exactly.

The, the, the anchor points. And like, depending on what that looks like, it can go way left or way, way right. Again, just how the alignment was done. So, that is, like, super fascinating and I agree that there's a vacuum there for who should be the moral authority essentially for large language models. Cool. Okay. Um, so I think we're gonna move to the lightning round next. Zak, are you ready? No. One more. One more. Okay. Go for it. It's, it's, it's, it's gonna foreshadow the lightning round.

Um, okay. Here's a paper we hadn't published. It was in Nature in June 2025, "An instantaneous voice-synthesis neuroprosthesis." This is amazing because there's a bunch of inferences having this model that's looking at the different cortical paracortical activity to actually understand the modulation that has to happen in the voice. They're not using extensive other large language models, and yet it's giving fluent, nuanced speech.

And so, that gives me increasing confidence that the multiple companies, we all know about Neuralink, but it turns out there's multiple companies in this space working on neuro prosthesis, which for literally hundreds of thousands of patients are going to be life changing. And I wanna note that this is just a case I would love to pick. You know, this is not an RCT. It's a one case, but it's like the dog at the opera.

The fact that this happens at all is miraculous and we should applaud it, and we should publish it. So, if anybody has some great neuro prosthesis they want to write about that really made a difference for patients, come to us. I'll say, if you have a good brain computer interface paper, you'll never find a more receptive editor-in-chief than in Zak Kohane. So, if anyone is listening to this, Zak is always on the lookout for neuro prosthesis and BCI papers. So, that's it. Cool. Alright.

Let's do the lightning round. Lightning round. Okay. So, I think you know the rules here, so we'll just jump straight into it. I found this que–, actually, I don't think I know this one about you, which is, which is interesting, but I've also found it to be relatively revealing. Zak Kohane, what was your first job? My first job, my first real job was, um, I was inputting at, okay, well let me, let's just try three versions and we can decide after edits which one you like.

So, going backwards, my first job that actually paid an acceptable salary was a job running key punches to do data entry at the Boston city organization for payroll, where they annoyed the hell outta me by having an emulator of a card punch machine in an IBM terminal. But here's the, what was amazing about it was everybody there treated it like it was an actual real card punch. So, if they made an error, they'd have to re-punch all the cards.

Or if they had to do, they had missed one, they'd have to re-punch the whole pile. All they had to do was flip up under the keyboard and there was a bunch of instructions how to use insert. Mm-hmm. Global replace. They never did it. And when I actually mentioned it, it was like, you are an asshole. And so, so that's, that's number one. I guess that's not my first real job. The other, uh, I was a counselor, a consultant.

A consultant at Brown University terminal services around the campus where I got paid $125 an hour in 1980 to give students counseling. Uh-oh. What was my, what was the job that you thought you heard about? Oh, I, I, I had no idea. Raj, were you gonna guess? No, I didn't know. Oh, yeah. No, we didn't know. Okay, good. Neither, neither of us do. I thought you'd find some like horrible. Uh, no, no, no. Wait. Okay, now I know. Yeah. Okay. Now we're gonna go digging. There's we, yeah.

What is it, Zak? Yeah, yeah, yeah. It's like, you'll, you'll find this humorous. My answer to this question was I worked at a NASCAR warehouse. And I know that you'll, uh, enjoy. That would be, that would be on the brand you don't want to have. A brand that people would assume. Yeah. Yeah. I don't know. I'm just thinking about the Frasier and Niles episode. Uh, Andy Nascar. Yes. Yeah. Amazing. Cool. Alright, well next. See you, Raj. Raj, what was your first job?

Uh, I think the first job that paid money was, I, I can't, I mean, I don't know if this counts as a job, but it's, it's where my mind immediately went. I scanned photos for one of the professors that was my dad's colleagues for most of the summer, to digitize them and then create sort of a digital representation of his, like, family photos and memories when I was, I don't know, 11 or 12 or something like that. Did you get paid? I got paid. So, that, that's why it, it feels like a job in my head.

Yeah, it's a job. But yeah. Yeah. So that's a, that was my, that was my first job. Uh, alright, Zak, we, the next, the next lightning round question. So, we actually asked the same question to Marinka like a month or two ago when we were interviewing her. And so, I'm very, very curious, I don't know if I know the answer to this, I have a few guesses, but what is your ultimate productivity hack? My ultimate productivity hack. I know it's lying to myself in some way.

Um, creating fake deadlines. Yeah. And, uh, they're especially effective when they involve other people, right? Yes. So, the, so I think you've described this to me as like doing jujitsu on oneself, right? Sort of engineering it so that you have to get it done. Yes. Uh, so I'm extraordinarily, uh, guilt powered. So, if I can convince myself, which is all too easy, that I actually made a commitment to someone else. It's done. I have to do it. Excellent.

Cool. The next one, um, I think I have some guesses on this one, too, but I'll be interested to see what you say. What is your favorite place to visit or vacation? You are a worldly traveler if I do say so myself. So, I'm interested to see where this goes. Well, I have huge home bias and I love Geneva, Switzerland, and it's a beautiful city, some would say boring city, but that has a great, get older,

that's just a feature. Has beautiful mountains nearby that are great for skiing, for hiking, excellent cheese, great banking system. So, yeah, I love visiting Geneva. Shortly thereafter favorite places. Boy, you're, I'm having the whatchamacallit to the, uh, segfault. Yeah, segfault. Seg, yeah, the segfault, the phenomenon. Yeah. Then there's a whole bunch that follow in short order. So, I like surprisingly to my, I love the Santa Monica area.

Yeah. And it's just incredibly pleasureful, relaxed, and I find that contrary to Boston or the Bay Area, people are much less concerned about where their kids are gonna go to college, for example. So, I appreciate both the relaxation and the opulence, the food and the ocean. Nice. Alright. Zak, do you have, uh, or if you don't have, what would be your go-to metaphor for explaining how AI works to clinicians?

My go-to metaphor would be: imagine the most rigorous and conscientious version of yourself that kept up to date with all the literature, never got sleepy, never got an attitude, and did your

best to integrate this and deliver the services requested. Because that model would both capture, I think, the current level of performance, assuming that the, my interlocutor is highly competent, but also the limitations because of the lack of experience in the real world, the lack of embodied intelligence with all the roughness, the non-spoon-fed version where you're not, you're getting all the information, not just the selective, relevant information that would actually make you

that, that best possible medical student, go astray as well. Awesome, Zak, next question. Um, what about the current AI trajectory concerns you the most? I think what concerns me the most about the current AI trajectory is that it's, it is in the value space.

There are so many billions of dollars at stake in decisions whether to give growth hormone, whether to do an MRI after back pain, that inevitably there is going to be enormous pressures to even slightly align these models towards a financial end that may not be in the patient's best interests. And that will be problematic because it will be opaque to the patient even if it's not opaque to the doctor.

And so, knowing who you could trust to actually make the best decision for you is gonna become more and more problematic. Alright, Zak, this is our last lightning round question. What's the biggest cognitive bias that clinicians should watch for when working with AI recommendations? I still think it's their own biases, because you're gonna receive that advice recommendation fully loaded with your own recency bias, your own risk averseness, and so on.

And so, we used to say, when you go into a code, the first pulse you should take is your own. And similarly, I think when you use AI, first think about your own biases so that you can properly evaluate whether there is unreasonable bias in the recommendations or assessments of the AI. Of the, of the human, the sort of Kahneman-Tversky biases that we're, we're aware of, is like confirmation bias sort of at the top of the list or are there other things that you think?

Okay, confirmation and recency. Recency, yeah. And actually that ties in pretty well to this first part of the conversation here, too, with thinking about the false positives and when to adapt and when to try to avoid overfitting. Amazing. Zak. So, uh, you survived. Before we—. No, no, before we close. Oh, you have lightning round questions for us. Well, yes. Of course. Andy, make, give, give, give us some predictions in two realms, life sciences and clinical for next year.

What's gonna be, um, and you can also say there will be no interesting developments, but what are gonna be the interesting developments in life sciences and clinical next year? So, we have seen early glimmers of scientific discovery from AI and I, I mean like scientific discovery in the way that you would imagine a human scientist doing it.

Not like something that comes out of a diffusion model, but something that looks like a well-articulated hypothesis and then confirmation on that hypothesis. I think most of the things that we have seen so far have been somewhat derivative or marginal. I think 2026 will be the year that AI for scientific discovery gets supercharged. Obviously, the company I'm working at is working on that, but I kind of know what Anthropic's doing. They're pushing really heavily into life sciences.

There's a lot of other companies out there either working in a specific area of science or science broadly. And I think that one of the ways that we think about this is that science will soon be subject to the Bitter Lesson in that we've trained all of these domain-specific models, but lots of people are working on essentially one model of all of science and then connecting it to a laboratory. And 2026 is the year where that starts to feel more plausible.

And there probably will be one quasi-Move 37 moment, um, in 2026. And then 2027 we'll have 10, then 2028 we'll have a hundred. And then the way science happens will feel radically different, um, after that. So, I think that 2026 is the first year we get that. In medicine, the big trend in AI has been long-running jobs. So, AI agents that can go and do a day's worth of work, a week's worth of work. We haven't really seen that in medicine yet.

We've seen lots of like essentially point-of-care decision support, stuff like that. My guess is that there's a lot of money to be made in doing a lot of the harder to do, long-running, call someone on a telephone, negotiate with them, go back and forth, do the whole rigmarole that's involved with the delivery of care.

And my rule of thumb for AI and medicine for the last three years is go where the money is. And so, if people can hand off, I think Raj asked me this question when I was on, but like what currently can't be done? I think that a lot of these things that require back and forth and integration of a lot of different systems, so my guess is that someone's gonna crack that. It's probably someone who already has a foot in like one of the ambient dictation services will expand their offering to capture more and more of that value chain.

Raj, I wanna, it's hard to, you know, he, he, those were two good ones. Do you have, so it might, you might not have any new ones. Um. Predictions. In 2026. I think. I think I'm also like Andy, bullish on the sort of AI scientist area, although maybe where I am, I differ a little bit is, and maybe actually, maybe we don't even differ here, Andy. I'm not, I'm not sure. But where I think I spend a lot of time thinking is when will we know that, like we've actually achieved that moment, right?

Because it's not gonna come from the people who are developing the AI scientists, they're not the people who are gonna vet it, right? They're gonna, they're I think, very interested in promoting the outputs of their AI scientists.

And I think, Andy, my guess is, would agree with me that the discoveries, which maybe I'll put in quotes to be a little bit curmudgeonly, that have come out of our AI scientists for the last couple of years that are very incremental or not particularly impressive, are being sort of touted as a bit more than they are from this sort of current generation of LLM-plus-a-little-bit-more AI scientists.

And so, I, I know you guys are working on other paradigms here that have this sort of physical component. I think other folks are also interested in that and investing heavily in that. And so, I think it'll be very interesting to see sort of how all of that unfolds. But I think where I have no idea what the answer to, to this will be is when we'll know and we'll all sort of collectively agree that these AI discover, like, what that Move 37 will be, 'cause in—.

So, I can give you a concrete example. Yeah. I agree with it. Like a lot of them are Rorschach tests to see how much you wanna believe that this is a discovery. So, the best current superconductor is something like 10 to 15 degrees Kelvin. So, if you had a superconductor that would be superconducting at 50 Kelvin, that would be like an unambiguous-like discovery.

And I think that there are categories of discovery that aren't just drug repurposing in disguise, where if you have a physical lab that you'll be able to sort of unambiguously demonstrate some of that. But I agree that for the most part, a lot of this has felt like dog at the opera. Yeah. To use a Zak phrase and we haven't really had something that was like, oh, we, someone has made a superconductor that can operate at twice the temperature as our best current superconductor.

To, to, to connect it to your other favorite topic though, I would agree that, like, if there's a new superconductor that breaks the existing state of the art. Amazing, right? Is a genuine discovery. But attributing it to AI is still a non-trivial thing. And so, your, your other favorite topic is like, what was the role of AI versus just $500 million or a billion dollars and a lot of experiments being run by robot arms in parallel?

Like, is that AI? I, I guess so. Like, I think this is—. So guys, this is, I've, we've had a recent controversy which sort of opened my eyes to why this is not perhaps the most productive conversation. And that is the death of James Watson and attribution of the discovery of the double helix. And there was a woman, Rosalind, who was extremely important to that discovery, and with his death there's a back and forth, there's a relitigation of that. And listen to it. You can, it's a Rashomon.

It's like you can, depending on what you want to consider as intellectually important, intellectually critical to the discovery, you can view it differently. And I think that's gonna be the same true, the same thing is gonna be true of AI. I think that part's gonna be, ultimately, there'll be a shift with time where a whole bunch of discoveries are being made that would not have been made.

And we might not be able to pinpoint exactly what the contribution of AI was but we'll know that whether it was a qualitative or quantitative contribution, it's just elevated the, the, the whole thing. And so. Yeah. And I, I mean, I think, to be clear, I'm very bullish on this. I just think, uh, I, I think it's worth thinking about. I do think credit assignment is more important here than it is in most debates about it. But it's very hard. I think it's very hard. It's always very hard.

Yeah. But I think if we're, if we're saying that there's something that's coming outta the AI and the AI scientist, the existence of the AI scientist is, uh, is only proven once there's an actual AI scientist, right? That's making the discovery. But you know, we used to, um.

Back when I was in grad school, we, we used to all joke about how, you know, the mathematicians were the highest and, you know, there was a pecking order of how smart you were going down to, uh, you know, through engineering all the way to geography and, you know, God help you, um, and anthropology. But that's very much again, in the eye of the beholder. And I think that more historical shifts will tell us just how, how important this—.

And I mean, we'll be able to estimate something like the causal effect of people talking about AI scientists in discoveries per unit time. Like, I think that there will be like, global indicators that will help us understand how AI is impacting scientific discovery. Um, even if the credit assignment in any specific case is difficult or impossible. Yeah, yeah. Yeah. Agree. I buy that. I agree. I buy that. Yeah, I buy that.

So, the pace of scientific discovery should increase if AI scientists are good at being scientists, right? That's right. Correct. Okay. Zak, we have one last concluding question for you. And this is all right, just give us actually, I think this is a fun question to ask with me and Andy on the phone here with you, too. Uh-oh. So, you, you mentored both of us, and I think we're, uh, we've both said this before, we'll say it again.

We're both incredibly grateful for your continued contributions to, I think, shaping our values and, uh, helping us move through our careers. You know, long after we sort of wrapped up our official PhD or, or postdoc. It occurs to me that we're at a very interesting time where we have a skillset that could work really well in academia as it does for many people, or could work really well in industry. And I think Andy has really shown both sides of this, certainly.

And so, maybe we can ask this as our final question. What's the strongest case for staying in academia during the current AI boom? And then, by contrast, what is the strongest case, from your perspective, Zak, for not staying in academia right now? Alright. I think the strongest case to stay in academia is that unlike a business, the individual is not forced to make huge intellectual bets and focus monomaniacally on those bets, we can actually, uh, spread our bets a little bit.

And for those of us who don't like to be completely heads down, which is, it seems to me, almost a necessity to be successful in business, then the ability to explore is an absolute, uh, privilege and luxury that's afforded to you by, um, by being in academia. Why not to be in, um, academia?

Well, unfortunately the best reason just got, uh, apparent, which is the cutback in the number of PhD students, and I think that is a, if that is a persistent indicator, it's an extremely bad prognostic for academia. If it's just a blip, then a few institutions will suffer from that blip, but overall we'll do well.

But if that is a systematic decrease, especially in the sciences, I think it's a very, very problematic prognostic for the best and the brightest staying in academia because the reality is graduate students are absolutely the intellectual fuel of innovation and inspiration in academia. And, oh, perhaps another reason not to be in the industry is I do think I'm extremely bullish for AI. At the same time, there's got to be a real bubble out there, and some of the investments just don't make sense.

And there'll be an accounting, but again, we've been through this before. I certainly was through the, um, dot-com, uh, collapse, but some of our largest companies emerged from that collapse and existed before the collapse. I'm not that worried about it. And at the same time, if you are a young, you've heard me say this before, if you're a young doctor just graduating today from medical school, but you have impressive quantitative AI skills.

It is not obvious to me that you will have greater impact on medicine and on health care by staying in and doing your residency so that 25 years later you're gonna get your first R01, your first independent investigator award, as opposed to using all those smarts and not wait 25 years and get the respect and the attention of health care system leaders who are desperate for solutions. Because we are in a very rickety, uh, configuration of health care right now.

As we said before, super high revenue, very low margin, and so, the disruption from the, from within is much less likely than disruption from without. So, that's another take on where best to stay. Personally, however, I could not enjoy more being in academia and having delightful, uh, students and colleagues. But if I was 20 years old today, I don't know exactly which way I would take. Amazing. A lot, a lot to ponder. Thank you so much Zak, uh, for being on AI Grand Rounds.

This was a pleasure. Thanks, Zak. Alright, guys. This copyrighted podcast from the Massachusetts Medical Society may not be reproduced, distributed, or used for commercial purposes without prior written permission of the Massachusetts Medical Society. For information on reusing NEJM Group podcasts, please visit the permissions and licensing page at the NEJM website.
