Can AI Accelerate Science? Dr. Andy Beam on AI’s Next Frontier

Jul 16, 2025 · 1 hr 7 min · Ep. 32
Episode description

Dr. Andy Beam has trained models, mentored scientists, and used data to quantify the value of treatments. In this episode of NEJM AI Grand Rounds, Raj Manrai turns the table on his co-host, reflecting on how Andy’s childhood misdiagnosis, and the failure of human recall, revealed the diagnostic promise of machine learning. As a Harvard professor, he mentored hybrid thinkers and built tools to evaluate safety, not just performance. Now CTO of Lila Sciences, he’s building an experimental AI system to generate its own hypotheses and test them in the real world. This conversation is a front-row seat to the next evolution of science.

Transcript

Are these robots in a room? What is the experimental side? Yeah, they are robots in a room. They are disembodied robot arms in a room. We have a system. We have an automated experimental platform where if you're familiar with how experiments work, they often work on plates. So, either like a 96-well plate or a 384-well plate. These plates magnetically levitate over this planar motor system that we have, and they can zip next to this big rail.

There are benches with experimental equipment on them, and the robot arm will pick the plate up off the rail, put it in the piece of equipment, and when it's done, put it back on the rail, and then the plate can zip off to the next stop. So, the abstraction that I have for this is that, actually, we're building this new kind of computer. And that this planar motor system, this rail, is essentially like a PCI bus.

And what we're doing is hooking new devices onto this generalized PCI bus in the real world. And the idea is not to have a couple of these stations that can do what you can do, it's to have buildings of these stations that can do experimentation at scale, and then it really does start to feel like a new kind of experimental cluster that we can pair with a traditional GPU cluster. Hi, and welcome to another episode of NEJM AI Grand Rounds.

I'm your co-host Raj Manrai, and for this episode I had a lot of fun because I got to turn the tables on my good friend and co-host Andy Beam, who is the guest of today's episode. Andy and I have known each other for a long time. We were postdocs together at Harvard, literally sitting in cubicles next to each other, and then we went on the academic job market.

And started our labs around the same time. Until last year, Andy was a professor at the Harvard School of Public Health and now he's the CTO of Lila Sciences, a company working on scientific super intelligence. So, I know Andy really well, but I still learned a lot of new things about him during this conversation, including about his early experiences in health care.

I was struck by how he got interested in medicine, his decades long fascination with artificial intelligence, and his predictions for AI, both in medicine and more broadly. The NEJM AI Grand Rounds podcast is brought to you by Microsoft, Viz.ai, Lyric, and Elevance Health. We thank them for their support. And with that, I'm delighted to bring you my conversation with Andy Beam. Alright, Andy Beam. Welcome to AI Grand Rounds.

I get to say that this time. So, Andy, let me, let me first say I'm truly excited that I get to pose this question for the first time to you, and I think I could probably simulate this reasonably well given how much time we've spent together over the last decade. But Andy Beam, could you please tell us about the training procedure for your own neural network? How did you get interested in AI and what data and experiences led you to where you are today? Yeah, it's funny.

I'll try and give you some new information here, Raj, but you probably can predict a lot of this trajectory. So, you know, I was always kind of like an engineering nerd as a kid, I wasn't really interested in medicine. My mom tells this story that in kindergarten they were doing this new experimental setup where they had stations and the kids could rotate through reading, arithmetic.

And I literally spent my entire kindergarten year at the Lego station to the point that I couldn't write my name after kindergarten. So, I've always just been interested in tinkering and building and engineering, and it was really in high school when I started to think a lot about computer science and computer engineering. I got this Dell. Uh, dude, you're getting a Dell.

For people who remember those ads, uh, with a Pentium III, like 733 megahertz, and I just completely got rabbit holed by all the things that you could do on a computer. I got into, like, hardware hacking a little bit.

So, I made like a little side hustle in college modifying Xboxes. So, I could take someone's Xbox, you could solder two jumper points on the motherboard that would let you flash the BIOS and essentially turn it into a general purpose computer so you can install a bigger hard drive. You could run Super Nintendo games on it. And I made like a decent amount of beer money my freshman and sophomore year modifying Xboxes. You'd come into like my dorm room and there would just be like a stack of Xboxes, like floor-to-ceiling. Because people in my dorm would bring it by, I'd charge 'em like 50 bucks and modify their Xbox.

In high school, I also took my first programming class. So, I took a QBasic course at a local community college, and I think after that experience I was just like, computer science is like what I want to do.

It spoke to, like, so many of the things I was interested in. And really, I think that's sort of been the guiding principle, that's the field that I most resonate with. I'll keep going through my trajectory here, but I really have changed fields a lot. And I think that though I still view most things through the prism of computer science but informed by some of these other things that I've been working on.

I will say I have this formative memory, that got me interested in medicine and the intersection of computer science from my childhood. I was actually, I got sick a lot as a kid. I had strep throat like four times a year. I had chickenpox and shingles at the same time. So, I was like in and out of doctor's offices a lot growing up. I also got like a stick stuck three inches into my quad playing wild goose chase in my neighborhood.

So, I like, I had like lots—. I don't think I, I don't think I knew the stick in the quad, I think. No. Wow. Yeah. So, you have these strong, strong experience with the health care system even growing up very, very young. Yeah. And there's one of these that I think came back to me later in life that reinforced the potential for AI in medicine. And it was in sixth grade, so it was my first year of middle school. And like most nerdy middle schoolers I went to space camp that year.

I actually went to also something called Spec Camp, which in North Carolina was like the academically gifted nerd camp that you went to in the summer. But I got home from space camp and I started barking like a dog. I had this cough that was, like, very much like barking like a dog. It was the strangest thing my mom had ever heard.

We were at the beach, and it continued and continued and eventually one night, I coughed so much that I became asphyxiated and couldn't breathe, and it eventually just went into vomiting. Sorry that this is kind of a gross story. Oh my God. Oh my God. And it was just traumatic. Like, it was traumatic. And so, my mom took me to the ER. They had no idea what it was.

They took me to the pediatrician the next day. And in the pediatrician's office, I went in the exact same full spell, like full coughing, bronchospasm, emesis, right in front of the pediatrician. And the pediatrician was like, you know, uh, I think you have a sinus infection. My mom was, like, this kid does not have a sinus infection. Like, I don't know what he has, but he does not have a sinus infection.

So, I, like, had this for several more days and my mom woke up in the middle of the night and had this flashback to when she was a kid. And she remembers being in the car with my grandparents, her mother and father. And the same thing happened to my grandmother. And so, they had to pull the car over and my grandmother vomited on the side of the road and my grandmother had whooping cough.

And so, I was actually having textbook presentation of whooping cough, but the pediatrician had never seen this during their entire practice. They had never seen this in their life. Whooping cough had mostly been eradicated, and so, the next day my mom called the pediatrician, and she was like, do you think Andrew might have whooping cough? And they're like, well, it's funny you say that.

We had just got this random call from the CDC and there've been a couple documented cases of whooping cough in other parts of the county. And so, turns out I did have whooping cough. I got these huge horse pills that were terrible to take but remedied it. My dad's a dentist. And so, they actually shut down his practice for a couple weeks. The CDC came to my dad's practice, and came to our house, and kind of like did a full canvass.

But what this told me was that, you know, I had, like most people growing up, this sort of like reverent view of physicians, that they are some mix of being members of the clergy, but also have some non-trivial amount of omniscience and can correctly diagnose everything. But the reason why my pediatrician was unable to diagnose it is that they had never seen it. Like, it was a textbook presentation.

So, if you're looking at the conditional probability of whooping cough, given my symptoms, it would've been close to one. But the fact that there was this recency bias with the pediatrician, they just had a blind spot. And there's a very, like human well-studied like cognitive bias, like recency bias. So, that stuck with me for a long time, that there was this flaw in the way that people think about diagnosis.

And so, as I moved through undergrad, I was studying computer science, computer engineering, electrical engineering, trying to decide what I was gonna do. I thought about being a network engineer and going to work for Cisco. I was interning at Qualcomm doing very large circuit design verification for the Snapdragon processor. So, this is, this is all in all, all in North Carolina at the time, right? All in North Carolina, yeah. At NC State.

So, I went to undergrad at NC State. And so, I thought that I was gonna do one of those two things. And then I took an AI class, the undergraduate AI course at NC State. Uh, it was the green, uh, like, modern AI book from Russell and Norvig, like the classic textbook. And it was just, like, completely the most mind-blowing subject I had ever seen.

We talked about the Ship of Theseus, we talked about, like, all of these philosophical issues, like what does it mean to be conscious, but then also the, like, very practical things, like, how do you search over large spaces with things like A*? How do you do theorem proving? And it was like just essentially an amalgamation of all the subjects that I found super interesting and really was a hard fork for the trajectory of my life.

So, I decided that I wanted to do AI. Didn't wanna do those engineering things. And so, then I tried to work backwards from that realization and figure out what were the most exciting things that I could work on. And again, I had this like flashback to this whooping cough episode in sixth grade and said, like, medicine has to be one of the most impactful things that you could work on and also has all these interesting properties.

So, I then decided, spoke to a lot of my professors, asked them for their advice. I got very sage advice from the same AI professor who taught the course. This was in like 2006 or 2007, and he said, you know, this thing called machine learning really seems to be important. It seems to like be on an upward trajectory. So, if I were you, I would go like really understand probability theory and reasoning under uncertainty and really get deep in that. And so, I took his advice.

I ended up staying at NC State and getting a master's in statistics. Also doing some research at the EPA. Learned a whole bunch about the foundational theory behind probability theory. Lots of super interesting stuff. Finished my Ph.D. at NC State in bioinformatics doing Bayesian neural nets for genome-wide association studies. This was Bayesian neural nets, before autograd was a thing. GPU computing had just started. So, I was, like, writing CUDA kernels by hand. Writing the backprop by hand.

There was no autograd. Anytime you made a change, you had to go back to your code. This is a, this is a really good, back in my day. Yeah. It really is back in my day. And so, I again learned a lot about low level deep learning 'cause these were Bayesian neural nets. Yeah. And just like a lot of the sort of nitty gritty about how to train those models. I was part of a two-body problem.

Long-time listeners of the show know that I'm married to a physician, and so, she had gone out on residency interviews in pediatrics sort of all across the country. I kind of tagged along and looked for postdocs in the places she interviewed. Boston seemed to be the clear winner in terms of the two-body problem optimization. And my Ph.D. advisor was someone named John Doyle who came from MIT and had worked with this guy named Zak Kohane.

So, towards the end of my Ph.D. I started talking to John about, like, who's working on the frontier of AI in medicine. He's like, you should really go talk to my friend Zak. So, while Kristyn was up here interviewing, I went by DBMI, what was actually CBMI at the time, the Center for Biomedical Informatics, and had a talk with Zak, and he was just awesome. He was like exactly the perfect postdoc mentor that I was looking for. Also, like, completely understanding about the two-body problem.

So, he was like, I understand how this works. Like, we'd love to have you up here. If you guys match in Boston, let me know and I'd love to host you for your postdoc. So, we went to the match day, you know, really feels like the NFL draft. We actually like brought hats from the different cities with us. Her family was there, my family was there, got up on stage, got the envelope that said Boston. And so, I immediately got on my phone and I told Zak that we were coming.

Came to Boston, spent like three awesome years with Zak doing a postdoc. Really in the early days of medical AI. I remember I showed up on the first day of my postdoc and it was like, Zak, we need to get some more GPUs. This was like 2014, and he was like, why? I was like, neural nets are like going to really change almost everything. He's like, I trained neural nets in the year 2000. What's different now?

I was like, well, these GPUs, essentially, each one of them has, like, more computing power than the national labs had in the early 2000s. Like, so, it's, it's just very different. So, he immediately got it, and we started working on lots of stuff in the space. So, maybe just a few more points before, I actually don't know what you have in store for me, so, I'm excited to see what questions you have.

Yeah. Let me redirect it now, 'cause I think we're gonna talk about a lot of your work after your postdoc. Yeah. For the next part, I have so many reactions. So, the first thing is, I think I could simulate a decent amount of that, Andy, but you definitely put in some new content there. And I have so many reactions.

One of them that is really, I think, really important is, you know, you're talking about this story where you had eventually you would go on to be diagnosed with whooping cough and, you know, very sort of strong memory that stuck with you and that led you to medicine and led you to work on problems in medical AI. There's so many things about this, right? Your mom was very involved in this. It's a loved one of the person who's suffering.

This has been a persistent theme, and then you were misdiagnosed first, and now I'm sure the current version of you is wondering – if you haven't already done this – you know, how would ChatGPT or similar models have done with that presentation? My guess is it would've been, uh, quite high on its differential diagnosis, but...

For sure, I definitely have, and it was actually, I mean, you know, some of the early models that we did during my postdoc, like we would give you the word-by-word change in probability. Yeah. Whooping cough was always one that I would use to test it. So, so even in the primitive era before current large language models, it was, it was high up there. Even the LSTMs trained on small data could get it right.

Yeah. Yeah. So, this is what I want to dig into next. So, this is your academic work. You know, I think you're about to go here, and so, maybe I'll try to briefly, I think, pick up where you left off. And I do have to say, I think that was a fantastic answer for, also, the genesis of, your interest in AI and medicine as well. So, you did a postdoc with Zak. Very successful. Actually, maybe I'll just give my comments on this.

You know, we spent a lot of time together with neighboring cubicles to the point that I — maybe we've mentioned this on the show before — but we were so loud and having so much fun, just joking and distracting each other during the day that we eventually did get separated. We eventually, I mean, we essentially just took our postdoc and put two mics in front of us and started talking into the podcast. Yeah, exactly. So, that is now AI Grand Rounds. Yeah. And it's a lot of fun. And so, yeah.

So, we, you know, we had a lot of fun talking about everything, right, from LeBron James to AI to deep learning. Zak, Zak Bingo. Zak Bingo. Zak Bingo was always very fun. Uh, well, some of our listeners know what that means, but do you wanna, do you wanna tell them what Zak Bingo is? Uh, Zak had a lot of phrases that he would commonly use in talks. One was using an analogy of Netflix knowing everything about you, but the health care system knows nothing.

So, that would be a square on the Bingo card. Information Theoretic Criterion was another one. So, we had a whole bingo card for Zak-isms. Oh, amazing. You just took me straight back to postdoc. So, one of the things that I wanted to bring up from your postdoc work, so I think you were very prescient with this, and I think it's been a through line in your work and your academic work afterwards, you were saying things that now we would take for granted.

But you were saying them very early on, which I think felt much more like science fiction when you were saying them back in, I don't know, less than 10 years ago, even 2018, 2017. So, you were trying to solve this problem, as I remember it, of designing a neural network to pass the USMLE.

And of course, we know now that every other large language model can do this, but at the time, what was the sort of motivation and also, what were the reactions to some of this work when you would tell either doctors or machine learning researchers? Yeah, I mean, even going back to like when I was in grad school and Kristyn was in medical school, it just felt obvious to me that computers were gonna be able to do diagnosis better than people.

And I would tell her friends that. I was super unpopular at the med student parties because I felt like the bringer of the apocalypse, which is kind of, I'm sure, how they viewed me. Why was it obvious to you? Why, like, what was, like, what are the sort of the deep reasons that you saw it as inevitable?

Well, so you can go full first principles here and just say, like, if you think that cognition is, if there's not any non-physical component to that, then surely, we can recreate those processes in computers. And computers have scaling properties that human brains do not. So, it was just a question of when, not if. This is also, like, I was reading lots of futurist literature at the time. I always tried to be slightly more sober than like what you would see in that.

But it just, if you plotted out where we were in like 2009, 2010, it just seemed like, I didn't know exactly when, but it just seemed inevitable that computers are gonna be better at doing these types of deductions than humans could ever be. Computers don't get tired, they can read the entire Internet. They have perfect recall. It just seemed like a complete mismatch in terms of capability. So, that was something that was core to my belief, like in early grad school in like 2010.

And I think that still, it's mainly a core belief that I have. So then during the postdoc, Zak was actually very supportive of like these types of like semi-heretical views. Like Zak loves to, I think —. He loves these, right? He loves, he loves it. Yeah. And so, we started working on this project, and Kristyn was, had just passed step one and was working on, like, step three I think was what you take during your first year of residency.

She had done step two at the end of med school, and so, like, I always joke that it was due to a complete lack of imagination that I was just, if I wanted a computer to be good at medicine, I was just gonna have it do what she did, like take, and I, Raj, you have done a good job at articulating this point, too. Like, step one is a necessary but not sufficient condition to be a doctor. We would get these questions all the time.

Like step one, I gave a talk at GTC in 2017 that, like, step one should be a benchmark for medical AI. Step one has all these properties. People were really interested in getting computers to do differential diagnosis, but it's very hard to grade a differential. It's very hard to get the data, like, there'll be disagreement as to like what the correct differential should be. But these questions are canned with unambiguously correct answers.

There's a whole bunch of human performance that you can get as to how well humans do on this test. There's exactly the kind of data that you would need to train a model to be able to do this. And so, from my perspective, step one just felt like an obvious benchmark for not only medical AI, but AI generally. So, I kind of laid this out in a 2017 GTC talk, and then we started working on it.

We got some funding from the Robert Wood Johnson Foundation, and I had lots – I'll talk about this later, but we got bitter-lessoned, and it was sort of one of my first exposures to the Bitter Lesson. The idea was like we could train LSTMs on data that we had curated from the Internet where you can kind of like make example step one questions, and you train it just like the correct answer. And that was the plan.

And we were training models on Titan Xs, like, tiny GPUs by today's standards. And the models weren't bad. They were able to get like 40-ish percent of these questions correct. Which, to my knowledge, at the time was like one of the strongest results someone had had on that. They would do these cute things, too, where they could like, they would give you the sort of word-by-word probability of diagnosis.

So, you could feed in a patient case and you could kind of watch it think in real time about, you know, one of the ones that we always used was Kawasaki disease. And it would be like, a 1-year-old patient has had a fever for four days. Evidence of strawberry tongue. And as soon as strawberry tongue showed up, like the differential just collapsed in terms of entropy. Like Kawasaki would jump way up and all the other things would go down.
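To make "collapsed in terms of entropy" concrete, here is a minimal sketch of how the Shannon entropy of a differential drops when one diagnosis takes most of the probability mass. The probabilities and the placeholder diagnosis names are hypothetical and chosen only for illustration; they are not outputs of the models discussed in the episode.

```python
import math


def shannon_entropy(differential):
    """Shannon entropy, in bits, of a dict mapping diagnosis -> probability."""
    return -sum(p * math.log2(p) for p in differential.values() if p > 0)


# Hypothetical differential before the phrase "strawberry tongue" is read:
# the model is spread evenly over four candidates.
before = {
    "Kawasaki disease": 0.25,
    "diagnosis B": 0.25,
    "diagnosis C": 0.25,
    "diagnosis D": 0.25,
}

# Hypothetical differential after the phrase: most mass moves to Kawasaki.
after = {
    "Kawasaki disease": 0.85,
    "diagnosis B": 0.05,
    "diagnosis C": 0.05,
    "diagnosis D": 0.05,
}

entropy_before = shannon_entropy(before)
entropy_after = shannon_entropy(after)

print(f"Entropy before: {entropy_before:.2f} bits")
print(f"Entropy after:  {entropy_after:.2f} bits")
```

A uniform spread over four candidates carries two bits of uncertainty; once a single diagnosis dominates, the entropy falls well below that, which is the "collapse" being described.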

So, it was kind of neat that you could kind of like get at some of these, see how the model was reasoning. But ultimately, they were very, very small models trained on limited amounts of data and therefore the ceiling of those things was, like, pretty low.

So, I think what it taught me from that was one, that the medical AI intuition that I had had for a long time was right, but also, that if you, in this sort of new era of AI, you should always be working on the most general form of the problem. I thought that I was working on a relatively general version of the problem, but it actually, it turns out, that the more general version of that problem is predicting the next token.

And so, when GPT-3 and GPT-4 came out, they kind of solved these problems outta the box, like you alluded to, like people are now completely unimpressed when a model — It's amazing. — came to nine. Yeah. Yeah. It's amazing how much the goalposts have moved just in the last few years, right? Yeah. Both for what it means for the quote unquote intelligence that's in the computer models, but also for what it means for humans, right?

The whole conversation even around the significance for a human passing these tests has changed once AI has cleared them with ease. So, you were doing all this interesting work, right, on the USMLE benchmark on other tests, and I remember we went through this together. Then you went from postdoc to starting your own lab.

And so, you became a professor in the Department of Epidemiology at the Harvard School of Public Health, and you were continuing, I think, your methodological work, but you also started to work in a very focused way. And you can tell us about the sort of origin of this, although I can guess some of this, right, on problems specifically within neonatology, right. Applying AI to neonatology.

So, maybe tell us about Beam Lab and, you know, life as a junior faculty member, how you got it off the ground, what the philosophy was for the group. Yeah. So, I was excited to join the Department of Epidemiology. Again, motivated from the AI perspective. So, this was like 2018. So, I started my lab July 1st, 2019, and this was still very early, so this was pre-GPT-3. There was a feeling that we had run out of steam for the, like, AlexNet, supervised learning, everything-is-a-ConvNet-problem paradigm.

And so, there really was a sense that, like, we were looking for the next paradigm. And what the Department of Epidemiology at Harvard does like better than almost anywhere else is causal inference. And so, I was excited to join the department to learn from folks like Jamie Robins, Miguel Hernán, and folks like that, who are world leaders in causal inference, to see how we can get some of that type of causal reasoning into AI systems.

And we had a couple like really great papers on that, some of which were at NeurIPS and ICML workshops about sort of blending causal inference with deep learning. The applied side of my lab has always been focused on neonatal perinatal medicine. Again, due to a complete lack of creativity on my part, Kristyn went on to be a neonatologist. And so, we've ended up working together a lot, collaborating a lot.

And, I think, to the credit of you and her, who have been big influences on my academic career, we did a lot of more traditional epi, health care data science kinds of things under this umbrella of neonatal perinatal medicine. One that I'm really proud of is, we have a series of papers looking at this drug that's thought to prevent preterm birth. So, preterm birth is babies born before 37 weeks of gestation.

About one in 10 babies in the U.S. are born preterm. It's one of the biggest sources of neonatal morbidity and mortality there is. So, like, preterm birth is a big problem. And historically there's only been one drug that you can use to treat it. It's called 17-alpha hydroxyprogesterone caproate, or 17-OHPC or 17P for short. The efficacy for this drug was demonstrated in a 2003 NICHD trial that maybe I'll come back to in a little bit.

But it was, it was an NIH trial that was run. It was administered for a long time as a compounded medication, and kind of, like, was standard of care for women who were at risk for preterm birth. So, the indication is actually recurrent singleton preterm birth. So, if you have a history of preterm birth and you're currently carrying a singleton, you're eligible for the drug.

So, as part of like my interest in getting into AI and machine learning, in Zak's lab we had access to this amazing clinical insurance database that had the lives of 40 million Americans over eight years. When we got access to this data, I was like, we're gonna like machine learn the crap out of this, and we're gonna predict all of the things, and we're gonna like, create the like, world's best AI system using this huge database.

So, it was instructive to understand how misplaced that enthusiasm was for this kind of data. So, one of the things that you learn in health care is like, not all data are capable of answering all questions. And so, I spent about a year of my life, really going deep and figuring out, like, where all the warts were on this data, and trying to figure out what types of questions it could support.

I started collaborating with a maternal fetal medicine doc at Beth Israel just to try and like get some clinical feedback on these ideas. She's like, you know, this machine learning thing is great, but really there's this question that we have no idea how to think about in maternal fetal medicine around the 17P drug.

In the year 2011, a drug manufacturer had acquired the rights to 17P and started reselling it under the branded name Makena, which at first people were pretty excited about because it would increase access. But they started essentially charging an arm and a leg for something that previously had been essentially free under this sort of brand name Makena. So, she's like, we would love to, like, understand more about the economic impact of this.

And also like there's a lot of controversy around like, does this drug even work? So, we ended up writing a series of papers. The first paper was in JAMA Internal Medicine, just, like, looking at how much patients are being charged for this medication. And so, we found, on average, the price per pregnancy for Makena was something like $11,000. And on average the price per pregnancy for the compounded version was $200.
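As a quick check on the two prices just quoted, the relative increase can be worked out directly. This is only arithmetic on the approximate per-pregnancy figures given in the conversation; the variable names are illustrative.

```python
# Approximate per-pregnancy prices as quoted in the conversation (USD).
brand_price = 11_000      # branded Makena, "something like $11,000"
compounded_price = 200    # compounded 17P

# How many times more expensive the branded version is.
ratio = brand_price / compounded_price

# Percent increase relative to the compounded price.
percent_increase = (brand_price - compounded_price) / compounded_price * 100

print(f"Price ratio: {ratio:.0f}x")
print(f"Percent increase: {percent_increase:.0f}%")
```

With these rounded inputs the branded price is 55 times the compounded price, an increase of 5,400 percent, which is in line with a ballpark of roughly 5,000 percent.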

So, something like a 5000% increase with plausibly, like, no meaningful benefit given to the patients. Like, the outcomes between the compounded and brand name version of the drug were essentially identical. There's no difference. So, then we did a follow-up paper where we used ideas from causal inference, and this is where it was super helpful to be in the epi department, to do something called target trial emulation.

So, this is where you write down the inclusion criteria, the study design, just like you were doing an RCT, then use observational data to try and emulate that in your dataset. And so, there was a parallel RCT going on that the manufacturer had to do to be able to get the approval renewed. And so, we followed that inclusion criteria, we followed that study design, we did the target trial emulation and found essentially no evidence of benefit.

And this was like a very robust finding across lots of different kinds of sensitivity analyses and just felt like very solid. So, we published that in a perinatal journal. Then, after this, the trial came out: the second, subsequent RCT for Makena was actually negative. There were maybe some subgroup effects. The FDA reviewed this and decided to remove authorization for this drug in the marketplace and cited our paper as one of the key pieces of evidence that they used in this decision.

So, like, I'm never super excited about patients having fewer treatment options, but I think that this was an instance where we could actually use some of these data science methods to have clinical impact. Because if a drug doesn't work, one, we shouldn't be paying $10,000 for it. And two, there's obvious side effects, too, with a lot of these drugs.

So, there's so much there, maybe one of the things that I'd love for you to just dig into a little bit more is, you know, you said something along the lines of knowing what data can support what questions, right. How to align different data sets with different questions.

And, in some sense, I think this is what really separates the quality of a lot of research, which is not that you're, you know, of course there's, there are data sets that just in general are superior and wonderful and useful for a lot of things. But I think knowing that marriage between the data and the question, and maybe we can also add the compute to the mix of this, is really the sort of art of setting up a student for success, right?

Or working with a student to come up with an idea that is likely to be fruitful and interesting. And so, you know, you've got your lab off the ground, you're publishing these interesting papers in neonatology. Continuing your methodological work around causal inference and AI and then growing your lab, recruiting students. Maybe you can just reflect a little bit and then I wanna transition to your work now and what you're up to these days.

But you can reflect just before that on how you approached recruiting students and then mentoring them and designing projects for students in your lab. So, like, what was your philosophy? I think a lot of people who are, you know, junior faculty are interested in this kind of stuff as well. Yeah. Let me first qualify and say that when I started my lab, it was a particularly crazy time in my life, personally and in the world generally.

I started on July 1st, 2019. We had our first daughter, July 25th, 2019. So, full 25 days into starting my lab. Seven months later, COVID happened. Daycares shut down. Complete insanity. My wife got conscripted into a lot of ICU service at MGH and Brigham 'cause she was still a fellow, so she was covering a lot of the pediatric ICUs while the pediatric ICU docs got conscripted into the normal ICU. So, I have a partial and fuzzy recollection of essentially the first two years of my lab.

But let me try and give you a sense of how I thought about it. I viewed my lab to be a place where computer scientists who are deeply interested in health care could come and work on important clinical problems. So, again, that, I think that the 17P project is a good example of that. That was led by a student in my group, Joe Hakim, who is an HST student. So, HST has been featured a lot on this podcast already. But a bioengineer by training.

And so, he was interested in making a clinical impact. He would go and meet with the MFM doctors and really, like, dig deep into how can I map your clinical definitions onto what the data can actually answer. And it was always kind of like a 50-50 split. So, folks who are coming from purely computational backgrounds, and I also supervised a lot of residents, medical students and people like that. I do think that there often had to be like a very sincere interest in AI and machine learning.

So, I would sub-select on folks for that. That if we just wanted, like, we weren't doing a lot of like RNA-seq analysis in my lab. It really had to be something, a clinical question that you could answer with a large health care data set and ideally some type of machine learning approach. I tend to be relatively hands off when it comes to day-to-day. We would do some things that would be organized in a much more structured kind of way.

So, we have a NeurIPS paper on something called proximal inference, which is a subset of causal inference. And we ran that very much like in two-week sprints where like, here's what we're gonna do for the next two weeks, we're gonna check back in. That project, I think was —. The machine learning conferences are great for encouraging those sprints, right? Yeah, exactly.

Yeah. Yeah. But for the most part, I tended to also start graduate students on a shovel ready project that was, like, here's the project, here's what success looks like. Go and execute it. Project number two is much more of, like, here's a general theme of things that might be interesting to look at. And then the idea was by the third project, they'd be able to just ask and answer their own questions that they found interesting.

So, I did try and sort of ease folks into research by giving them like a little bit more structure in the beginning but then being, like, less structured at the end. Yeah. And I, I think you're very thoughtful about that. And you know, we just spoke with Anil, who's now at Google. Actually, this hasn't aired yet, but the episode will air soon. And Anil was one of your first Ph.D. students, right?

Mm-hmm. And I think the thought and the care that you put into sort of the arc of their career while sort of being hands off, but also letting them grow, letting them develop independence. But also giving them a little bit of structure, semi-supervised so that they can succeed, I think is very clear. And it was very clear in what he said other than us both trolling you, of course, as necessary. Alright, so I want to dig into your work now at Lila. And so, okay, so let me try to frame this.

So, last year you went on leave from your Harvard professor job to become the CTO of a new company. I think the company was in stealth at the time, now out of stealth called Lila. And maybe we can start with your thought process behind the move. So, given what we're going through at Harvard right now, some would say you look like a genius. I know you have a very good crystal ball, Andy, but I think your decision to move preceded the current funding crisis. You were doing great academic work.

You're mentoring students, you're building a research vision around AI for neonatology with the superior Dr. Beam, your wife. Why leave? Why move from Harvard? Yeah, it's a good question. Let me first preface by saying that it wasn't my first time to go into a startup. So, again, something that I owe to you is before I started my faculty job at HSPH, I took a year off to help start a company. And this is actually, again, advice from Raj. I had been interviewing for faculty jobs.

I had had an offer to join a startup from a company called Flagship. Flagship is a venture capital firm that instead of deploying capital in external companies, they used that capital to incubate and spin out companies. So, I'd been part of an incubation process at Flagship as a consultant. The thing that I had been consulting on got funding from Flagship and was gonna go get started as a company. It was centered on using machine learning for protein engineering.

So, can we use machine learning models to make protein therapeutics better, faster, cheaper, in a more targeted kind of way. Super interesting. Hadn't thought about protein engineering before but got to do that. And so, I was like, kind of torn. I was like, this is like a really interesting idea, but it's hard to turn down this faculty job. And, you know, to your credit you're like, why not both? Why not both? Uh, why not both?

And so, I actually delayed the start date of my faculty job for a year and joined what is now known as Generate:Biomedicines as the founding head of machine learning. Helped build the team. Helped build a lot of the early models. Helped build the strategy. Was there full-time for a year. Remained in a part-time capacity for four years after. I think my title was like professor in residence. So, I got to do the fun, like do the startup thing for one day a week and then the professor thing for the other four.

Generate has gone on to be I think pretty successful. They have 300 people. They've raised something like a billion dollars to date. They have two drugs in clinical trials. And that to me is the most important validation of the technologies, that they actually have made real things that seem to work. So, I had a super pleasant experience. So, I think that experience was de-risking for me to go join Lila.

So, with that preface, let me answer your question. So, I was out on paternity leave in February of 2024. And it was, like, our second child by comparison was much easier. So, no COVID, no starting lab. I was also helping start Generate at the time our first was born and it was actually just kind of like a peaceful time in our lives and that gave me, like, kind of a chance to reflect.

Going back to our, like, earlier conversation about my motivations, AI has always been the thing that I was interested in. Health care has always been a super important and interesting domain, but has always kind of been the sandbox versus the thing that has been my primary motivation. I mentioned before I started my faculty job that it was a pre-GPT-3 world. We still hadn't seen the benefits of scale. It still seemed plausible that you could do frontier AI research in an academic setting.

And you know, in 2024 when I was reflecting, it became hard to make that case that you could do frontier AI research without significant resources. It could also be that, you know, I had done the faculty thing for five years and I was getting the startup itch again. And so, I started to ask around. I do remember some texts along the lines of, I can't believe how much fun it is to actually be able to code and to just spend some time. I think you were, you were doing some coding again, right?

I was actually, through that period, I was actually doing some woodworking too. The desk that I have now is also amazing. Amazing. Amazing. Yeah. So anyway, I started looking around and there was a company called FL97. So, Generate was FL57. That just means they give them like serial numbers at Flagship. So, FL97 is the 97th company that they've incubated. So, there were 40 in between Generate and —. Over the five-year period. Yeah, yeah, yeah, exactly.

Yeah. And so, not accidentally, two of my Ph.D. students were at FL97. And I had been advising FL97 for a little bit, so I kind of had an idea of what it was. But Flagship companies have these very interesting evolutionary trajectories where they start in one place and over time they tend to evolve and change and adapt, and then they end up somewhere potentially very different.

So, FL97, and I'll just call it Lila from here on out, started to converge on something that was really, really, really interesting and really, really compelling to me. And what I wanted to understand is Flagship is known for making biotechs, you know, is this going to be like another biotech AI company or is this like actually an AI-first kind of company?

Meaning like, is the primary goal of this company to create AI or is it to use AI in service of creating an asset, a molecule, something like that? And so, I got to go spend some time at Lila, got to meet more of the team that they had built, got to meet the leadership team, and just became convinced that this was a really exciting AI company (and I'll talk a little bit more about the thesis behind Lila) that was going to be less constrained from a resource perspective.

So, we were going to commit significant resources to GPUs, serious resources to creating the data that you need to create new kinds of AI models. And it just, it felt like kind of the culmination of a lot of the different things that I had been thinking about over the last 10 years. And so, I always say that like I had a fantastic job, you know, I technically still do, I'm on leave, but like being a professor is a great job.

What's happened in the last three months notwithstanding, as of March 2024, it was a great job. Had a wonderfully supportive department, wonderfully supportive school, great colleagues, world class students. And so, this wasn't that I was unhappy, it was just trying to be honest about whether the kinds of problems that I want to work on were accessible to me in academia.

And I think when I was clear-eyed, it just became hard to argue that I could work on the problems that I wanted to in an academic setting. So, the message I'm hearing is that our ability in academia to retain Andy Beam scales with the number of GPUs that we have access to. It's another scaling law for talent. But, so you're, okay. So, from that description, I understand that you're focused on AI first, which means not applications of AI or not just applications of AI, but AI itself, and that you need a lot of compute.

You need a lot of GPUs to accomplish your mission. And maybe you can tell us what that mission is, right? What are you trying to accomplish? Where are you and sort of where do you see this going for the next couple of years? Yeah. And just to preface or circle back on that last point, there are interesting problems that you can work on in academia, everyone has their own utility function.

So, I'm not saying that there's nothing interesting happening in academia. It just happens to be the ones that I find interesting are hard to work on in an academic context. So, what, so then what are we doing at Lila? We recognize that the scaling paradigms of the last five years have been enormously successful. Again, talking about passing the USMLE, as a sort of an accident of these scaling paradigms, but they're probably also saturating.

So, I think it's clear that pre-training or maybe the scaling law still works. So, power laws are kind of a hell of a thing, in that to get the same amount of benefit you still have to scale the compute by an order of magnitude. So, it might just be that we can't keep scaling it up to 10 million GPUs, and so, people are looking for new scaling paradigms. We think that models need the ability to essentially generate their own tokens.
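The remark about power laws can be made concrete with a toy curve. The sketch below assumes a loss of the form L(C) = a * C^(-alpha); the constants a = 10 and alpha = 0.05 are made up for illustration and do not come from the conversation or from any particular model family.

```python
def loss(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Hypothetical power-law loss curve: L(C) = a * C**(-alpha)."""
    return a * compute ** (-alpha)

# Compute budgets spaced one order of magnitude apart (arbitrary units).
budgets = [10.0 ** k for k in range(3, 8)]
losses = [loss(c) for c in budgets]

# Loss ratio and absolute gain for each successive 10x step in compute.
ratios = [nxt / prev for prev, nxt in zip(losses, losses[1:])]
gains = [prev - nxt for prev, nxt in zip(losses, losses[1:])]

for c, r, g in zip(budgets[1:], ratios, gains):
    print(f"compute={c:.0e}  loss ratio={r:.4f}  absolute gain={g:.4f}")
```

Under this assumed curve, every tenfold increase in compute multiplies the loss by the same factor, so the same relative benefit always costs another order of magnitude, while the absolute gain per step keeps shrinking.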

And so, the models need the ability to ask and answer questions that people have not asked before. Large language models, one way to think about them is that they're a wonderful index into human knowledge, so everything people have created is accessible to a large language model. They're able to access it in this very fuzzy kind of way where they can do fuzzy pattern matching. And they're really great at accessing human knowledge.

Again, putting my causal inference hat back on though, we know that there's limits to what you can do with what amounts to a big pile of observational data. So, if you actually want to make a claim about how the world works, the best thing that a model with observational data can do is tell you kind of like what hypotheses are compatible with the data that it has seen before.

And the only way to essentially pick from a set of hypotheses is to either make strong assumptions like we would do in causal inference, or actually do the experiment. And so, that sort of key insight is what we're developing at Lila: how do we take these very powerful, large language models that have been trained on the entire Internet and pair them with a scalable experimental platform that will let them break ties that exist in the literature.
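The point about observational data leaving several hypotheses tied can be sketched with two toy causal models. Everything here is a hypothetical illustration: the model names, the binary variables X and Y, the hidden variable U, and the probabilities 0.1 and 0.9 are assumptions chosen for the example, not anything described in the conversation.

```python
# P(Y=1 | value of Y's parent); hypothetical numbers.
P_Y1 = {0: 0.1, 1: 0.9}

def p_y(y: int, parent: int) -> float:
    """Probability that Y takes value y given its parent's value."""
    return P_Y1[parent] if y == 1 else 1.0 - P_Y1[parent]

def observational(model: str) -> dict:
    """Joint distribution P(X, Y) when nobody intervenes."""
    joint = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for u in (0, 1):                      # root variable, 50/50
        x = u                             # X equals the root draw in both models
        parent = x if model == "x_causes_y" else u
        for y in (0, 1):
            joint[(x, y)] += 0.5 * p_y(y, parent)
    return joint

def interventional(model: str, x_set: int) -> float:
    """P(Y=1 | do(X=x_set)): the experiment forces X, cutting its link to U."""
    total = 0.0
    for u in (0, 1):
        parent = x_set if model == "x_causes_y" else u
        total += 0.5 * P_Y1[parent]
    return total

same_data = observational("x_causes_y") == observational("hidden_common_cause")
effect_a = interventional("x_causes_y", 1)
effect_b = interventional("hidden_common_cause", 1)
print(same_data, effect_a, effect_b)
```

Both models produce the same observational joint distribution, so no amount of passively collected data separates them. Forcing X in an experiment does: Y responds in the first model and not in the second, which is the tie-breaking role of experiments described above.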

And ask questions that have never been asked before in the literature. So, again, like you and I both know this, I mean, what you did during your postdoc was really focused on this. The scientific literature is not a record of facts. It's a record of a debate under varying incentive structures. So, people are incentivized to publish the most charitable version of their findings. They're incentivized to downplay things that are inconsistent with the hypothesis that they're trying to support.

And then there will be papers published that sort of rebut that. So, I think it's obvious to me that you're not gonna be able to derive what science is happening in 2050 if you have just read those papers. That you're gonna have to do incremental experimental steps that build upon what has been done in the literature. But we're not gonna have some oracle, GPT-6 is not gonna be some oracle that can just reason from first principles conditioned on what we know currently in science.

And so, if you buy that sort of, like, basic premise, then your sort of immediate conclusion is, okay, how do we connect this with a scalable, experimental platform so that the model can push beyond what we know now? So, that's in essence what we're building at Lila, where half of the house is focused on scalable experiments. The other half of the house is focused on AI. But again, we view the experimental platform as a new token generator for the models that we're training.

So, are these robots in a room? What is the experimental side? Yeah. They are robots in a room. They are disembodied robot arms in a room. We have a system. So, we have an automated experimental platform where if you're familiar with how experiments work, they often work on plates. So, either like a 96 well plate or a 384 well plate. These plates magnetically levitate over this planar motor system that we have, and they can zip next to this big rail.

There are benches with experimental equipment on it, and the robot arm will pick the plate up off the rail, put it in the piece of equipment when it's done, put it back on the rail, and then the plate can zip off to the next stop. So, the abstraction that I have for this is that actually we're building this new kind of computer. And that this planar motor system, this rail, is essentially like a PCI bus. And what we're doing is hooking new devices onto this generalized PCI bus in the real world.

And the idea is not to have a couple of these stations that can do what we can do, it's to have buildings of these stations that can do experimentation at scale. And then it really does start to feel like a new kind of experimental cluster that we can pair with a traditional GPU cluster. Do you think, um, actually, do you like the characterization, because it occurs to me that it kind of feels like you're looking for a new scaling law or you're searching for a new scaling law.

Do you agree with that? Do you like that characterization? Is that fair or not? Literally, exactly how I describe it. Okay. Amazing. I've probably heard you say that to me and I'm just regurgitating stuff. Yeah. Literally, exactly how I do it. Yeah. And yeah, again, like it's just built on the recognition that, we've seen this a lot in large language models over the last three months. Like, they rely on verifiers and for some class of verification tasks, nature has to be the verifier.

And so, we're building a big, scalable, nature-based verifier so that these models can learn to hypothesize and reason about things that we don't really understand yet. And we think that that will unlock a new scaling paradigm in the same way that pure compute trained on Internet data unlocked the first scaling paradigm. You know, just to sort of rephrase, like science is subject to the Bitter Lesson, and we're trying to figure out in what ways it is subject to the Bitter Lesson.

So, one of the other things you said, and sort of the motivation for what you're doing, is that our existing paradigm, our existing large language models, you know, they can do so many things, right? So many things that, and I think the word you used, or you might've used, was byproducts, right? Mm-hmm.

They're almost, you know, there's no intent by the creators of these autocomplete models that they'd be able to solve differential diagnoses that are very tricky or pass the USMLEs or other things, right? This just emerged from the sort of scale of compute plus the other training that was applied to the models. But in describing that existing paradigm, I think you said that we are saturating or that it's getting saturated.

And I wonder, is that an empirical observation that you have of the sort of performance of these models over time? Or is it more of a sort of inevitability, first principles, um, deduction that you're making from the way in which the models are trained and the procedure that goes into creating them? Is it a, they're sort of saturating, like they're not getting better at the benchmarks, it can only get so much better.

Like, where's that sort of initial spark of LLMs will not be able to do X, Y, Z coming from? They were very bad at classes of benchmarks that required long-term reasoning and planning. An example of this is solving complicated math problems, solving complicated programming problems. By and large, simply pre-trained models are never best in class at those. Things that have what people call test-time compute, or reasoning capabilities, have taken over this. It's like the O series, right?

Like the O series from GPT or equivalents. Again, like putting my host hat back on to explain to some folks who aren't as technical in this: pre-training is simply predicting the next word. So you, you know, or the next token. You can do this with unstructured data. Reasoning models are trained by giving feedback that indicates how good their solution was.

In some sense, pre-trained models are trained to predict the average response; reasoning models are trained to produce the correct response. And so, the fact that we've already shifted from one paradigm in pre-training to reasoning slash test-time compute, I think is a good base case for saying that pre-training has saturated. Alright.
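The contrast between predicting the average response and producing the correct one can be caricatured in a few lines. This is a deliberately simplified sketch: the corpus, the prompt, and the function names are hypothetical, and real reasoning models are trained with feedback-driven optimization rather than the simple candidate selection shown here.

```python
from collections import Counter

# Hypothetical corpus of answers people wrote to the prompt "17 * 3 = ?".
corpus_answers = ["41", "51", "41", "54", "41", "51"]

def pretraining_style(answers: list) -> dict:
    """Likelihood-style fit: reproduce the empirical distribution of the data."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {ans: n / total for ans, n in counts.items()}

def verifier(answer: str) -> float:
    """Feedback signal: 1.0 if the answer checks out, otherwise 0.0."""
    return 1.0 if int(answer) == 17 * 3 else 0.0

def reasoning_style(candidates) -> str:
    """Feedback-driven choice: prefer whatever the verifier scores highest."""
    return max(candidates, key=verifier)

dist = pretraining_style(corpus_answers)
most_likely = max(dist, key=dist.get)            # what the data says most often
best = reasoning_style(set(corpus_answers))      # what the verifier rewards

print(most_likely, best)
```

In this toy corpus the most frequent answer is wrong, so matching the data distribution favors it, while the verifier-scored choice lands on the answer that actually checks out.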

And then one of the other points that you brought up that I'd also like you to just talk about a little bit more kind of reflects, I think, or resembles some of our conversation, uh, with Vijay Pande a couple episodes ago. And so, you know, he had this very successful academic career and then he transitioned to industry and to venture capital, right.

And I think he made a very compelling case for, despite himself moving, for why, uh, there are certain problems in academia that are likely only solvable within academia. And so, maybe my challenge for you, Andy, is like, can you steelman the case of sorts while having yourself gone on leave and gone to industry and at Lila. Can you steelman the case for staying in academia?

What are the types of problems that you should stay in academia, current funding crisis notwithstanding, to be able to solve? Yeah, I think there's a couple answers to this. One is the classic, which are problems that have no immediate or obvious commercial value.

So, like AI is kind of the opposite of that now, which is why it's so resource intensive, where there's a gold rush to commercialize all things AI. So, classes of problems that have no immediate commercial value and a more long-term horizon, those things totally are in scope, and that would include a lot of theory, both machine learning and other kinds of theory. I think a place in medical AI specifically that is uniquely well positioned for academics is evaluation.

So, like actually, doing the evaluation to see if AI results in patient benefit. NEJM AI, you, and Zak have been at the forefront of this, obviously. I think that there's a lot of perverse incentives for that once you get outside of academia. And so, having trusted auditors who can know whether or not the technology actually works is also obviously a great thing for academics to be working on that has like huge public health and patient benefit that goes along with it. Alright.

And maybe one last question before the lightning round. Oh no. Which I am so, so excited about, is you described Lila as having these sort of two different components, right? Like there's an experimental side robots that are moving plates around on these magnetic, what are...? Planar motor systems. Yeah, planar motor systems. Planar motor systems. That's the, that's the term. And then you have a sort of machine learning side, right? Mm-hmm.

That is developing models and training and doing computational work. What do you see as sort of the biggest challenges that you faced already at Lila and what is the sort of the, the sort of key task in the, you know, the thing that's keeping you up at night, maybe, to focus on for the next year or two and growing Lila and achieving your vision? Building stuff in the real world is hard, like actually building hardware.

I mean, this goes back to like early days of my life when I was an electrical engineer. And actually getting stuff to work in the real world is hard. And there are all these like edge cases, like moving these plates around. They have liquid in them, which means that they slosh, which means they could be slightly off. So, when the robot arm goes to pick it up, it's in a slightly different position. And so, there's like thousands of last mile challenges like that, that we're solving.

I think the like philosophical challenge though, is that all of automation is actually created for people. And so, one of the things that we're really focusing on is, like, what does automated experimentation look like when there are no people in the loop? The fact that I said that we put benches next to this planar motor system is a hint that these were actually still designed for people because people need a place to stand.

They need a place that needs to be approximately, like, shoulder height. And so, really there's like a second order set of challenges about how do we actually refactor a lot of these experimental workflows if they're just gonna be run by an AI in the cloud and you actually don't have to have humans in the lab standing there? That's probably the biggest challenge. And we're making lots of progress on that. We spend a lot of time thinking about it.

But if I was thinking about, really the core challenge, it would be like, how do we rethink these things from first principles given that we're doing something that really hasn't been done before? On the AI side of things, it's all the traditional things. Like we're not on O2 anymore. For those folks at Harvard who use the computing cluster there, we're not using Slurm. We're doing very complicated training flows on Kubernetes clusters that have all these orchestration things.

We're scaling up to thousands of GPUs now, and just like training at scale is very difficult. We are building like a unique set of training capabilities that gives the model access to a wide set of tools to use. And actually, orchestrating all of that together is also pretty challenging. I feel relatively better about the sort of AI challenges versus the challenges posed by the real world, but I'm confident that we'll be able to solve both sets.

Are you finding that the folks that you recruit or the way in which you recruit is very, very different than your academic lab, or are there, are there some similarities? There are similarities. I would say that the, the mission of AI for science does a lot of the recruiting for me when I talk about, we are trying to get an AI that can run the entire wheel of science to come up with hypotheses, test them and then update its understanding based on the result.

That's a pretty compelling message to a lot of people. We also are recruiting with industry resources and compensation packages versus academic compensation packages. That also makes things easier.

I still think that we get a lot of the same phenotypes though, of people who are cross-trained in either some hard science or medical science, who are also very like deep on the technical side of AI and can really make, again, like the recurring theme on this podcast is having multiple sets of expertise live in the same brain. And, you know, we found that that phenotype has also been good for us and also finds the mission pretty attractive. Alright.

I think that's a great moment to transition to what— Oh boy. —I'm super, super excited for, which is the lightning round. Oh God. So, and Andy, you know all the rules. Uh, so let's, let's dive in. Are you ready for this? I'm not, but let's do it. Alright. So, this first one is for your brothers, Andy, who is the GOAT—that already, that already got you scared— who is the GOAT, that is, the greatest of all time, of basketball: LeBron James or Michael Jordan?

Oh man. I, I, I feel like when you say greatest of all time, this is not just a statistical consideration, it's a cultural impact consideration. And I think by that, I'm gonna have to go M.J. I think that M.J. cha—. Wow. I think that M.J. changed basketball both globally and in the U.S. in a way that, like LeBron, while having a statistical claim to greatest of all time, I feel like doesn't have the cultural impact that Jordan had. I'm gonna disagree with you, but that's fine.

We can move on to the next question. What is the single biggest barrier preventing large language models from becoming trusted frontline decision support tools in clinical medicine? I'm gonna say that it's a mix of reliability. So, the obvious, like problems with hallucination. And that they still only represent a partial solution in a way that a person does not. So, this is getting better, but they can't use tools like if they have to pick up a phone and call someone, they still can't do that.

So, there are still like capability gaps that are unrelated to accuracy and reliability that I think still need to be filled before they could totally replace a lot of frontline decision-making services. Alright, our next question, which is the hardest job, and this is one of my favorite ones asked now since we, we've done it to Zak, and I think also to Larry Summers, but which is the hardest job: being tenure track faculty at Harvard, founding deputy editor of NEJM AI, or CTO of Lila.

Oh man, trying to get, trying to get me in trouble here, Raj. I'm gonna go with, I think, I don't, I feel like tenure track faculty just because it's not only the weight of your own ambitions, it's you're meeting people at this very vulnerable stage in their career. And I always felt like I internalized a lot of that. If a paper gets rejected, whatever, I have papers. But for students, those feel very like monumental decisions.

And so, I, I feel like that the rejection hit me harder for that reason than like day-to-day challenges in the other two jobs you mentioned. Great answer. And I think, again, reflecting how thoughtful you are as a mentor too, that you can separate sort of your perspective from your students. And I totally agree. It's a very, very important time and each thing feels, feels very, very important, each outcome. So, that is challenging to navigate.

Alright. If you weren't in AI, what job would you be doing? Think outside the box here. Well, so, I can tell you what I said in kindergarten. In kindergarten, I told my mom that I either wanted to be a Ghostbuster or a trashman. Both noble professions, but I don't think that's what I would answer now. If I wasn't in AI, I actually think some kind of writer. I always liked writing in undergrad. I always liked writing essays. I blogged a little bit during my postdoc.

I think some type of like Substack writer, something like that would be something that I would naturally enjoy. Yeah. Nice. I don't think I would've guessed that. So, I like it. Very great answer. Pro smash. The last professional smash rally. That's what, that's what I would've guessed. Yeah. Alright, next, next and our, our last question, if you could have dinner with one person, dead or alive, who would it be? I've also thought about this, and I have two answers, 'cause I knew this one was coming.

The first one is just an intellectual one, and I think it would be David Foster Wallace. I've read every book he's ever written except for The Pale King. I've read lots of his stuff over and over again, and I would just be dying to know what he thinks of the future that he largely predicted in a lot of his fiction and nonfiction. So, I think that would be it.

The sentimental answer is my grandmother, my nana. My mom's mom was the matriarch of our family. She died about 15 years ago and was always the one that would like, tell you exactly how it was. And like when you got Nana's approval, that was like the best approval that you could get because she was like a tough lady.

You know, a child of the Depression, lived through two world wars, went to college at a time when a lot of women weren't going to college, and was just like sort of the bedrock of our family. And so, I would just love to have dinner with her and kind of be like, so what do you think, Nana? And then she would tell me exactly what she thought, so. Nice. Well, congratulations Andy Beam. You have survived the lightning round. Passed it with flying colors. Great job.

Alright, so Andy, I just have maybe one or two last questions here. More big picture, some concluding thoughts, words of wisdom that you can leave us with, maybe. The first is, we talk a lot about this on the podcast, and listeners will know that we like to invoke the scale hypothesis as a way to think about large language models. And we've already talked about it in the context of LLMs, but also in the context of the work that you're doing at Lila.

And maybe, for the sake of this question, I can restrict this to medicine and applications of language models in medicine. So, there's this sort of current state of the models, right? Like, if we were to just freeze time, freeze the technical capabilities of the models, and ask what they can do, what they'll be able to do within medicine.

We all have predictions for where they are, with respect to the things that we have to do in diagnosis and treatment and other applications in medicine. And then there's another version of this, which is, how will these models sort of continue to evolve? Will they continue to evolve? How much better will they get?

And my question is, can you open up your crystal ball for us again within medicine and just forecast, invoking the scale hypothesis where you need to, what is going to happen with LLMs in medicine over the next few years? Yeah. So, again, just to come full circle, I consider the class of problems that I was talking about when Kristyn was in med school and during postdoc to be solved.

So, estimating the correct conditional probability of disease given symptoms, even if the symptoms are expressed partially, in an incomplete way, even if they need to be elicited from the patient: I consider that problem to be largely solved. I think I'm going to put forward two classes of problems that we should think about when thinking about the scale hypothesis for health care. It's automating what we already know how to do, and then doing things that we don't know how to do yet.

So, I would again say diagnosis is automating things that we already know how to do. My pediatrician missing the whooping cough thing, like we, someone knew how to do that. He just happened to get it wrong. The big thing that will change health care over the next one to three years is generalized computer use.

So, we've seen tools like this already from Operator, from OpenAI, from Claude agents, but the ability for AI to reliably use a mouse and keyboard solves probably like 90% of the remaining unsolved problems in AI, because you can just have it sit on a workstation and enter orders. And the question that you asked a while back about what stops it from being a frontline decision tool, I think that's solved when AI can use a mouse and a keyboard reliably to do long-time-horizon tasks.

So, the sort of operational aspects of medicine and health care that AI can't currently do, I think, will be solved by continuing to scale what they're currently doing with computer use. Go ahead. Would you include being able to operate ScholarOne? That might be an AGI-hard list of tasks. You know, like NP-hard. Yeah. Yeah. The final frontier. The final frontier: operating out-of-date web software that was written in the '90s over a weekend.

So, yeah, I think that's generalized computer use. We're probably one or two orders of magnitude of computing power away from making that reliable. But I'm guessing that will be solved over the next year. And when generalized computer use is solved, just like step one was solved as a byproduct, many other operational tasks will also be solved as a byproduct. And I know a lot of the frontier labs are pushing pretty hard on that.

Then there's like the unknown unknowns. So, there are some diseases that we don't know how to diagnose, that we don't know how to treat, that we don't even actually know how to classify. There are, you know, many things in public health and medicine generally that are just kind of like dark matter.

I think those are gonna have to be unlocked not by scaling, but something more akin to what we're doing at Lila and what other people are doing, where we're actually just using AI to make science go faster. I think that like that is gonna be more of a five-to-10 year time horizon. We're gonna need new measurement devices.

So, you know, I know that you know this deeply, Raj, but the resolution that you have on a patient's physiology from the electronic health record is like black-and-white television in the 1940s. What we actually need is like a hundred-foot 8K picture, and we just don't have that. So, we don't have high-resolution characterizations of patient physiology, and we'll need new devices that will enable that.

I always think about one of the things that we worked on but never finished in the lab, which was the ability to do non-invasive measurements via visible light spectroscopy. So maybe that's not the right technology, but some other sort of mass characterization of patient physiology that you can then feed to the AIs that are being trained in the current scaling paradigm feels like the next big unlock. AI making science go faster will indirectly make medicine and health care better over like a five-to-10-year period, but it's hard to know exactly how that's gonna play out.

Going back to one of your earlier answers, do you think that evaluation is gonna remain sort of the critical frontier, the critical academic task, for the next few years? Yeah, I think so. I think it has to, just 'cause there's gonna be a lot of stuff coming online. Integration and implementation science, too.

Like, how do you either retrofit Epic with this stuff or do a gut reno so that you can get this type of technology in? That also feels like a super necessary thing to have happen.

Another area that I think is also gonna take off is human-computer interaction. It's, of course, a very old discipline, but I think how humans and machines will work together is also poised to become very important for AI and medicine, and for actually getting these tools safely and effectively into the clinic in the next couple of years. Alright. Last question, Andy. So, we both give a lot of talks about AI to doctor audiences, in various academic grand rounds kind of settings.

And one of the questions I get asked the most in these settings is, physicians come up afterwards and they say, man, this is moving so fast. You know, great to hear your talk, but what should I study? What can I arm myself with? What can I learn so that I'm ready for this in the next couple of years?

And then this other sort of skepticism, which I honestly really like, which is like, okay, some of the stuff you showed was cool, but there's a lot of like hype and there's a lot of like, you know, BS that's out there. How do I tell what's real, what's not real?

And so, maybe thinking about the physician, the clinician listeners in the audience, what's your advice for staying up to date, other than listening to this podcast, but staying up to date with AI for providers, for clinicians specifically? Pick one of the frontier models and use it every day of your life. So, pay the 20 bucks for ChatGPT. Pay the 20 bucks for Claude. Pick one. Pick both, but use it for tasks. And see where it breaks.

So, if you're gonna make a taco recipe, ask ChatGPT for a taco recipe. If you're looking for things to do on your next vacation, ask the model how you would do that. If you're trying to generate an image for a talk, use the image generation capabilities in these models to do it.

I think that there's no single source of pedagogy that's gonna be helpful here, because the technology moves so fast, and you're gonna get a sense of what it can do and what it can't do just by getting the muscle memory of using it for tasks in your everyday life. So, I've had friends ask me this and I'm like, just use ChatGPT. Like, if you think you can't use it, try and use it for the task that you wanna do, and either you'll learn that it actually can do that, or you'll learn that, okay, here's a blind spot in these models. You'll learn when it hallucinates, you'll learn when to trust it and when to not trust it. And then, yeah, you'll kind of get an intuitive sense of how the models work and when they don't work.

I think it's probably not the best use of time to go read, like, the "Attention Is All You Need" paper or, you know, the RLHF papers. I think it's much better to have an intuitive, folk understanding of how the models work, and the best way to do that is just to practice every day with them and see when it breaks. Amazing. Alright, I think that's a great note to end on. And Andy, just gotta say, this was fantastic.

I know a lot about you already, but I learned a lot more on this episode, and thanks so much for coming on AI Grand Rounds. Yeah, highlight of my career. Thanks for having me on, Raj. This copyrighted podcast from the Massachusetts Medical Society may not be reproduced, distributed, or used for commercial purposes without prior permission of the Massachusetts Medical Society. For information on reusing NEJM Group podcasts, please visit the licensing and permissions page at the NEJM website.

Transcript source: Provided by creator in RSS feed