¶ Intro / Opening
Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. Today we're talking to Dr. Nate Gross, head of health, and Karan Singhal, who leads health AI research at OpenAI. We'll cover what went into training models to handle sensitive questions and how it's helping clinicians, patients, and healthcare systems.
We actually worked really closely with a cohort of around 250 physicians across every stage of generation of this data. And we're starting to see medications that have been sitting on a shelf that, all of a sudden, AI has found ways for them to have... How did you find your way into healthcare?
¶ Origins of Nate and Karan's interest in AI and healthcare
So what drew me to healthcare initially was health policy. I was very interested. This was before the first Obama election, and value-based care was first becoming a thing. I started studying different ways to make healthcare more accessible to more people, and then eventually went to Emory for medical school. What drew me to that was a large public hospital, Grady Hospital.
You know, to make sure that you're taking advantage of every clinical hour you have. So what kind of things were you doing? So I was mostly pissing off the IT department. When I was in medical school, the News Feed came out, the iPhone came out, Twitter came out, the App Store came out.
And so comparing the technology that we had as doctors, which was fax machine, clipboard, paper binder, the beginnings of electronic health records, to what my friends had or what the patients had in the waiting room was pretty profound. So you come at it from the point of view of an AI researcher. Where did your interest in applying this to healthcare come from? So I nerded out a lot when I was younger about things like philosophy of mind, and I thought a lot about
intelligence, and how far intelligence could go, and whether machines could be intelligent. As I was learning about AI and starting to work on my first AI projects, a lot of those explorations took me towards thinking about the ways in which AI could have a lot of impact on humanity in the future. I didn't predict the future or how fast it would happen, but I thought something like AGI would happen within our lifetimes.
So once I had that conviction, I thought a lot about the ways in which I could either have positive impact, and hopefully make that a really large upside for humanity, or think about the ways in which we could avoid downside. Since then in my career, I've been thinking a lot about both sides of that coin, including from the perspective of a safety researcher, which is part of my background.
And then some of that work on safety and privacy that I was doing previously, I started applying in healthcare. And then I started being like, whoa, there's a really massive opportunity to think about the application of this technology, especially large language models, in healthcare. What took me to transitioning to it full time was just the size of that opportunity and the fact that I felt like the healthcare and clinical AI world was kind of
not fully aware of that gap. And so I just thought it was a really amazing opportunity and responsibility to bring us there. I want to understand both the vision and how this is actually going to be implemented. So our mission at OpenAI is to ensure that AGI benefits all of humanity. And health is one of the places where I think that is not only most achievable, but clearest. Healthcare today, as everyone knows, is fragmented.
Care is missed left and right. Patients are often left 364 days per year without the opportunity to engage with the organizations that have the information centralized. And doctors have extremely limited time, when they do get that chance to engage with the patient, to actually have a meaningful impact beyond a simple surgery or a simple reactive prescription. The system is more reactive than it is proactive today. And that leads to tremendous
challenges in the system. It leads to tremendous gaps in care. It leads to leaving people behind in situations when they could be thriving. One of the reasons that I joined OpenAI is because access has always been a through line in my life. Access to knowledge.
First in medicine, then in building a product for doctors to access the latest medical literature, and then in supporting entrepreneurs as they were building healthcare tools. But OpenAI has the type of technology that can do that at scale for the entire ecosystem all at once: help patients, help healthcare professionals, and help incredible entrepreneurs who are building for all of the corners and edge cases and tough challenges that exist in each area of the health market.
What is the strategy here? We know that people use chatbots all the time now for medical questions, but it seems like you're building and working towards something bigger and more comprehensive, not just for the patient side, but the clinician side.
¶ Strategy for building AI tools for clinicians
What are your goals? Patients are increasingly turning to tools like ChatGPT throughout the year. In fact, 900 million people now use ChatGPT per week. And if you look at how many are doing health-related queries, it's about one in four during a given week. So that's 40 million people per day. And so our strategy in health is as much proactive as it is reactive, stepping up to the responsibility and the opportunity to do good that comes with that strong consumer demand.
And so with ChatGPT Health we have created a space to keep these conversations not just secure but empowered. When I say secure, of course that means encrypted, with this essentially one-way valve protecting your conversations: these extra security layers, these protections to make sure that we will never train on users' healthcare data. That's combined with empowerment. Search engines that people have used before to navigate health
have amnesia. They're one size fits all, and I think context really matters in healthcare. So building a series of features and technology hooks to help patients bring in the context that they choose, so that each time they choose to engage with AI it's grounded in their own context, is a key reason why we've built this ChatGPT Health foundation.
So I understand the safeguards you put in place to keep the data separate, to make sure that nothing leaks between there, and to undergo a very rigorous method of making sure that your data's secure. But when it comes to the model itself,
¶ How AI models are trained for health use cases
what goes into training models that are capable of working with something like healthcare? It's kind of like the most important thing in the world. For sure. It's a high-stakes domain, and because of the use that people are making of it, it's super important that we get it right.
So we think about a few things when we think about evaluation and training for healthcare. And this is actually the foundation for the work on health at OpenAI. When we were first starting to work on the health effort at OpenAI, we were thinking a lot about the safety and grounding motivation as an important part of what we were doing. And so part of the thesis for starting work on health at OpenAI
was thinking this is an excellent way to ground our work in safety and alignment and provide concrete incentives and a feedback loop for researchers who think about this problem. So the model improvements and the safety thinking here are not just an afterthought. They're actually the beginning of our work here. And so where we started really was thinking about evaluation.
Can we think about the ways in which models were already starting to become useful to people then? There was already starting to be this capability overhang between what the models could do and what people were using them for. And so we started to navigate that problem and think about where the models still have gaps today. That's where our work on evaluation comes in.
And so we've taken a methodologically interesting approach to that. A lot of that is reflected in our work on HealthBench, which is this realistic evaluation of conversations between users, who are either health professionals or consumers, and models. It measures the performance and safety of the models in these situations, which are multi-turn conversations.
And the way we worked on this is we actually worked really closely with a cohort of around 250 physicians across every stage of generation of this data, from thinking about
the areas that we would focus on for the evaluation, the areas that we thought were going to be clinically relevant or impactful, to the specific things that are being graded in this evaluation.
So that's a range of things: are you tailoring your response to a layperson versus a more technical health professional? Are you thinking about whether you should seek context first before providing an initial response? The models are much better today at seeking context when needed, because users are typing in much less than the models often need to be able to provide
information that's most helpful. "It burns." Exactly. If a user types in "it burns," how do you think about the right way to provide information? You can provide some initial information, potentially based on an impression you might have of what the user might be saying, but the most helpful and the safest thing to do in that situation is actually to ask for more context.
And so that's just one example of the many ways that we measured performance in HealthBench. HealthBench in particular actually measured around 49,000 different dimensions of performance, and that's just one possible dimension.
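As a rough illustration of rubric-based grading of the kind described here, the sketch below scores a single response against a few physician-style criteria. The criterion texts, the point values, the `Criterion` and `rubric_score` names, and the normalization (earned points divided by the maximum positive points, clipped to the range 0 to 1) are all assumptions made for illustration; none of them are stated in this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item. All example values below are hypothetical."""
    description: str
    points: int  # positive for desired behavior, negative for harmful behavior


def rubric_score(criteria, met):
    """Return earned points / maximum positive points, clipped to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("criteria and met must have the same length")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive criterion")
    # Met negative criteria subtract points, so unsafe answers score lower.
    earned = sum(c.points for c, ok in zip(criteria, met) if ok)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for a user who only typed "it burns".
example_rubric = [
    Criterion("Asks for more context before giving advice", 5),
    Criterion("Uses language suited to a layperson", 3),
    Criterion("States a specific diagnosis with unwarranted confidence", -4),
]

# A grader (human or model) decides which criteria the response met.
print(rubric_score(example_rubric, [True, True, False]))  # 1.0
print(rubric_score(example_rubric, [True, False, True]))  # 0.125
```

Each criterion acts as one "dimension" of performance, so a response that seeks context but also overstates a diagnosis loses most of its credit rather than passing or failing outright.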
So this is a very multifaceted evaluation that we built in concert with this cohort of 250 physicians over a long period of time. It took us about a year end to end to work on that evaluation and then release it. And in the model
¶ How OpenAI is able to score well on health evals
development cycle, it seems like sometimes some company gets a bit ahead and somebody else comes up and catches up. I've noticed a pattern with the OpenAI models: they've consistently been far ahead on HealthBench and other evals, by a big margin. Why is that? I think we have a pretty dedicated and serious effort here that is cross-functional and across the stack, for everything from pre-deployment evals
like HealthBench, to monitoring production traffic and thinking about the ways in which we are ensuring safety in production traffic in a privacy-preserving way, working with physicians across every step of that process. To my knowledge, OpenAI's models are the only major models where every phase of model training, from pre-training to mid-training to post-training and every step in between, really integrates health.
And I think the result is that our models are pretty good, not just on our own benchmarks, but also on the benchmarks that other people put together. I'd like to add a little to what Karan said about the model training, because when we spend time with the healthcare ecosystem, that's one of the things that is most important to them. Not only were these models
trained in development with hundreds of physicians, who created over 5,000 conversations and 48,500 rubric criteria through which to evaluate AI responses, score them, and identify ways that we could improve the model: do additional data acquisition, do additional post-training, hone in on a particular subspecialty or particular area of the world where users were telling us we could improve
health or healthcare in that specific topic. In addition, I think that close proximity to physicians really leads to calling out the most important parts that should be focused on in model development. In other places, I sometimes see how a model fared on a medical school exam or a board exam.
And healthcare is not multiple choice. Patients are coming in with a tremendous amount of complexity and their own stories, nuance, and context, and that's presented in many different ways. Part of the job of working in healthcare is being able to draw from those disparate sources, draw from experience, and balance all that in your head. And so having a training mechanism that thinks about things like
when to escalate and how to escalate, and keeps that always as the top priority, or adaptive literacy, comparing the one-size-fits-all handouts that people get when they visit the doctor today to a model that can respond differently when it knows you're an oncologist versus a primary care doctor versus a pharmacist in Kenya, versus a patient at the twelfth-grade or the third-grade literacy level,
is extremely important, not only for making sure that accuracy and impact are maximized, but also to make sure that everyone can maximally participate in their own care on the patient side. And then finally,
uncertainty. If you go back a year and a half, many of the mistakes people would call out about AI models were overconfident hallucinations. And I think in such a high-stakes field as healthcare, one of the most important things is that the model can be trained to better know when it doesn't know, and say that.
And in addition, suggest follow-up that can be dug into, either by the patient in a referral to the healthcare system or by the doctor, if the doctor is using the model: a test that they might run, additional pathways they may go down to make sure that the patient can be led to the best possible outcome. We've seen the cost of intelligence drop every year, and it's exciting because every year you're able to get better answers, medicine,
¶ Key challenges deploying AI in healthcare
everything healthcare, across the board. But what are the challenges? What are going to be the blockers, or what are you looking at ahead and saying, okay, we have to solve for this? The drop in the cost of intelligence has been super exciting here because so much of what we think about and care about is actually about access. The more people have access to technology, the more people will benefit. And that's why we're
working on rolling out ChatGPT Health more widely to all free users. So that's incredibly exciting. Another thing that we think about as researchers is where the marginal gains in intelligence will compound the most. I think Nate mentioned this exciting thing, which is that there is more and more data being collected across different modalities.
How do you think about integrating that data across all the different ways that people use ChatGPT and all the different modalities that people are collecting: wearables, lab tests, things like this?
That's one place where I think a lot of the intelligence will compound, and we'll start to see new zero-to-one capabilities, like a model that looks at my entire history over a decade and gives me a prediction that even a human couldn't have, just because the model has a larger context size. Those zero-to-one capabilities, I think, are going to be really cool.
The other thing we keep in mind is how people are thinking about and using ChatGPT today. Can we measure that? Can we improve that? I think we're at this interesting point right now. On our team I call this the transition. For context, I bike to work. When I bike to work, I wear my helmet, and I worry about cars and things like this next to me.
Here in SF we have a bunch of self-driving cars, including Waymos. I just reached the point where, when I'm biking next to a Waymo, I actually feel safer than if I was biking next to a human driver. I don't worry about whether I'm in their blind spot or anything like this.
So I feel this protective effect by being next to the Waymo. And I want everybody to have this protective effect with health AI. There are studies showing that if you have a doctor in your family, that adds a protective effect to your health as well. And I want everybody, whether they're a patient or a health professional,
to feel that. As a patient, you want to feel safer having this. As a health professional, you want this to be a safety net for the decisions that you're making. So that's another frontier that I think we're going to cross in the next six months or so, which is really exciting: this kind of inflection point.
Another thing that we're thinking about is the right way to approach post-deployment monitoring of certain workflows. A good example here that I'd love to talk about is the AI clinical copilot study that we did with Penda Health. This was a study where we worked with 20 or so clinics in Nairobi and thought about the ways in which we can deploy a safety net for clinicians in that context,
which is basically monitoring the things that they type into their electronic health record and only interrupting their flow when there's something potentially concerning going on, such as a potential error. What we found is that when we deployed this to clinicians in this setting, there was a statistically significant reduction in diagnostic and treatment errors for the clinicians who were using this tool versus not. And I think
this is a step in the direction of moving beyond model evaluations, and even beyond monitoring the ways in which people are using ChatGPT today, to actually thinking about the workflows in which these technologies can be deployed and the right ways to evaluate those workflows after deployment. I think that's another frontier that we are really excited about and would love to see more of from our partners. Nate, what do you think the challenges are going to be?
I'll start by talking through some of the challenges that exist on the professional side. When healthcare professionals use AI, they're looking for the ability to trust what they're seeing in the answer. And so a lot of our recent work has been making sure that the answers the AI provides are grounded not just in what the model was trained on, but in the latest medical literature, the latest guidelines,
and sometimes the latest guidance from their own institution or their own region. Some conditions are treated differently in different areas of the country. Other times, different care settings have different levels of resources, different levels of
specialists and additional services on hand, and it can be helpful as a healthcare professional to be able to quickly navigate that and come up with completely personalized care plans. And so building connectivity within ChatGPT to not only be HIPAA-aligned and usable in these secure environments,
but also be able to combine sensitive information with the latest medical knowledge, I think is a great path that we've started down, and something that will continue to keep trust as the top priority in how healthcare professionals engage with AI. One of the other challenges is that the systems themselves in healthcare are quite siloed, both at the organization level and in the tools that have to be used within each organization.
AI thus far has been deployed on really a point-solution basis in the technology industry. Increasingly, the connectivity is becoming available to connect the dots between the hundreds of different systems: some analog, some digital, some structured, some unstructured, many decentralized, many not on the cloud. Being able to connect all of those through unified AI layers, to actually make sure that patients and information aren't
falling through the cracks and that the connectivity can be maximized to bring the greatest amount of impact: that's hard in healthcare, and it's certainly not something that we can say is solved. But with many of our recent products, ranging from ChatGPT for Healthcare and its connectivity to apps and connectors, to the OpenAI API for healthcare, to our frontier foundation for models and agents,
we think increasingly there's going to be an opportunity to really accelerate what is possible within the healthcare system and what agents can achieve.
¶ Collaboration with hospitals and healthcare systems
Part of this seems very collaborative, working with the healthcare industry. I noticed when using the ChatGPT Health app, the first thing I was able to do was put in my records and get all of that. And it looked like there was a lot of cooperation across the ecosystem to do this. How has that come to be? Where is this headed?
It's extremely important that all of the healthcare system has an equal chance to contribute and engage, nationally and internationally, in providing the context that will help empower patients to receive the best possible answers from ChatGPT. On the electronic health record side, this means working with the government and the Centers for Medicare and Medicaid Services,
adopting national standards for electronic health record syncing so that patients, in just a few taps, are able to bring in their context in consented ways. It's being able to tap into existing standards, like mobile phones and the most popular consumer health products and the most popular biosensors and wearables, to make sure, again, that in just one or two taps patients are able to not only bring in that information,
but leverage it in thoughtful ways, in ways that may not have been possible without the combined set of data that can exist in this sort of ecosystem. So for instance, being able to reference your recent exercise activity when making a
plan for how to spend your evening, or even do things as simple as reference your overnight sleep and stress when your agent is helping you set your calendar for the next day and decide what tasks you may take on first. It's very exciting. You know, I wear a smart ring, a watch, whatever, but I get this data and all I have in my apps are rings to look at and go, okay, I guess it's doing something.
¶ Practical everyday uses of AI health assistants
Being able to plug it into ChatGPT has been fantastic, because now I'm able to ask those kinds of questions. That's very exciting. What you're talking about, too, is that if you get a plan or suggestions from your doctor, you can literally say, hey, I didn't walk enough yesterday, what should I do today? I've had to be really good at menu planning, and I literally film a menu and say, tell me what to order. And so you're saying we're just going to get more of that, and much better.
Yeah, and that's why I believe our partnerships are so important, because in these instances ChatGPT doesn't replace the incredible technology that our partners are building to go deep on health insights for a particular wearable. But our surface area, our opportunity to bring in that health information, can now extend to the many different ways people use ChatGPT, such as
what they're going to cook for dinner or how they're going to plan their afternoon. Sometimes I think of two patients. One patient has to navigate the healthcare system by themselves. And the other patient maybe has a spouse come with them, who has a clipboard and used to work as a healthcare professional and is
very attentive, if not neurotic, and can follow up on details and is connected to your personal calendar. The best aspects of that, with consent, for the patients that want it, I think represent a future where we can make it easier and easier for patients to follow care plans, to play active,
captain-like roles in their own health, in partnership with their care teams and their physicians. And I think if we can remove a lot of the friction that historically exists between those processes, whether it's information not following the patient, or there being a lot to keep track of, or a lot of old information to parse and bring in, we can do a tremendous amount of good, or we can help
patients themselves be empowered to do a tremendous amount of good in their own care plan. And you know as a physician that it's hard to give as much time as you would like, because you're always going to have more patients to deal with than you have hours in the day. It's interesting to see a technology that has infinite time and infinite patience be able to act as a complement to that.
I mean, if there's one thing that healthcare professionals are short on, it's time. When we think about our role internally at OpenAI, we often break down the work that we're doing into three buckets. First, raise the floor: make sure that AI and the benefits of AI are accessible to everyone. That could be patients, that could be healthcare professionals and others working in health-related industries.
Second, sweep the floor, which means help doctors and other health professionals save time from the tremendous administrative and bureaucratic burdens that they have every day, so that they can spend more time with their patients. And thirdly, raise the ceiling. The impact that AI can have in healthcare, I think, will allow us to look back on this space in a few years and say, wow, we have all accelerated together in a way that
medicine is still in the driver's seat, but is also far more empowered than ever before. Yeah. I don't think anybody feels like their doctor spent too much time with them. So it looks like this is going to be helpful in solving for that.
¶ Biggest "wow" moment during development
What has been your favorite aha, or wow, or this-is-really-cool moment at the intersection of AI and healthcare? I'll answer your question in a non-standard way, which is that I think the most amazing thing for me to see in the last year has been the rate of adoption of health, even before we announced the ChatGPT Health product.
Health and wellness questions have been one of our fastest-growing use cases. And we shared that hundreds of millions of people a week are starting to use ChatGPT for health and wellness. I think seeing that rapid growth, especially coming from a background of
being motivated to work on this problem because I felt like the healthcare and clinical AI world was not super aware of the potential of LLMs in healthcare, and seeing how far we had come, has been a really special moment for me. There's no doubt that the adoption of this technology, and the fact that it is increasingly collaborative with the healthcare system and increasingly driving feedback loops back to us to improve the models,
is the most meaningful thing and the most mission-aligned thing. But what I also get excited about is what our research team is increasingly able to give back to them using that feedback. Not only is it the capabilities of the models, but it's what can be unlocked once those models are allowed to run longer and have more context. We're starting to see discoveries of medications that have been sitting on a shelf
that all of a sudden AI has found ways for them to have meaningful and direct value in patients' lives. It is starting to scale experiments that we as individuals wouldn't have been able to juggle on our own. And that partnership, combined with that increased capability to finally move from being interesting to being useful and increasingly to being transformative, is I think the most exciting thing for us heading into this year.
Now that you've been working on this for some time, engaging with clinicians and talking to people helping deploy this, what has been some of the feedback you've seen? I think of the experience of flying to Nairobi and seeing the clinicians using the tool and the ways in which
¶ Feedback from clinicians and early users
We did this thing which we call active change management, where we worked really closely with these clinicians and flew to Kenya a couple of times to think about the ways that we could deepen their workflows using the AI tool and make it something that not only made sense to them but actually became something that was
indispensable for them. As we were concluding the study, the team at Penda Health was thinking about potentially running another study. And they actually had a lot of hesitance around running another study, because that would have involved having
some group of clinicians using AI and some group of clinicians not using AI, and they actually felt that it was dangerous to have a group of clinicians not using the AI. And so that's the point at which
I was like, wow, we have done something major here. I think the stories that we get back from our members every day are one of the most meaningful parts of the job. These are from caregivers who are increasingly under strain, taking care of family members while trying to navigate their own health at the same time.
These are from doctors and nurses who are truly overloaded every day, and we can help them extend their expertise and compress the tough parts of their day a little bit more.
And then sometimes, and this is more rare but increasing, it's the miracle cases. It's the patient who had been bouncing around the system for years, the unsolved diagnosis, the emergency where information wasn't present. Suddenly being able to step in and assist and accelerate and bring people into the care that could really help is
truly a privilege. It's exciting. It is an amplifier, and every doctor I know wants to be able to do more for their patients. Thank you very much. This has been very interesting, guys. Thank you.
