[AUDIO LOGO]
How do clinical trials really work? And more importantly, do they work for everyone? Are you being given medical advice really meant for someone else? We touch on this and more today on Data Nation from MIT's Institute for Data, Systems, and Society. I'm Liberty Vittert. Today, my co-host, Munther Dahleh, the founding director of MIT's Institute for Data, Systems, and Society, and I are speaking with Regina Barzilay.
So I wanted to kick us off by talking a little bit about your historical trajectory. When you came to MIT, you were an expert in NLP. And now, you're doing all this work in cancer, and in general, and in health. And maybe you can take us a little bit into your trajectory and what got you there. And I know that you had your own personal struggles with that, so we'd love to hear more about that.
So thank you. You did know me when I started at MIT 20 years ago.
So when I came to MIT, I was working primarily on natural language processing. And at the time, natural language processing was not what it is today. Now everybody knows what ChatGPT is and has used Google Translate, speech recognition, and all the chatbots. At the time when I entered MIT, it was still a really young, budding field. Translation was not something that people used.
The tools were only known to experts, and most of the time when I say I'm doing natural language processing, I had to explain what does it mean because people really couldn't visualize any tools in this area. And I was really fortunate to be part of growth of this field, which moved from this very kind of experimental laboratory science to something that became a commodity that is used today in so many products, and academic research, in our personal life, in so many industries.
And it was really exciting time. But at around 2014, so I personally became sick with breast cancer. And I remember clearly that one of the big surprises for the first time being really sick is discovering that when you go to a hospital, you actually don't see any machine learning in any form, even data science, or even barely statistics.
Even though I was treated at MGH, which is just one subway stop away from MIT, none of this great technology is part of the patient experience. And when you are treated or your loved one is treated for some really serious disease, information is really key because you need to make various decisions. And there is no clear answer, and you really want to know what happened to patients like me. What is the likely side effect that I'm going to get-- all these types of questions.
You cannot get answers. You still cannot get answers to many of these questions. But at the time, it was really eye opening for me. And when I came back to MIT, my first inclination was that I absolutely have to change it. And that's how I made my transition into life sciences, and I've been working in this field since 2015, for the past eight years. And today, I work primarily in drug discovery and molecular modeling, but I still do some clinical AI, like imaging and other things.
I think that's such an interesting way to dig into so many important topics that you're working on. And one that I was sort of fascinated by was something that I feel like people have talked about, but a lot of people don't really understand in the same way as what is NLP, or what are these large language models. And it's that people talk about how there's a lack of diversity in the data sets that are training these AI or ML models or whatever you want to call it.
And could you give some real life examples, even from your own experience, of what that even means? What are the consequences of having a lack of diversity in these data sets?
So the lack of diversity-- I would say in majority of areas of clinical AI, before we even start talking about diversity, we really lack data sets to start with. So even if you're looking at the most basic areas-- let's say you want to look at mammograms. You can say what's so special about it?
The vast majority of women age 40 go and do their mammograms across the country, in rural hospitals, in high end hospitals. Everybody do mammograms. There is no publicly available data set of mammograms that can be used. So if you are a machine learning researcher, data science researcher, you want to try your algorithm on this data set, it doesn't exist.
So the only way for you to get to this data is to connect to a collaborator in the hospital, and then you need to go through extremely challenging process of actually getting data. And for vast majority of people who work in computer vision, data science, machine learning, there is no clear pathway to get to this data. And we're not only talking about mammograms, about variety of other diseases. Now, in some areas we do have data set.
And here one example relates to a disease for which risk assessment and diagnosis are crucial: lung cancer, given the high mortality of this disease. So in that particular case, there is a really big, rich data set of CT scans, which was part of a clinical trial, where you have an image and you know what happened to this patient within six years, so you can train the model.
And the interesting part is that the way the patients were included, the inclusion criteria were such that 95% of the cohort are White people. Which means that other groups, such as African-Americans, who have high mortality from lung cancer, as well as Hispanics and Asians, are not represented in the data set. And then you're training your model, which looks at particular characteristics of the data, and there is really no way for us to say whether they would generalize to the other populations.
But we can go even one step further. Not only necessarily think about the race, but even if you're looking at people who are non smokers-- because one of the things that we observe, and it's true around the world and in the United States, there is significant increase, actually frightening increase in the population who were never smokers, never, who get lung cancer. However, the cohort, the way it was constructed only has people who are very heavy smokers.
So the models that you train-- and there were significant investment of National Cancer Institute in creating this data set. At the end, it doesn't really deliver the goods because you optimize for one type of population while you really want to apply it to the general population. So most of these data sets that we have access today collectively were not designed with machine learning and data science in mind. They were created for other purposes.
And all these different kind of bias, collection mechanisms have direct impact on the models that we are developing.
This is actually great and maybe a segue a little bit to what you mentioned earlier about drug discovery. So my understanding in that field is that you're working at multiple scales. On one scale, you're looking at chemical and molecular biology and interaction at the detail level. But then you also have experimental data, and then at some level, you have observational data.
And this data is to be integrated in an interesting way to propose new drugs that then we continue the cycle. So one, just it's complicated to do at this scale, integrating these multi scale. And the other one is, again, the question of bias and diversity and really looking at the different parts of the population and so forth. So since you are in this field, now, what are your thoughts about this?
So data actually penetrates all the levels of discovery because whenever you are trying to predict basic thing-- when you're thinking I take small molecule. The vast majority of drugs that people are taking, it's a small molecule. That's something that, for instance, you take it in. And many of them, the way they operate in our body, there is some kind of misregulated proteins, for instance, then molecule binds and puts a patch, and then we get the behavior that we want.
Of course, there are many mechanisms, but this is one of them. And you need to understand, if I put the small molecule, where does it go? Ideally, it would go to the place that it should go to help with misregulated protein, but sometimes that's why we had side effects because it goes to many other places that we didn't expect it to go. And it also depend on our own chemistry. So the basic step for many points in drug discovery is really understanding this is a small building block.
If I give you the proteins and molecule, where would it go, and where and how it connects. So you learn it based on the data. People couldn't solve this problem for decades using more traditional physics based approaches. While deep learning came in, where you gave the protein, the small molecule how geometrically they're connected, and then you can train the model to put them together. So data is everywhere.
But like with a clinical data, there are certain questions for which we don't have data today because it's easy in the pharmaceutical companies, who don't want to share or for many other reasons. Like, for instance, if you don't only want to know where it connects but you want to know the affinity-- how strongly it connects-- it's actually really non-trivial to get data for that. It doesn't exist. So again, we're not solving the problem where we really need machine learning.
Many times we're solving the problem because the data is available or not available. But talking about bias there, there are many interesting ways how the bias come there into place. And let me give you an example related to data availability. So as I said, the lowest level, you just want to understand how molecules interact, but then you have another-- how do you know what is a good target? Like, you now need to decide which protein you want to connect in the first place.
And one important data resource that is used around the world is UK Biobank. UK Biobank is a big collection of medical data about the patients, which is de-identified. You can apply. You can get it. It is used in pharmaceutical industry. It is used in academia. And one interesting question here is that we are now learning and designing drugs based on one population, on UK population.
And I'm sure some of the drugs would work equally well across, but there will be some drugs which will only work for certain population, but we don't have access to this data. And this is really a funny situation when on one hand, people really emphasizing the privacy. And you say, we want to keep our data. We don't want to share our data, but on the other hand, it's fine. You don't share, but then whatever is developed may not be developed for people like you.
And we know that there are a lot of diseases, which really are very much ingrained in your genetic pool. And if it is underrepresented in the set, most likely it's not going to be very effective on you.
That brings up so many interesting questions about bias to me because as you mentioned, whenever I think about bias in AI, the examples that come to my mind really have to do with ethnicity or gender. And I never thought about smokers versus nonsmokers, or, I imagine, age-- young kids.
The treatment for a seven-year-old is going to be potentially very different from the treatment for a 40-year-old. And so how is it possible-- I mean, I'm sure there's lots of different angles that you could come at to fix this. I mean, is one option to have regulations imposed on declaring sources of data used to train these algorithms, or what angle would you come at to fix this?
So I think that the danger is unknown unknowns because if you let's say you mentioned gender, age maybe like a history, clinical history. If you look at-- an even most traditional papers that is published in medicine, if you use normal statistical model with predictive power, they will break it up and say, this is what happened to women, this is to men. This is for different ethnicities.
This part you can directly apply it and see, and now, there is a lot of, obviously, emphasis, making sure that all the different parts of the population are represented. But the problem is that most of the time when we are getting these data sets-- and now, it's even funnier because people who create data sets are not the one who analyze data sets. We really don't know what is the bias.
And this is one of the things that I find particularly troubling with the recent FDA regulations that just recently came out and we as a public are supposed to comment on them. Relates to the fact that the way they think that we can address the problem of bias is if I give you all the statistics on my training, and I'm going to give you all the statistics on the test, and then you miraculously would know whether you can apply it to your population.
But the population is not just described by eight variables. And it can be something that we haven't even measured, and we don't know that it's a bias, or there is some particular skew. And that's why I think that thinking that we can explicitly put the tables to a human and we with our mind can identify the biases and abnormality, I think it's really misguided idea because it is a statistical question. If I now give you a population, can you tell me?
Is it distributionally similar to that other population? Can I identify in my data that was collected by many different centers? What is the portion of the data that's somehow statistically abnormal? Can I have a tool that will tell me-- I train my model on this population. Now, I'm applying it in my hospital. That will tell me you shouldn't be trusting me on this patient, on this specific patient, because it is different.
And the example that I always give in this case, when we are thinking about our cars or our microwaves. They all have this device that tells you the car is not working. You need to bring it to the garage. It's not because we understand it, because we can open it and see what happens. There is something that alerts you and say you shouldn't be using it. We don't have it today in our clinical AI models.
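One way to picture the "warning light" idea for a single measured variable is a two-sample Kolmogorov-Smirnov comparison between the training cohort and the local population. The following is only a minimal sketch, not a method described in this conversation: the function names and the age lists are hypothetical, it checks one variable at a time, and it cannot detect skew in variables that were never measured.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a at or below x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b at or below x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def looks_shifted(sample_a, sample_b, alpha=0.05):
    """True when the gap exceeds the large-sample KS critical value."""
    n, m = len(sample_a), len(sample_b)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Hypothetical ages: the training cohort versus two deployment sites.
training_ages = [55 + (i % 20) for i in range(200)]       # ages 55 to 74
similar_ages = [55 + ((i * 7) % 20) for i in range(200)]  # same ages, different order
younger_ages = [30 + (i % 20) for i in range(200)]        # ages 30 to 49

print(looks_shifted(training_ages, similar_ages))  # same distribution
print(looks_shifted(training_ages, younger_ages))  # clearly different distribution
```

A per-patient alert would need more than this, for example a distance from the training data computed over many variables at once, but the same principle applies: the check is statistical, not something a person reads off a summary table.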
So I think, of course, there is an effort, and we should put effort in collecting and making sure that it is as representative of our population as possible. But I think another big part of it is actually algorithms and statistical models that can identify this troublesome population and prevent them from happening.
So Regina, I think what you're alluding to is a complete system change for us to be able to do this right.
Because at one level you mentioned that some of the data is available, but it's not shared. Another level, we actually don't have enough diversity in the clinical trials. And then we have failures of the system to be statistically correct for a certain subgroup of people, which requires some measurement and so forth. So these are very different things.
One is almost like you have to create a data sharing market, and the other one you have to encourage people that are of not so prevalent features to go for clinical trials. And the third one is actually a better understanding of the statistical implications of all this work. I mean, who is doing all of this? Who's actually bringing all of this together?
So let me start with the third one because this is something that we at MIT actually have capacity to do.
I think that there are a lot of very hard algorithmic questions that are currently not addressed because for the longest time in machine learning, be it clinical machine learning or any other machine learning, the only things that we cared about is our accuracy on the test set. So the first step was to realize that actually we are not testing it on the test set.
We're testing it on a diverse population, and every machine learning is a transfer learning because there will always be distributional shift, no matter what you do. But then there are all these other questions that were very strongly studied in traditional statistics, like uncertainty estimation, calibration. All these things, they were not second class citizens. They were n plus 1 citizens in traditional machine learning.
So I think that one of the things that the technical community needs to do-- and now, it's happening, but maybe a bit slow. But really bring this capacity that we need to have available to the regulators, to the practitioners to provide them as part of the models that optimize accuracy. Like I could give you even a simple question that we had. We have a hospital network in developing countries that utilizes the AI tools that we develop in general clinic.
And one thing that we've done was to say, OK, I have my tool. I tested it fine on many populations, but now, I'm going into your population. I am going to validate it on 10,000 examples to make sure that it predicts correctly, and then you can use it. And in many of these hospitals, people say, we don't have 10,000. It costs us a lot. And then others said, so 10,000 is enough. Maybe I should do 20,000.
So in ideal case, you would have some statistical estimate that says that if you want to have this error bound, that's how much you need. We don't need to rely on our intuition on some number 10, or 5, or 20. We don't have these tools, and there are mathematical solution to all these questions.
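As one illustration of the kind of estimate described here, Hoeffding's inequality gives a distribution-free answer for accuracy measured on independent validation cases: to have the measured accuracy fall within a margin ε of the true accuracy with probability at least 1 − δ, it is enough to validate on n ≥ ln(2/δ) / (2ε²) cases. This is a sketch under those assumptions (independent cases, accuracy as the metric), not the procedure used in the hospital network mentioned above, and the function names are hypothetical.

```python
import math


def samples_needed(margin, confidence=0.95):
    """Validation cases needed so measured accuracy is within `margin` of the truth."""
    delta = 1.0 - confidence
    return math.ceil(math.log(2.0 / delta) / (2.0 * margin ** 2))


def margin_achieved(n, confidence=0.95):
    """Margin guaranteed by Hoeffding's inequality for n validation cases."""
    delta = 1.0 - confidence
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


print(samples_needed(0.01))    # cases for a one-percentage-point margin
print(samples_needed(0.02))    # cases for a two-point margin
print(margin_achieved(10_000)) # what 10,000 cases actually guarantee
```

Because the bound makes no assumption about the underlying accuracy, it is conservative; tighter estimates exist when more is known. The point is that the number of cases follows from the margin and confidence one is willing to accept instead of from intuition.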
So I think that we as a research community should really prioritize this question and develop these techniques because one of the reasons you can say that FDA was incorrect is by assuming that it's a human's job to monitor and say, is it diverse enough, is it biased enough. How do I know? But the truth is we don't provide the regulator with well-equipped statistical machine learning tools that can help guide them to answer those questions.
So I think this is up to us at this point because we need to do the development. Now, relate to other questions about the data. I think it's extremely complex phenomena because there are many different stakeholders. There are legal questions because HIPAA was written for insurances with insurance in mind. It was not written with machine learning in mind. But at the same time, there are some encouraging developments.
For instance, NIH works on All of Us, this very big million-people data set, and Broad is the one who actually collects it with other centers. So there are ongoing efforts in various parts of the world, but it's still very fragmented. And even if you're thinking about places like All of Us, it's still very hard to use it because you need to use it on their computer. It's very expensive. So it's something happening. I think that we all need to put our effort to really make it happen.
And regarding the second point, Munther, that you pointed, which is very correct. Really ensures that all of us are contributing and donating. I think that to me we really need to educate the public about really this dilemma. That if you want to de-risk, you actually need to help by donating your data to be part of it. Because if you are not part of it, the tools are not going to be optimized for you. And I think that this point is lost today.
It's such an interesting question because I think when people think of de-risking the AI, they think of protecting their data more and not giving it up. And I think that brings me-- this is probably a question I should have asked a while ago, but I can't help but think about it and want to really make it clear for our audience.
When I hear the words bias, or when I hear, oh, a regulatory body needs to come in and tell you whether things are too biased or not right, it makes me imagine that someone did wrong. That there was something almost intentional to this bias. But I'm not sure that's the case.
And so could you talk a little bit about whether there was sort of intentional bias ever put into these data sets or whether the bias has always been accidental, and therefore, the fix to it is really education rather than punishment.
So in majority of cases, the bias comes, I think, without bad intention. And I would give you an example of bias that happened in my lab, and it's a very bizarre bias. And remember, when we just started in 2015, work on images.
And what we tried to do is to predict whether the patient has future risk of breast cancer. But the first test that we did was actually look at the image and predict whether there is cancer. And to my surprise, when we built the first model, the student came back and said, we got 99.9 on the test. And you know and Munther knows that when the student tells you something, something is wrong. It cannot be true. So my first decision-- or maybe-- maybe test and train are not separated.
Maybe they are the same. It wasn't the case. We literally spend two weeks trying to-- because we knew something is wrong, but we couldn't understand what was wrong. We received the data from another hospital, and we couldn't understand what was wrong. So after literally two weeks and a lot of exploration, we found out the reason behind it. And the reason behind it-- that for whatever reason, the data providers put all the positive cases-- let's say from 2010 to 2012.
And all the negative cases-- from 2014 to 2015. And in the middle, they changed their machine. And on the image, there is an imprint of which device produced the image. So in that case, you had a full confounding variable, which is the source of the image, which perfectly correlated with negative and positive things. So if you just give the image, which is a very simple and deterministic task, to say which machine the image comes from, you can on test do 99.9%.
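A quick screen for this kind of confounding is to ask how well the label can be guessed from each metadata field alone, without looking at the image at all. The sketch below is illustrative only: the record layout, field names, and helper functions are hypothetical, and the accuracy is measured on the same records it is fitted to, so it is a warning sign and not a formal test.

```python
from collections import Counter, defaultdict


def metadata_only_accuracy(records, field, label="label"):
    """Accuracy of guessing the label from one metadata field alone."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[field]][r[label]] += 1
    # For each field value, guess its most common label and count the hits.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(records)


def majority_baseline(records, label="label"):
    """Accuracy of always guessing the most common label."""
    return Counter(r[label] for r in records).most_common(1)[0][1] / len(records)


# Hypothetical records: every positive came from device "A", every negative from "B".
records = (
    [{"device": "A", "site": i % 2, "label": 1} for i in range(50)]
    + [{"device": "B", "site": i % 2, "label": 0} for i in range(50)]
)

print(majority_baseline(records))                 # chance-level reference
print(metadata_only_accuracy(records, "device"))  # device alone gives the label away
print(metadata_only_accuracy(records, "site"))    # site carries no information
```

A field that beats the baseline by a wide margin deserves a question to whoever assembled the data before any image model is trained.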
And if it wouldn't be 99.9-- if it would be 88, I would say, wow, we really did a great job. It's just 99.9 never happens in reality. So the point is that it took a lot of investigation-- targeted investigation, but you can miss those things. So a lot of this bias comes from the fact that there was somebody who made a decision. They never shared it or documented it. Another example-- again, a bias.
I remember I was looking at some prediction from tabular data for predicting who is going to get breast cancer or something like this. And in that particular table, one thing that strikes me was that the-- again, some humongous portion of the women in that list had no children. It's like, wow, the number of children correlates-- increases your chance but not that strongly. So I asked them about it. Again, just really randomly scrolling.
And they ask, and they told me that the women have this questionnaire. Some of them put the number, whatever is the number of children. Some of them just don't for whatever reason. And then the software automatically put everywhere zero, and nobody paid attention to it. This was some default, and nobody pay attention. And then you can come and build yourself a model and make the prediction that women with cancer have no children.
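The failure described here is a blank answer being stored as zero. A minimal sketch of the alternative, with hypothetical function names and made-up questionnaire answers, keeps "not answered" distinct from "zero children" and reports how much of the column is missing:

```python
from typing import Optional


def parse_children(raw: Optional[str]) -> Optional[int]:
    """Return the reported count, or None when the answer was left blank."""
    if raw is None or raw.strip() == "":
        return None
    return int(raw)


def missing_share(values):
    """Fraction of entries that were never answered."""
    return sum(v is None for v in values) / len(values)


# Hypothetical raw questionnaire answers, some left blank.
raw_answers = ["2", "", "0", None, "3", "  ", "1", ""]
parsed = [parse_children(a) for a in raw_answers]

print(parsed)                 # blanks stay None instead of becoming 0
print(missing_share(parsed))  # share of unanswered entries
print(parsed.count(0))        # women who actually reported zero children
```

With a silent default, the same column would show five zeros out of eight and suggest that most of these women had no children.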
So lots of it is just really low level mistakes that we don't even know how they got there. There is truly no bad intention. But I would give you another example, where people are very honest about the bias, but it doesn't stop physicians from using it. So there is a model, which is called Tyrer-Cuzick. This is a model that looks at your categorical data. If you had children and when was your period and whatever and decides what is your risk of breast cancer.
And the reason it is used because according to US standard of care, if you are above 20% risk, according to this model, you can be screened with MRI. You can give certain drugs to decrease your chance of breast cancer. There is a whole bunch of things that you can get if by this model you are predicted to be in risk. I mean, this model is not great. The accuracy is around 65 AUC when 50 is random.
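For readers unfamiliar with the metric: AUC is the probability that a randomly chosen patient who developed the disease received a higher risk score than a randomly chosen patient who did not, so "65" corresponds to 0.65 on a 0-to-1 scale and 0.5 is coin-flipping. The sketch below computes it directly from that definition and breaks it out by subgroup, which is one way uneven performance can be surfaced. The scores, labels, and group names are made up for illustration.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """AUC computed separately inside each subgroup."""
    result = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = auc([scores[i] for i in idx], [labels[i] for i in idx])
    return result


# Hypothetical risk scores, outcomes (1 = developed cancer), and subgroups.
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.6, 0.4, 0.4]
labels = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(auc(scores, labels))                   # pooled over everyone
print(auc_by_group(scores, labels, groups))  # per-subgroup view
```

In this toy data the pooled number looks respectable while one subgroup sits at chance, which is exactly what a single headline AUC can hide.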
But the funny part that it performs like really abysmally for African-American, for Hispanic, for Asian. And the reason is that this model was created-- it's a normal statistical model. It was created in London decades ago on White women. And it was described as such in the paper. There was nobody who made an assertion, but it is used for everybody else because there was no other model. So this is an example. When person just described what they did-- they had the data.
They did it for British population, but then it came in, and it started to be used in all the other places. So there are lots of sources of-- it comes in many different ways, none of it with bad intention. And that's why I think we really should stop using human intuition try to identify.
Which is great, except that also regulation and so forth, as you mentioned earlier. There's a lot of human intervention, which is-- as you said, can make big mistakes and so forth.
So there's one philosophical dilemma that I want to present to you because we've been talking about this, at least, with this initiative that we have on systemic racism. Because at some level, we want all algorithms to be freed from biases, and so we remove features that describe you being Black, or being from a poor community, and so forth. Yet at the same time in health care, some of these features are actually critical and important for diagnostics. And so we're trying to balance.
When is the feature important to be included, and when is the feature not important to be included? And that applies-- that extends to socioeconomics and demographics. I mean, we know certain areas in the United States have a high probability of getting cancer. It's environmental. It's habits and so forth.
And yet at the same time, when you start asking questions about using census data and locational data for diagnostic, you get a lot of pushback because you are biasing against groups of people. And so how do we manage this tension between what we think we're doing the right thing but ignoring important data for good diagnostic, and hence therapeutics?
I think that first of all, when we are thinking we are removing the data, many times we are not removing the data, and we've demonstrated it.
[INAUDIBLE] group demonstrated. There were many people that demonstrated that, for instance, looking at the image for mammogram, you can predict the race of the woman. So even if you remove the race, it's still there. I mean, it's still there. Not in a way that the doctor can detect, but it's still there. But I actually think it's not about removing in the training because the only thing that we care about when we're predicting health outcomes is to be as accurate as possible.
I think the question is, what happens next when you have the prediction? Because we know that there are a lot of diseases that really are specific to a subpopulation. So I'm an Ashkenazi Jew, and there is a whole slew of diseases that happened in Ashkenazi Jews. Would it really help that you remove this information, which would most likely result in the decreased accuracy for predicting for me? I don't think so.
But what we need to make sure that once we discovered and we know potential outcome, then we're actually fair, and the system is systemically kind of mid-service. And I will give you an example for breast cancer. For instance, for a known reason, many African-American women get onset of breast cancer very early on-- before 40. And today, the US regulations that were just published like few weeks ago, the screening starts at age 40.
I know endless amount of African-American women who were diagnosed before. And if you are doing it for one population, and you're not thinking about other population, especially population which are young women and they don't even think about breast cancer, you are putting yourself in the situation that their health outcome are going to be much worse.
And indeed, this is the case with African-American women with breast cancer, whose outcomes are much worse than for White patients because many of them, due to the lack of screening and awareness in young age, they miss a critical time when they could have been treated with maybe less invasive treatments. So I think it's more about what do you do, and how do you create regulations that are equal across different groups? And it's true for other cancers.
Again, non-smokers and others, like in Asian population. The fact that in the US you need to be a smoker to be screened means the ethnic groups which have more prevalence of lung cancer are discriminated against.
So it seems like there is really concern about the efficacy of existing treatments because of this. Just like if you're an African-American woman, maybe you should be going for a mammogram starting at 30 instead of 40. And so is this really a big issue right now?
Are there a lot of examples of this, where you feel like the efficacy of certain treatments are really in question because of groups? I mean, should the public be worried about this for different diseases and different treatments, and what can they do? What can someone do to know how they should be treated based upon whatever their ethnicity is, or their environment, or whatever this is? Is there any resource for them?
So I think, of course, it's an extremely complex problem.
There is a lot of documentation that there are certain groups that systematically have low quality health care. And for instance, we know that prostate cancer disproportionately, in terms of the severe cases, affects African-American men. And I was just recently listening to a talk, and they were describing that since most of them are not treated in places like MGH, even the quality of biopsies are not good. That they come.
Sometimes they're treated in the centers, where there is explicit difference in the health care and access to insurance and so on. So this-- but of course, it's super important, and I'm sure there are many other people who can speak about it more effectively. But the question is what we can do as a data scientist to stop it or at least to interfere. And I think that the problem is like-- as you suggested, let's say we start screening women at age 30.
And the question is, what are you going to do? African-American? Are they going to be screened every year? Are you going to do Ashkenazi Jews because many of them have BRCA? And then it starts. What is the population? And I think everybody would be fine. If you do a first mammogram at age 30, you look at the image, and you say, this woman is unlikely to develop cancer in the next 10 years. We don't need to see her for 10 years. This woman really should be coming every year or every two years.
Then you can design patient-specific intervention based on their personalized things, not even like a bigger group. Because if you look at African-American, there are many different types of African-Americans, and they have different types of predispositions. So the point is we need to have a predictive tool that can look at the patient and say, this patient, this is their likely trajectory. That's what they need.
So you can service a broader population without burdening the system economically and at the same time not overexposing them to radiation and other side effects of treatment. So I think that AI can actually play a really significant role in this change.
So actually, tying this to what you said earlier, and that is the impact of insurance. So I have a history of colon cancer in my family, so I screen. And because we have a history of colon cancer, we cannot screen in a non-invasive way.
You have to have an invasive way, which always results in many different tests and so forth, which is a pain, right? But the insurance company will not accept a prediction of my case right now. So in other words, even though every test has been clean and many doctors feel that, OK, there's really 10 years here before you come back, the protocol is every two years, and they will not deviate from the protocol because they're afraid of a lawsuit.
And so this is what's happening with insurance and this personalized medicine. Is that what we need to do is potentially do more of clustering and so forth. I don't know if you have thoughts about this.
Well, absolutely. Absolutely. Because if you think about it, like, the major societal dilemmas, the screening frequency, it's a big one, correct? Because in your case, it's invasive procedure, which may have side effects, which is really interrupting [INAUDIBLE].
If you're looking at mammograms, which are done, it's like a big decision. There are lots of people, who I call them mammogram deniers, who thinks that it doesn't help. That there is no point to screen women every year. That maybe it increases cancer because they're irradiated. There is all these dilemmas. And the reason is we are trying to create a policy. This is one cluster. This is a rough cluster, and everybody has to follow.
And if you think about-- I always give this example, but I think it's really telling. If you look at Amazon, it's not like Amazon divided us-- you are women with PhD after 50. You are a teenager and so on. We have a very flexible mechanism that look at everything we click and buy and do, and we get a recommendation. Why we cannot have this type of fluent assessment for our health care system? Because we don't. Because we're divided by this very rough clusters, and everybody are getting the same.
I think that's one of the reasons that we are so non-effective. And how we can change it-- I think it's something that we have to change.
[MUSIC PLAYING]
Thank you for listening to this month's episode of Data Nation. You can get more information and listen to previous episodes at our website, idss.mit.edu, or follow us on Twitter and Instagram @mitidss. If you liked this podcast, please don't forget to leave us a review on Spotify, Apple, or wherever you get your podcasts.
Thank you for listening to Data Nation. From the MIT Institute for Data, Systems, and Society.
