Can GPT-4o Classify Tumors Better Than Us? AI-Powered Pathology Insights
Aleks: [00:00:00] Welcome. Good morning. Welcome to DigiPath Digest. We are back in our normal series because we just finished the AI series. We went through all the papers from Modern Pathology about AI in pathology. They had a super cool series that my team and I are now editing into a more concise version to enrich the pathology AI makeover course, and we are gonna update the book with the information from all those live streams and all those papers. So before we dive into the topic of our DigiPath Digest, which is our normal journal club review, I'm gonna tell you a story about what happened to me this week. Today you guys are joining me on the day of me becoming a US citizen.
Today is the day of my naturalization ceremony. It is gonna happen in a couple of hours; I have to drive there. So [00:01:00] today is Friday, and this is the day of the ceremony. On Tuesday I was just making myself some coffee and the phone rings, and I usually don't respond to unknown numbers. Can you imagine doing that?
In the nineties, not responding to your phone was unthinkable. And then in the early 2000s, not responding to your email; everybody responded, and I was like, unsubscribe, unsubscribe. Anyway, they call me and I see a number that I don't know. And then I see USCIS, the immigration services.
United States Citizenship and Immigration Services. And I'm like, okay. When I see that, I pick up, because I knew the ceremony was gonna be on Friday. So I'm like, are they moving my ceremony, or what's going on? I pick up the phone and I hear, "Oh, this is agent..." and something about the judge, and at first I didn't understand.
So I ask her, "What? Can you repeat that?" And she [00:02:00] says the judge selected you to give a speech on behalf of the new citizens. And I'm like, what? I didn't even know people were selected for that. Anyway, I said yes, and I'm gonna be giving that speech today. They told me on Tuesday, and I'm like, there is no way I can memorize a speech by Friday.
So obviously I went to all my AI tools. I recorded my unstructured thoughts and then had ChatGPT, Claude, Perplexity, and whatever else I use help me craft a compelling speech. It's just a five-minute speech, and I thought, ah, maybe I can incentivize you to stay till the end of the livestream by telling you I'm gonna give you this speech today.
But I decided, no, let me not do it before I'm actually supposed to do it. Once I have done it, maybe I can record it there, and if [00:03:00] not, I'll definitely share it with you. But yeah, I was working hard on the speech yesterday, on behalf of the new citizens. So if there are any non-US-born citizens here, anyone not born in the US who became a US citizen,
let me know in the chat. Or any foreigners dialing in from the US, or any people who were not born in the country where they currently live, just give me a hi in the chat because of this story. And now let's dive into the topic of our fantastic livestream. Yeah, that was my story of the week.
But now we need to actually read some papers, because we have not been reading papers for I don't know how many weeks; the seven-part series took more than seven. Let's start with making the tools work. And let me know where you're dialing in from.
Let me know in the chat that you hear me. I had a little bit of a problem in the beginning, so just let me know that everything is okay. [00:04:00] And we start with "Enhancing Malignancy Detection and Tumor Classification in Pathology Reports: Comparative Evaluation of Large Language Models." This is so cool because large language models are such a part of our life now. I'm so happy that this AI tool made it into the mainstream. This is a group from Austria, and the background of this paper is that cancer registries need accurate and efficient documentation, and manual methods are obviously time consuming and error prone.
I know sometimes I don't see the comments coming from LinkedIn immediately, but keep commenting, I will see them. So let me know where you're dialing in from. Time consuming and error prone: everything that we do manually, obviously.
So the objective of the study was to evaluate the effectiveness of LLMs in classifying [00:05:00] malignancies and detecting tumor types from pathology reports. So we're working with text, we're not working with images here. And the cool thing, which they describe in the second paper as well, that we're gonna talk about in a minute, is that they were using a synthetic dataset of 227 reports.
They made synthetic reports, which is super cool. Synthetic data is, I would say, a new discipline in the medical field, where you can actually test tools on synthetic data. Fantastic. And then they checked the performance of four LLMs and a score-based algorithm.
It was compared against expert-labeled standards, and the LLMs, specifically GPT-4 and Llama, demonstrated high sensitivity and specificity in both malignancy detection and tumor classification, outperforming traditional algorithms. So they had this score-based algorithm and the language models [00:06:00] outperformed it, and the conclusion is that language models enhance the accuracy
and efficiency of cancer data classification. And I like this thing: text mining. Data mining was something that was like the hype; it's still a discipline, but the first time I heard about it was maybe five, seven years ago. Data mining. And now we can do text mining.
Imagine how cool it is. You can just give this language model all the reports that you have, and you don't have to Ctrl+F, or whatever it is you use to search, through everything manually. I'm thinking how many reports I had to go through,
years of reports, for my PhD, to find the diagnoses, and then I had to go into the archive to dig for slides and reevaluate those slides. How much easier my life would have been with large language models, and also with whole slide images of all these slides [00:07:00] that I had to dig out of the archive.
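If you want to picture what that kind of LLM-based report classification can look like in code, here is a minimal sketch. This is not the authors' actual pipeline; it just assumes an OpenAI-style chat client, and the prompt, model name, and label lists are made up for illustration.

```python
# Minimal sketch: classify free-text pathology reports with an LLM.
# Assumes the OpenAI Python SDK and an API key in the environment;
# the prompt, labels, and model name are illustrative, not from the paper.
from openai import OpenAI

client = OpenAI()

LABELS = ["malignant", "benign"]
TUMOR_TYPES = ["prostate", "lung", "breast", "other"]

def classify_report(report_text: str) -> str:
    prompt = (
        "You are assisting a cancer registry. Read the pathology report below "
        f"and answer with two words: malignancy ({'/'.join(LABELS)}) and "
        f"tumor type ({'/'.join(TUMOR_TYPES)}).\n\nReport:\n{report_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for a classification task
    )
    return response.choices[0].message.content.strip()

# Example call (hypothetical report text):
# print(classify_report("Core biopsy of the prostate showing acinar adenocarcinoma..."))
```

The point is simply that the "classifier" is a prompt plus a parseable answer format, which you can then score against expert labels with ordinary sensitivity and specificity.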
But you know what? That's okay. The people who now have large language models are gonna have different stories to tell in 10 years, when the new thing comes and they are like, oh, we only had LLMs, and now we have this new thing and it is so much more efficient. So going back to our artificial data.
Okay, I don't know if the comments are not working, because I don't hear you, I don't see you in the comments today. Don't do this to me. Just say hi. And in the meantime, let's continue. The same group from Austria developed these synthetic oncology pathology datasets for large language model evaluation in medical text classification.
So they have a dataset of these reports that you can test against. The background here is... ah, let's see where this was published. [00:08:00] This was published in Studies in Health Technology and Informatics. See, I forgot our format; I should have checked the journal to tell you what the impact factor is.
I haven't done it for a long time, so let's just focus on the content today. Large language models offer promising applications in oncology pathology report classification, and probably in any medical report classification, improving efficiency and accuracy and providing automation. But the use of real patient data is restricted due to legal and ethical concerns, right?
It's always a concern when we use patient data, because we want it to be de-identified, but there have been several publications saying, oh, no matter how you de-identify it, you can still re-identify it. Mixed feelings about it, and different opinions on this. So one of the solutions that can be used is to develop a synthetic dataset, right?
So this [00:09:00] study aimed to develop a synthetic oncology pathology dataset to serve as a benchmark for LLM evaluation, enabling reproducible and privacy-preserving AI research. So they created these 227 synthetic reports. They were generated using Microsoft Copilot, ChatGPT Plus, and Perplexity Pro.
Is that the 4o, the one you have to pay for? Anyway, I loved it, because they did it to ensure structural and linguistic diversity, right? So they used different models to make this dataset as versatile and diverse as possible. How cool. It's just so surreal that we can now generate this artificially.
So this is amazing. And the dataset included cases of prostate (75 of them), lung, and breast cancer, evenly distributed between malignant and benign. [00:10:00] You can synthesize the distribution of whatever you're looking for. A challenge in pathology, and in the medical space in general, is that the prevalence of disease is always lower, right?
I don't remember, there's a name for this phenomenon, data that is underrepresented; there is another word I'm looking for. So if you remember this word, let me know in the chat. But now you can totally just generate the data you want. I don't think we're quite there with images, but with text, you totally can.
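To picture how you control that distribution, here is a tiny illustrative sketch of building a balanced grid of generation prompts (organ by malignancy status) that you could send to any chat LLM. The organs, styles, counts, and wording below are placeholders, not the paper's actual generation protocol.

```python
# Sketch: build a balanced grid of prompts for synthetic pathology reports.
# The organs, styles, and report count are illustrative placeholders.
import itertools
import random

ORGANS = ["prostate", "lung", "breast"]
MALIGNANCY = ["malignant", "benign"]
STYLES = ["terse gross/micro list format", "narrative paragraph format"]

def build_prompts(reports_per_cell: int = 2) -> list[str]:
    prompts = []
    for organ, status in itertools.product(ORGANS, MALIGNANCY):
        for _ in range(reports_per_cell):
            style = random.choice(STYLES)  # vary structure and language across reports
            prompts.append(
                f"Write a fictional {status} {organ} pathology report in a "
                f"{style}. Do not include any real patient identifiers."
            )
    return prompts

prompts = build_prompts()
print(len(prompts), "prompts; first one:", prompts[0])
```

Whatever comes back from the models still needs a human pass, which is exactly what happened in the study.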
And then an expert reads through it and takes out the nonsense, like I did for my speech. I was semi-generating it with AI, but then I went through it super thoroughly, took out everything that sounded like ChatGPT, and put some other stuff in. Anyway, long story short, you can do that with a report as well.
So it's the other way around. When you generate ground truth, whether it's pathology scoring, pathology report writing, whatever diagnosis, right, annotations, specifically annotations, you have the subjectivity of a human that you check against. Here it's the other way around: you generate something automatically and then the human checks if it is correct.
So yeah, a flip of roles. The reports were reviewed and classified by three independent cancer registrars. And the results say that the dataset provides a structured, clinically relevant benchmark for evaluating LLM performance in pathology text classification,
and it enables AI... oh, sorry, and I'm not making this big; nobody's telling me. I think I'm not seeing your comments today. Just, I dunno, I'm gonna be whining throughout the whole livestream: nobody's commenting. And then I go on LinkedIn and there's a bunch of comments that I didn't respond to.
But that worries me a little bit, because if you have questions, I cannot answer them live. But on YouTube the comments are coming through. Let me make this big so that you can see [00:12:00] what I'm actually highlighting. Yeah, so we now have a dataset that we can check models against, which is super cool.
This next one I'm gonna just mention by title and not go into it, but there is an interoperability framework of the European Health Data Space for secondary use of data, an interactive European Interoperability Framework-based standards compliance toolkit for AI-driven projects. Why am I mentioning this?
Because it's a new guideline that we can use. Somebody already developed it. And I got inspired to highlight these guidelines by the seven-part series. There was one specific article about regulatory frameworks and things like that, and there's always this repeated sentence,
I don't wanna say slogan, but repeated sentence: oh, the regulators are lagging behind, the regulators don't know. Okay, I see, finally some [00:13:00] comments. Thank you so much for showing some comment love, but it's on YouTube. So I think LinkedIn is not connecting today. Hello. Thank you so much.
I very much appreciate your comments. It always makes me feel like I'm not talking to my computer but to actual people. So thank you so much. Oh, wait, LinkedIn is coming through. Thank you so much. Thank you so much. Okay. Going back to my train of thought: in that regulatory part of the seven-part series, there was a huge table with different regulations.
So I think we're underestimating our regulators. They produce a lot of guidances and they put a lot of thought into those guidances. So we don't have to reinvent the wheel; we can go check the guidances. And now we have an interoperability framework in Europe, so we can check that.
And that was published in the Journal of Medical Internet Research. I don't know. [00:14:00] I'll do my due diligence next time on all the journals that we're discussing. So there is a framework. Oh, this one, I have to talk about this one, 'cause these people are from Poland. Hi to all the authors of the publication called "Efficient Annotation Bootstrapping for Cell Identification in Follicular Lymphoma," and these people
are in Poznań, where I used to go to high school. My friends, this is where I went to high school. Anything else? Oh yeah, these are familiar places: the Maria Skłodowska-Curie National Research Institute of Oncology, Tumor Pathology Department. I was a patient there with my thyroid cancer, when I had thyroid cancer, so I was a regular visitor.
There, you can see my scar. The surgery to take my cancer out was done somewhere else, but then I went there for radioactive iodine therapy. I would travel to this center a number of times over five years or something like that.
That was that. Let me make myself small again. So let's check what they did, these groups. And something else: we also have the capital, Warsaw. So what did they do? I have everything here. Yeah, so there's another heart here, ha. Okay. So, background and objective: obviously, annotations are a bottleneck. We know; anybody who has ever annotated anything for deep learning or any other image analysis model development knows that acquiring a substantial number of annotations for developing deep learning algorithms remains a bottleneck.
And the annotation process is inherently biased due to various constraints: labor shortages, high cost, time [00:16:00] inefficiencies, and a strongly imbalanced distribution of labels. Class imbalance! That was the term I was looking for in the other paper: class imbalance.
A strongly imbalanced distribution of labels means that there are more healthy than sick, always, hopefully, in the population. And that translates into consequences for model training, right? If we have an underrepresented class, then we don't have much training data to train on. So if we have synthetic data, we can generate as many examples of this underrepresented class as we want.
In this case we are talking about images; in the previous case we were talking about text. So text is easier right now, but images are catching up, guys. Probably next year we're gonna be talking about it, if they don't do it sooner. I would imagine it first for cell classification, not whole tissue architecture, but I think it's gonna come soon.
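To make the class-imbalance point concrete, here is a minimal sketch of the usual stopgap when you cannot just synthesize more minority-class data: inverse-frequency class weights so rare classes count more in the loss. The counts below are invented purely for illustration.

```python
# Sketch: inverse-frequency class weights for an imbalanced label set.
# The counts are invented; in practice you would derive them from your dataset.
from collections import Counter

labels = ["benign"] * 900 + ["malignant"] * 100  # 9:1 imbalance
counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# weight_c = n_samples / (n_classes * count_c): the rarer the class, the larger its weight
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(class_weights)  # {'benign': ~0.56, 'malignant': 5.0}
```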
And they compare three distinct approaches to [00:17:00] annotation bootstrapping: extensive manual annotation (who did that? Raise your hand; yes, I did a lot of that), active learning, and weak supervision. Active learning is when you start annotating and the model, in a system that learns as you annotate, shows you examples of what it thinks is the class you are annotating right now, and you click yes or no, yes, no. So it speeds up the process as well; there's a little sketch of that idea just below. And then weak supervision is where you don't do detailed annotations, but you give a weak label. And they propose a hybrid architecture for centroblast and centrocyte detection from whole slide images based on a custom cell encoder
and contextual encoding derived from foundation models for digital pathology. So this is cool because they are using foundation models. We have not heard about foundation models for three abstracts. Actually, the large language [00:18:00] models are foundation models, but we didn't hear the words "foundation model" for three abstracts, guys.
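Here is that promised sketch of the active-learning idea: uncertainty sampling with a yes/no oracle. This is a generic illustration on toy data with a plain logistic regression, not the authors' centroblast/centrocyte pipeline; the features, budget, and number of rounds are all made up.

```python
# Sketch: uncertainty-sampling active learning loop on toy data.
# The data, model, and annotation budget are placeholders, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))                 # stand-in for cell feature vectors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # hidden "ground truth" labels

labeled = list(range(20))                       # start with a few manual annotations
unlabeled = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = np.abs(proba - 0.5)           # closest to 0.5 = least confident
    query = [unlabeled[i] for i in np.argsort(uncertainty)[:20]]
    # "Oracle" step: in real life this is where the pathologist clicks yes or no.
    labeled += query
    unlabeled = [i for i in unlabeled if i not in query]
    print(f"round {round_}: labeled = {len(labeled)}")
```

The model keeps asking about the cells it is least sure of, which is why the annotator's clicks go further than labeling random cells would.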
Anyway, they use the contextual encoding, and they collected datasets from 41 whole slide images. I was a little underwhelmed by 41 images, but you know what, if that's enough data, then that's enough data. They were scanned with a 20x objective lens, and the resolution was 0.245 micrometers per pixel.
And that's a lot of cells: like 12,000 cell annotations were gathered. So I guess, yeah, because it's lymphoma and they're doing cell annotations, 41 slides was enough and they didn't need to do more than that. And the results: the proposed active learning workflow led to an almost twofold increase in the number of samples within the minority class, and the best bootstrapping method improved the overall performance of the detection algorithm by 18 [00:19:00]
percentage points, yielding a macro-averaged F1 score, precision, and recall of 63%. I would like to see it higher, but maybe this macro-averaged F1 score is very "average," and for some classes it's actually better; we would need to go into the paper to figure that out. But yeah, different ways of streamlining annotations.
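A quick illustration of why a macro average can hide a lot: the predictions below are invented, not from the paper, but they show how one strong class and one weak class average out to something in between.

```python
# Sketch: macro-averaged F1 can hide very different per-class performance.
# These labels and predictions are invented purely to illustrate the averaging.
from sklearn.metrics import f1_score

y_true = ["centroblast"] * 10 + ["centrocyte"] * 90
y_pred = ["centroblast"] * 4 + ["centrocyte"] * 6 + ["centrocyte"] * 90

print(f1_score(y_true, y_pred, average=None, labels=["centroblast", "centrocyte"]))
# per-class F1: centroblast ~0.57, centrocyte ~0.97
print(f1_score(y_true, y_pred, average="macro"))  # ~0.77, halfway in between
```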
And these approaches are working, they're improving. Now let me know if you have any questions, any comments other than just a response to my begging for comments. If you actually have questions or something that we can talk about, we can discuss it, even if you're here for the first time, that's okay. Just put your question in the chat and I'm gonna be happy
to respond to it. And now let's discover why there's a heart here. The heart is here because the people behind this one are actually my colleagues from my day job at Charles River Laboratories. Of [00:20:00] course I had to review this one. It's tox path, and it's our last paper. I know that many of my digital pathology trailblazers are more in the clinical space, but if you are doing image analysis in any shape or form, this is gonna be interesting for you as well.
So what's happening here is that thyroid tissue is sensitive to the effects of endocrine-disrupting substances, and histopathological analysis of the rat thyroid gland is the gold standard for the evaluation of agrochemical effects on the thyroid. You may know that for drug development, for any
prescription drug, there is some toxicologic safety evaluation, but not only for drugs, also for agrochemical substances. They check the rat thyroid gland in a histopathological evaluation. But there is a high degree of variability in the appearance of the rat
thyroid gland. And [00:21:00] what does that mean? Toxicologic pathologists often struggle to decide, and to be consistent in applying, a threshold for recording low-grade thyroid follicular hypertrophy. When it's low grade, it's gonna be very subtle, always around the threshold. And in whichever visual estimation of amount, size, or anything around a threshold, there's always gonna be disagreement, regardless of whether it's this use case or IHC quantification or anything like that. One time I went to a PD-L1 scoring workshop with German MD pathologists, and for different companion diagnostics you had a different threshold for actually prescribing the drug.
For some it was 50%, for some it was 10%. Basically, below that you don't give the drug, and above that certain threshold you give the drug. They were trying to visually train us to be more consistent, but no matter how much they [00:22:00] tried, and it was a half-day workshop or something, it was always around the threshold: when they made us raise hands, half of the people would say below the threshold and half of the people would say above the threshold.
That means: do image analysis, don't ask people around the threshold. There's always gonna be a 50-50 chance that you are wrong, and you don't wanna do this to patients. So that's why I'm an advocate for image analysis, and that's what they did here. This was a project where they developed a deep learning image analysis solution that provides a quantitative score based on the morphological measurements of individual follicles.
And this can be integrated into the standard pathology workflow. They did it with a U-Net convolutional deep learning neural network. It identifies the various tissue components of the thyroid gland and delineates individual follicles. So they did this on the image analysis side, and [00:23:00] then they figured out the dependent variables, or how they were gonna
calculate this thyroid activity score, so how active the thyroid is based on this. And that was superior to the mean epithelial area approach when compared with pathologist scores. And here we have this chicken-and-egg situation: they were comparing it to pathologist scores.
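For the image-analysis folks, here is a very rough sketch of what going from a segmentation mask to per-follicle measurements can look like. It uses scikit-image on a toy mask, and the "activity ratio" at the end is a made-up summary metric, not the paper's actual thyroid activity score.

```python
# Sketch: per-follicle measurements from a labeled segmentation mask.
# The toy mask and the "activity" ratio are illustrative, not the paper's score.
import numpy as np
from skimage import measure

# Toy mask: 0 = background, 1 = colloid (follicle lumen), 2 = follicular epithelium
mask = np.zeros((100, 100), dtype=np.uint8)
mask[20:40, 20:40] = 1                              # one "follicle" lumen
mask[18:42, 18:42][mask[18:42, 18:42] == 0] = 2     # epithelial rim around it

follicles = measure.label(mask == 1)                # label individual colloid regions
props = measure.regionprops(follicles)
colloid_area = sum(p.area for p in props)
epithelium_area = int((mask == 2).sum())

# Hypothetical summary: more epithelium relative to colloid -> "more active" gland
activity_ratio = epithelium_area / (colloid_area + epithelium_area)
print(len(props), "follicle(s), activity ratio =", round(activity_ratio, 3))
```

The appeal of a quantitative readout like this is exactly the threshold problem from the PD-L1 story: a continuous number applied consistently, instead of a visual call that splits the room 50-50.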
They say at the beginning that pathologists are inconsistent, but that's not specific to this particular paper. This is always the gold standard paradox; it's my favorite paper to cite, from a friend of mine, Dr. Famke Aeffner. So yeah, that's it for today. Thank you so much for joining.
Not doing the seven-part series is significantly shortening our livestream, so you get half an hour back, and so do I, so I can get ready for my naturalization [00:24:00] ceremony. Let me tell you how I crafted the speech: I wanted to convey a little bit of my story and all that stuff, so I gave that to my AI tools,
and they crafted a speech for me. Then I went through the speech and I'm like, how am I gonna deliver it? I will not be able to learn it by heart. So what did I do? I asked my AI tools what the best way is, and I now have a teleprompter on my phone. So what I'm gonna be doing before my ceremony, once I get ready and bring my kids to daycare
(one kid is actually going with me and the other one is gonna stay at daycare), is practicing my speech. And next time I'm gonna tell you the speech, if you want. Let me know if you want it. If you're like, I don't care about your speech, I just wanna do the journal club, then let me know as well.
It's a five-minute one. Okay, thank you so much, and I'll talk to you in the next episode.