Aleksandra: [00:01:49] Today my guest is Geert Litjens. He's a member of the Computational Pathology Group at the Radboud University Medical Center in Nijmegen in the Netherlands. He is an author of many publications on the subject, co-organizer of the CAMELYON16 and CAMELYON17 image analysis challenges and a renowned expert in computer vision for digital pathology. Hi Geert. How are you?
Geert: Hi, Aleksandra. Very good. Thank you. Thank you for having me.
Aleksandra: We're going to be talking about weakly and unsupervised learning in computational pathology. I have some understanding of this, but what I have seen across the industry, especially on the side of non-computer scientists, is that this is perceived as some new magic that computer vision is now able to do, and maybe we don't have to do annotations anymore, and we can just use this [00:02:49] weakly or unsupervised learning and take deep learning in pathology to the next level. So I invited Geert to talk to us about it, about the potential of these methods and also about their limitations.
Geert: Yes. So I'll go ahead, and I think Aleksandra will interrupt me when there are any questions.
Aleksandra: I will be interrupting you a lot because I will be asking questions.
Geert: Very good. Okay. I'll just get started. Thank you for that kind introduction again. Well, I'll just very briefly go over some very basic applications of computational pathology that I think by now everybody who's listening to this podcast knows. So for example, a common task that a lot of people have worked on in the past is the detection of metastases of cancer in lymph nodes. On the clinical side, this is pretty challenging because these metastases can be very small. We've shown in the past that using supervised learning, so fully [00:03:49] supervised learning with detailed pixel-level annotations, you're able to solve this problem pretty well. The way you generally do it, or traditionally do it, is you just take patches of normal tissue, patches of abnormal tissue, and you feed those to your deep convolutional neural network.
Obviously, these patches are based on detailed annotations by a pathologist. Then you can train a deep learning system, a convolutional neural network, to learn what is tumor, and then you can apply that to a whole slide image and you get a sort of likelihood map or segmentation, whatever you want. For lymph node metastasis, for example, if you have fully annotated slides, that works pretty well for small metastases, but also for large ones. And in many cases now, for lymph node metastases but also for other applications, it has been shown that these types of deep convolutional neural networks can outperform pathologists if it's for-
Aleksandra: It was [00:04:49] used for the CAMELYON challenge. And I just want to mention that your group was a co-author and co-organizer of this challenge, right?
Geert: Yeah. So we were. For the CAMELYON challenges, we were the main organizers, but we didn't have the best algorithm ourselves, so the best algorithm won.
Aleksandra: You didn't win? Oh. That would be-
Geert: We didn't win. We did organize it, but we-
Aleksandra: No, I mean, not good that you didn't, but it would be suspicious if you had.
Geert: Yes. So luckily there were people better than us. So this group from Harvard Medical School and MIT in the end won the challenge, and they later started a company that might be familiar to many of the listeners, PathAI.
Aleksandra: So, here we look for the epithelial cells in the lymph node, and we can train the network to detect them?
Geert: Yes, exactly. So I think this is pretty traditional, what most people know from computational pathology, and it's also something that pathologists might not like so [00:05:49] much because annotating these cells in detail is just, yeah, a massive amount of work. Of course, there are tricks that you can do to work around this. So we've done this for mitotic counting, where you just use immunohistochemistry as a sort of surrogate reference standard, but you cannot do that for every task. And yeah, it's more expensive. You need to go back to your tissue blocks, you have to restain, you have to register, so there are also a lot of other complications that come up. But it's an alternative to pathologist-based annotations. Still, this is fully supervised learning where you need to annotate everything; only here, the human burden is a bit less.
Aleksandra: I like this way of annotating because it's a lot more objective. You have a different method, a molecular method, that's doing this and not a human observer, and you kind of eliminate the human differences, the inter-observer variability in annotations.
Geert: Yeah. And especially for tasks like [00:06:49] mitotic counting, but also for tumor grading, where for the latter there's generally no molecular stain, it helps make your reference standard a bit more objective and also your training data a bit more objective. But yeah, like I said, that's also not possible for every single task. For example, if you want to predict patient survival from your slide, or maybe some form of genetic mutation, generally you cannot annotate that, and often there is also not an antibody you can use to highlight these regions in the slide.
Geert: So this is where other types of learning come in. So we've discussed supervised learning, and I'll first just briefly go into weakly supervised learning. So just to maybe sketch a bit the difference between computer vision and computational pathology: in computer vision, generally the images they work with are in the order of magnitude of a couple of hundred pixels times a couple of hundred pixels, and maybe at the most like a thousand times a thousand. That means that you can just, regardless of [00:07:49] the content in the image, train an end-to-end convolutional neural network, so you don't generally need pixel-level annotations for a classification task.
Aleksandra: So you basically feed in the whole image if it's just a natural image?
Geert: Yeah. If it's just a natural image, you tend to just feed in the entire image. As everybody in computational pathology knows, for a slide, that's not so easy. A typical slide consists of more than 10 billion pixels. Just to put it in context with a common computer vision dataset: one single slide is already one tenth of the ImageNet challenge, and the ImageNet challenge is a million images. So just to give a bit of a scale comparison.
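For readers who want to see the comparison concretely, here is a quick back-of-the-envelope calculation. The slide size and the 224x224 ImageNet training resolution are illustrative assumptions; the exact ratio depends on them, but it comes out in the same order of magnitude as the "one tenth" mentioned here.

```python
# Rough scale comparison between one whole slide image and ImageNet.
# Numbers are illustrative; the point is the order of magnitude.
slide_w, slide_h = 100_000, 100_000            # one WSI at full resolution
wsi_pixels = slide_w * slide_h                  # 10 billion pixels

imagenet_images = 1_280_000                     # roughly 1.3 million training images
imagenet_pixels = imagenet_images * 224 * 224   # at the usual training resolution

print(f"One slide:        {wsi_pixels / 1e9:.0f} billion pixels")
print(f"ImageNet (224px): {imagenet_pixels / 1e9:.0f} billion pixels")
print(f"Ratio:            {wsi_pixels / imagenet_pixels:.2f}")   # on the order of a tenth
```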
Aleksandra: I love this comparison. I love taking this because, like you said, everybody knows they're big. Like how big are they? I usually give the comparison, okay, one image can be like a two-hour HD movie, but-
Exactly.
... in the computer vision context, it's already one tenth of ImageNet, and ImageNet [00:08:49] is supposed to be this massive database of images.
Geert: Yeah, exactly. So the thing is that with the images we're working with in computational pathology, putting an entire image naively into a deep convolutional neural network is impossible. Even if you had the fastest supercomputer on the planet, it would still be completely infeasible to do that. So people-
Aleksandra: Geert, a question here.
Geert: Yeah.
Aleksandra: Do you see it being possible anytime in the future? Or what would have to happen for this to be possible?
Geert: Yeah. So you see that certain aspects of computation are still growing close to exponentially. For example, we're still doubling the amount of computer memory every X years, and we still manage to get bigger and bigger hard drives and SSDs. So I will not say that it's never going to be feasible, but it's still going to take quite some time, multiple decades probably, before you can run [00:09:49] actual experiments with whole slide images naively.
And then the additional question is still whether that is the most efficient way to do it, just brute-forcing the solution, or whether you can come up with a smarter way of doing this. I think several groups have already shown that this is feasible without having the supercomputer from the future. So just to sketch that a bit: we've already discussed the number of pixels. Generally, for tasks where you have slide-level labels, for example patient survival or something else that you want to predict, you cannot even make these annotations. But even if you could, they're time-consuming, there's disagreement, all these things we've already discussed.
Another issue that people tend not to realize when they naively work on the problem is that the neural network also only sees patches, so it has no clue about the global context. So where is this patch in the tissue? Is there some coherent-
Aleksandra: Geert.
Geert: Yeah?
Aleksandra: A [00:10:49] question here. This is true for the supervised case as well, right? Even if we annotate the full slide, the network only learns fragments?
Geert: Exactly. Yes. So if you train a patch-based network, so based on tiles from the image, you know that the network can never learn the context of this patch. For example, in the lymph nodes, the anatomical structure of a lymph node is completely unknown to the network, and it also cannot leverage this knowledge to make better predictions. That's where you, for example, sometimes see stupid false positives that a pathologist can easily discard because they know, "Okay, this group of epithelial cells is probably just a fragment that is appearing there because of surgical artifacts instead of being an actual metastasis." If you had global context, a neural network could also learn that, but that's also true for supervised learning. If you only have patches, that's something you cannot learn.
Aleksandra: So then even if you have like perfect annotations [00:11:49] of everything, there is a limitation because you cannot see the context?
Geert: Yep, exactly.
Aleksandra: Okay. I think this is often forgotten, and the method to improve is always more annotations, better annotations, more annotations. At some point, there is no improvement anymore, and I guess this is one of the reasons why this is happening.
Geert: Yeah, yeah. So this is one of the reasons. Of course, there are tasks, for example the lymph node metastasis detection, that you can solve for a large part locally, so you don't need the global context for many of the questions you can ask. But there are also other things. So you have the global context of the slide, but, for example, also patient information: things like age, history, outcome of blood-based tests, all that kind of stuff. A pathologist can use this information to come up with a better diagnosis, and this is still very uncommon in computational pathology.
Aleksandra: So we're not yet using this multimodality [00:12:49] of information?
Geert: Not really. So there are some research papers out there that look at that, but I don't think there's like a consensus way of incorporating this information.
Okay. So for weakly supervised learning, there's a very famous paper from the group that later formed the company Paige, probably also very well known to the listeners of this podcast: the paper from Campanella et al., Nature Medicine 2019. I think this was the first large-scale application of weakly supervised learning to histopathology images that actually had very competitive results compared to supervised learning. People have tried these types of algorithms before, but I think this was the first time that the results were really competitive with fully supervised learning-
Aleksandra: And speaking of different spin-off companies, does your group already have a spin-off company?
Geert: Yes. Since last year, we also have a spinoff company. So thank you for allowing me to advertise it.
Aleksandra: Of course. Go ahead. That's [00:13:49] what this podcast is for!
Geert: No. So our group late last year spun out our startup, Aiosyn, and they've gotten some nice initial seed investments and they're now working hard. So a couple of my former PhD students, together with Jeroen van Der Laak as a scientific advisor and an experienced CEO, Patrick de Boer, are now trying to bring all these algorithms, all this stuff we developed over the past 20 years, to actual use, both in clinical practice and in research settings.
Aleksandra: I'm going to give you a couple of months to accelerate a little bit and maybe we're going to meet again to talk about the company.
Geert: Yeah, sure. That would be nice. Yeah. So that's it on spin-off companies.
Aleksandra: I think we'll link to your website in the show notes.
Geert: Yes. So what they did in the Nature Medicine paper is actually not that complicated from a methodological perspective. They still took a patch-based approach. They essentially divided the entire slide into patches [00:14:49] and put all these patches through a convolutional neural network, so a very simple patch-based classifier. But then they said, "Okay, the patch that has the highest probability of containing cancer, that's the patch that we consider representative for the entire slide." So they still classify the entire slide based on patches, but they only take the top one patch as the slide-level prediction.
Well, naively, you lose a lot of information, right? Because in the end you're only using one patch per slide. But they solve that by just using tens of thousands of slides. When you do that, you still have tens of thousands of images, so patches in their case, and you can still get very competitive results compared to settings where you would completely manually annotate all the individual tumor nests. That was a bit of a wake-up call, I think, to the community, that these types of methods are actually feasible as long as your dataset is large enough.
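For readers who want to see this top-1 patch idea in code, here is a minimal PyTorch sketch of this style of multiple instance learning. It is not the code from Campanella et al.; the backbone, the tiling into `slide_patches`, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

# Minimal sketch of top-1 patch multiple instance learning (MIL).
# Assumption: `slide_patches` is a tensor of shape (num_patches, 3, 224, 224)
# holding the tissue patches tiled from one whole slide image, and
# `slide_label` is 1 if the slide contains cancer, 0 otherwise.

model = torchvision.models.resnet18(weights=None, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(slide_patches: torch.Tensor, slide_label: int) -> float:
    model.eval()
    with torch.no_grad():
        # 1) Score every patch; keep the probability of the "cancer" class.
        probs = torch.softmax(model(slide_patches), dim=1)[:, 1]
    # 2) The single highest-scoring patch is taken to represent the slide.
    top_idx = probs.argmax()

    # 3) Train on that patch with the slide-level label. For negative slides
    #    this is always correct (every patch is normal); for positive slides
    #    it becomes more reliable as the model improves.
    model.train()
    optimizer.zero_grad()
    logits = model(slide_patches[top_idx : top_idx + 1])
    loss = criterion(logits, torch.tensor([slide_label]))
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key detail Geert explains a bit later is visible in step 3: for a negative slide, the selected patch is guaranteed to be normal, so that training step always uses a correct label.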
Aleksandra: I think this is... I [00:15:49] guess you're going to show more advanced methods still, but I think this was the moment where it was, oh, if you have a lot of data, then you don't need to annotate so much. So there was this camp of, "Oh, you have to have a lot of annotations," and then it was like, "Just throw in as many images as you can, and then it's going to be good enough as well."
Geert: Yeah. Yeah. So I will get into a bit more advanced methods. It's important to realize that the tasks they worked on in this paper are also tasks that with supervised learning, so fully supervised learning, I think you can solve with tens, maybe hundreds of slides. So the difference in scale that you need to actually make this method, as they implemented it, work is about two orders of magnitude more slides.
Aleksandra: Oh, okay. So, you either sit there for hours and annotate, or you gather a lot of slides and don't annotate, which is very difficult.
Geert: Exactly, but yeah, it's not [00:16:49] always easy to get, yeah.
Aleksandra: One question before we move on. So they chose this tile, this patch of highest probability. How do you know? How does the network decide that this is the highest probability? If it's only weakly supervised, it's just a label that there is cancer, right? How-
Geert: Yeah, so you leverage the knowledge that in slides that have a negative label, so that don't contain cancer, you're 100% sure that every single patch should be normal. So for those, even without annotations, you already have the correct labels. For the slides with cancer, there is the question: okay, which patches within that slide actually show cancerous cells?
So when you start learning, the network will be pretty bad at that. But then, as with other patch-based networks, you can just say, "Okay, I know that in the normal [00:17:49] cases, everything should be normal, so I can essentially learn the normal." Then slowly you learn what is abnormal, so you iteratively get better at identifying the abnormal patches.
Then, after a bit of training, you get reasonably good at finding the cancerous patches in cases with cancer, and then you also start leveraging those in your learning process, so you get a better and better understanding of what is cancerous as well. That's, I think, the key that maybe some people don't realize: the fact that this works is also because, for the slides that don't have cancer, in their application you actually know 100% that all the patches are normal.
Aleksandra: That you're correct in your label. Yeah. I think that's maybe often missed, because from supervised learning you're so focused on the lesion, on the change that you're annotating, that you kind of apply the same type of thinking [00:18:49] to everything else. And here, like you say, whatever is non-cancer is a hundred percent accurate. Yeah.
Geert: Essentially all the non-cancerous slides are still fully annotated.
Aleksandra: Exactly. Exactly.
Geert: So that's the advantage that you're leveraging with this method. But yeah, like I said, it is a bit of a naive method in the sense that you're reducing every slide, which has a lot of information, to one patch, and that's not a very efficient use of data. If you go to more complicated problems than cancer yes/no, it becomes more challenging to find enough slides to make this method work, and you will actually need to leverage all the information you have in the slides you can collect.
Okay. So I'll just move on to another strategy. This was developed in our group and then later extended by other groups. The idea here is that you actually decouple the classification and the feature extraction process. The way this works is that you still extract all the [00:19:49] patches from the whole slide image, as you can see here, but you put them through a pretrained deep neural network. Now, you can do this pre-training in several ways.
I'm not going into too much detail, but for every patch, you can then extract a feature vector, which is much smaller than the original patch in terms of memory. Then you can stack these feature vectors together such that they again form the original image spatially, and you can train a classification network on this much-reduced representation. So essentially, this is a compressed version of the original image. You can train this part for whatever task you want, and then you can actually leverage the entire image; you're no longer restricted to the information that is located in a patch.
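A minimal sketch of this decoupled compress-then-classify idea, assuming a frozen ImageNet-pretrained ResNet-18 as the patch encoder and 256-pixel patches; both are illustrative choices, not the exact setup from the papers discussed here.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch of "compress then classify": a frozen, pretrained encoder turns every
# patch into a feature vector, the vectors are stacked back into their spatial
# layout, and a small CNN is trained on that compressed grid.

encoder = torchvision.models.resnet18(weights="IMAGENET1K_V1")
encoder.fc = nn.Identity()        # expose the 512-dimensional feature vector
encoder.eval()

def compress_slide(slide: torch.Tensor, patch: int = 256) -> torch.Tensor:
    """slide: (3, H, W) with H and W multiples of `patch`.
    Returns a compressed image of shape (512, H/patch, W/patch)."""
    _, h, w = slide.shape
    grid = torch.zeros(512, h // patch, w // patch)
    with torch.no_grad():
        for i in range(h // patch):
            for j in range(w // patch):
                tile = slide[:, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
                grid[:, i, j] = encoder(tile.unsqueeze(0)).squeeze(0)
    return grid

# A small classifier trained on the compressed representation; it sees the
# whole slide layout at once because each "pixel" now summarizes a full patch.
classifier = nn.Sequential(
    nn.Conv2d(512, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 2),
)
```

Because each spatial position of `grid` now summarizes a whole patch, the classifier on top can use the full slide at a tiny fraction of the original memory cost.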
Aleksandra: So, okay. So you're reducing the dimensionality of the image, but increasing the amount of context that you work with?
Geert: Yeah. So [00:20:49] the features extracted from each patch are what you still use, and this feature extraction is decoupled from the actual end-to-end classification. So here we feed the entire image into a deep neural network, but because the image is compressed, it actually works, it fits in memory.
The disadvantage is that these two steps are disconnected. So you need this compression part, this encoder network, to actually extract relevant features from the patches for your final task, because you're not training this in one go. You're first extracting the features and then training a network to do whole slide image classification.
Aleksandra: So tell me if I'm wrong or not. This to me looks like the kernel that you have on a 2D image that goes and extracts the average of the pixels around it; here, you have this in [00:21:49] a 3D structure where you can stack multiple together, right, or not?
Geert: Yeah. So actually, pathology images-
Aleksandra: Or am I oversimplifying?
Geert: Pathology images are in some sense already 3D because they have three dimensions: the width, the height and the color dimension, but this color dimension is only three samples large. In these types of convolutional neural networks, you actually reduce the spatial size, so you reduce the width and the height. You can see here that these feature vectors have a width of one pixel and a height of one pixel, but you increase the feature dimension, so this color dimension is increased. So it is sort of, you make it more 3D. You can see it like that.
Okay.
So you actually compress the context into the depth, so to say.
Aleksandra: And this is going to be on YouTube so whoever is listening and wants a visual, go ahead and click on the YouTube link because then maybe it's going to be easier to [00:22:49] understand.
Geert: Yeah. Yeah. I also don't want to spend too much time on it, but there was actually a group from Harvard, from Faisal Mahmood's group, who expanded this approach and combined it with an attention mechanism. It would maybe take too long to go into it in much detail, but the-
Aleksandra: But let's mention this attention, because this is another new buzzword that came into the game at some point. At least to me, as somebody from outside of the computer vision community, there was this attention, and at some point attention was everything. Where is it extracting this information?
Geert: So an issue here is that you use the entire image, but not the entire image is relevant to the question you're trying to answer. For example, this background is not interesting if you want to see if metastases are there or not. And this network essentially has no way of saying, "Okay, this area is important and this area is not important," and that [00:23:49] complicates the classification.
So what the authors in this paper did was add an attention mechanism. What an attention mechanism does is, it essentially looks at the features coming from each patch and, based on these features, determines whether this patch is important for the classification or not. The way it does that is, it gives it a value between zero and one, where zero is not important at all and one is very important. That way, you can actually use only parts of the image, the parts that are relevant, and this attention mechanism is also trained. So it is essentially learning what to look at.
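In code, the attention idea can be sketched roughly as follows, in the spirit of attention-based multiple instance learning such as CLAM, but not its exact implementation; the dimensions and the small two-layer attention network are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of attention-based multiple instance learning: each patch feature gets
# a learned importance weight, and the slide representation is the
# attention-weighted average of the patch features.

class AttentionMIL(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 128, n_classes: int = 2):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_features: torch.Tensor):
        # patch_features: (num_patches, feat_dim), e.g. from a frozen encoder.
        scores = self.attention(patch_features)                 # (num_patches, 1)
        weights = torch.softmax(scores, dim=0)                  # importance per patch, sums to 1
        slide_feature = (weights * patch_features).sum(dim=0)   # weighted average over the slide
        logits = self.classifier(slide_feature)
        return logits, weights.squeeze(1)                       # weights can be plotted as an attention map

model = AttentionMIL()
features = torch.randn(1000, 512)    # placeholder: 1000 patch features from one slide
logits, attn = model(features)
```

Because the per-patch weights are part of the model, they can be painted back onto the slide, which is exactly the kind of attention heat map discussed next.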
Very similar, I would say, to a pathologist. When a pathologist makes a diagnostic decision based on a whole slide, they're not using the entire slide to make that decision. They're using very specific parts of that [00:24:49] slide to make that decision. And essentially this network was the first to integrate that into-
Aleksandra: How is it happening? How can you train an attention mechanism?
Geert: Yeah. So you start essentially by just saying, okay, everything is important. And then over time, just like you learn what within a patch is important to come up with your classification, you also learn which patches are important for your classification. So it's the same learning mechanism, it's just a different way of looking at the problem.
So here you just say, "Okay, I want to know which features in these patches are relevant for my classification," and this attention backbone, as they call it here, determines which of these patches, which of these parts of the image, is important. The advantage is that if you do that, you can actually visualize this attention as well. These are some nice [00:25:49] examples of that, I think. Here you see in red the high attention, so what the neural network thinks is important for the diagnostic decision, and in blue the low attention, so what is less important for the diagnostic decision. I can see they put some nice... I think it's a cytokeratin marker, but you can correct me if I'm wrong-
Aleksandra: It is cytokeratin. Yeah.
Geert: So you can see that this attention aligns very well with the cytokeratin. So the network has indeed learned which areas of the image are actually relevant for the prediction. Even if you look at high resolution, this still matches very nicely.
Aleksandra: So a question; is this step in the direction of explainable AI that is now another buzzword or another important concept? Basically those buzzwords end up being important concepts, especially in healthcare.
Geert: Yes, partly so. In this case, I think it's pretty clear that attention is also helpful [00:26:49] in identifying the important areas in the image, and these areas also correspond to things we know are important. But we should be careful when the tasks are a little bit more complicated, for let's say the-
Aleksandra: Molecular marker prediction.
Geert: Yeah, molecular marker prediction. Then sometimes attention can also be confusing. The neural network pays attention to regions where a pathologist would think, "I have no clue. There is nothing relevant there," and then you have to be careful that you don't explain the attention map yourself, because then the network is not explainable, but you are explaining what you think the network is doing. So then you are explainable, but not the network. Complicated.
Yeah. It's a bit the same as with other explainability techniques: if it's just visual, then it's still the person looking at the image who interprets what he or she is [00:27:49] seeing, and this can be correct or not correct. In this case, I think it's easy to interpret, but it can also be very hard.
Aleksandra: One thing, but also, again, you have this molecular marker, the IHC marker, that is kind of an objective ground truth?
Geert: Yes, objective.
Aleksandra: Not just... It just happens to be easy to see visually. If it was an immune cell marker, there would be no way of distinguishing this visually-
Geert: No.
Aleksandra: ... but you still have a ground truth, assuming the antibodies work and everything worked in the lab.
Geert: Yes. That's also not always the case.
That indeed.
But one important thing to note is that this network also still has a separate feature extraction step. So this part and the second part are still separate. It's not trained end-to-end, so the features are learned separately from the rest of the network. Maybe just summarizing: they actually compared this [00:28:49] approach to the approach from Campanella et al. that was in Nature Medicine, so the-
Aleksandra: So this approach is called? What's this approach called?
Geert: CLAM.
Aleksandra: And this stands for?
Geert: Clustering-constrained attention multiple instance learning. So it's also multiple instance learning, but it has attention and it has clustering. I think the clustering is not so important, but the attention part is very important. As you can see, and that's what I explained previously, what they have here is a couple of different tasks, and they compare to the method from Campanella, so the method from Paige. What you see is that if you have lots and lots of data, then they're roughly equivalent. CLAM is still generally a bit better, but CLAM can work with far fewer data samples. So here, they have the percentage of the training set, essentially.
So with only 10% of the training set, CLAM already has pretty good results. MIL is much less efficient. So [00:29:49] that's where you see that in the Campanella paper, you really needed tens of thousands of slides to make it work. Now that we are using more sophisticated methods, you see that the data requirements of this weakly supervised learning are getting much lower.
Aleksandra: So question, how does this compare to the data needed for fully supervised annotations? Is it comparable, or can we still get away with less data if we annotate?
Geert: You can still get away with less data. So in general-
Aleksandra: What? Then we have to keep annotating.
Geert: It depends. So there are tasks for which you cannot annotate, so those are excluded anyway. There are tasks where you actually want a detailed segmentation, and that you cannot do with these types of weakly supervised approaches. The attention map is sort of a surrogate segmentation, but generally not accurate enough to be a real segmentation. So there are still many tasks for which we still need to do [00:30:49] annotations, but I think for classification, we're getting quite close to not needing that anymore.
Aleksandra: Okay.
Geert: Okay. So yeah, very quickly summarizing the methods I discussed up until now. We had the patch-based classification, where you needed fully supervised learning, so detailed annotations. Like I said, you don't have context, so that's a problem. Whether it's data efficient is not completely fair to discuss, because you need annotations for it; anyway, it's not very time-efficient, so let's put it like that.
Yeah. You have to train the network, and you cannot use the slide-level label in a patch-based classifier, so you need the annotations. Then multiple instance learning, as presented by Campanella et al.: you still don't have global context because you're still doing a patch classification, and it's very data inefficient, so you need tens of thousands of slides. On the other hand, you can use the whole-slide-level label, so you don't need pixel-level annotations, and that's a big advantage.
For the last two methods that I showed, so CLAM and neural image compression, you actually do have some global [00:31:49] context, because with the attention mechanism you are actually using the entire slide. It's much more data efficient; not as efficient as supervised learning, but much more so than the multiple instance learning, and you can also use the whole slide image level label.
So there was one issue still that was not resolved with these existing methods, and that's that the feature extraction and the actual slide-level classification were separated. The problem there is that for this feature extraction, these previous methods always used ImageNet pre-trained networks from computer vision. You can imagine that features extracted by a network that was trained on cats and dogs and flowers might not be optimal for histopathology.
Aleksandra: Correct, yeah. Maybe some edges or whatever those...
Geert: Maybe the very basic ones, like the edges and stuff like [00:32:49] that, would work, but the more complicated ones probably not. Even though that is the case, they would still get pretty decent results. But what you actually want is to also learn how to optimize these features for your task, for your histopathology task. The problem is that with the existing methods, you still cannot train end-to-end because of these memory bottlenecks, and I'll just briefly skip back to the slide-
Aleksandra: So a question here: are there pre-trained histopathology networks? There has been so much work done already on those histopathology images; is there not a histopathology pre-trained network?
Geert: Yeah. Not-
Aleksandra: H&E network or?
Geert: There are people who have worked on that. So in computer vision, there's now a bit of a hype around self-supervision, so networks learning the structure of the data from the data itself, without any annotations. These types of methods have also [00:33:49] been applied to histopathology, so you get more histopathology-specific features if you use that strategy, but they're still not necessarily optimal for the question you're answering.
For example, let's say that I train self-supervised on a lot of pathology images. Then obviously it will learn the structure, right? So it will learn nuclei, it will learn lumina, it will maybe learn differences between immune cells and epithelial cells. But then if I want to classify or identify something related to patient survival, who knows whether these features are the ones I need to do that classification? Maybe I need-
Aleksandra: Okay. So we are in this same trap-
Yeah.
... that I just fell into when I thought, "Oh, attention is going to explain the AI"?
Yeah.
We're interpreting visually something that can be interpreted, and we have no way of doing anything with the things that we visually cannot interpret. Okay.
Geert: Yeah. So to actually solve this problem and [00:34:49] allow neural networks to train end-to-end with whole slide images, my PhD student Hans Pinckaers came up with a very nice method last year. It's based on the simple insight that if this is your entire image, this big image which we color-coded here, you can actually process the four parts of the image separately and then later stitch them back together, and you get mathematically exactly the same result as if you would process the entire image at once. In presentations, I explain it a bit like Netflix. You can watch a movie by downloading it part by part, and at the end you've seen the entire movie. Luckily, you don't have to download the entire movie before you can watch it, because that would be very annoying. This is essentially the same trick applied to computational pathology. The network actually sees the entire image, but only part by part; at the end, it has still seen the entire movie.[00:35:49] And-
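A much-simplified sketch of the "part by part, same result" idea follows. Here the first layer is a non-overlapping patch embedding, so tiles can be processed independently and stitched back with exactly the same result as processing the full image; the actual streaming method by Pinckaers et al. also handles overlapping convolutions and backpropagation, which this toy example does not.

```python
import torch
import torch.nn as nn

# Toy illustration of "see the whole image, but part by part". Because the
# first layer is a non-overlapping patch embedding, tile borders don't
# interact, and stitching the per-tile results reproduces the full-image
# result exactly. The real streaming method generalizes this to ordinary
# CNNs and to the backward pass.

embed = nn.Conv2d(3, 64, kernel_size=256, stride=256)   # one vector per 256x256 patch
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2))

image = torch.randn(1, 3, 4096, 4096)                   # stand-in "whole slide" (small for the demo)

# (a) Process the whole image in one go.
full = head(embed(image))

# (b) Process it tile by tile and stitch the feature maps back together.
rows = []
for i in range(0, 4096, 1024):
    row = [embed(image[:, :, i:i + 1024, j:j + 1024]) for j in range(0, 4096, 1024)]
    rows.append(torch.cat(row, dim=3))
stitched = torch.cat(rows, dim=2)
streamed = head(stitched)

print(torch.allclose(full, streamed, atol=1e-5))         # True: same result, much less memory at once
```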
Aleksandra: Is this like the image viewers where you only load the tiles that you are actually looking at on the computer screen?
Geert: Yes.
Aleksandra: The same concept?
Geert: Yeah. It's the same concept. You only load what you need at that point in time, and when you move to another area, you forget the previous part because you don't need to remember it. So what we did is, we compared this approach again to the multiple instance learning from Campanella et al., which, when we did this work, was the current baseline. And yeah, we improved on it for prostate cancer detection, which is also what this MIL, multiple instance learning, approach was first used for.
So we improve on it across the board. We are also more data-efficient using the streaming approach, and we have a network that generalizes much better to unseen data. So that was actually quite nice. And with these types of networks, which are trained end-to-end, you can use standard computer vision [00:36:49] techniques to add some form of explainability. So these are the saliency maps or the Grad-CAM approaches that some of you might know.
Aleksandra: I don't know. What are the saliency maps?
Geert: Yeah. So the-
Aleksandra: How is it different from attention?
Geert: Yeah. So attention is something the network learns itself. It really attributes values between zero and one to parts of the image that it thinks are important. What you do here is, you take an existing network and an image, and you look at the parts of the image that, if you change them, change the output of the network the most. So if, let's say-
Aleksandra: So, like reverse engineering the attention?
Geert: Exactly. Exactly. Yes. They also call it guided backpropagation. You're essentially reverse engineering: okay, these parts of the image, if I would remove them or change them, then the output of my network would change. So you're actually not-
Aleksandra: And that means that these are important?
Geert: Yes. Yes. Because [00:37:49] the areas that are unimportant you should be able to remove from the image without affecting the output of the classifier. The thing is that you're then not leveraging something that the network has learned, so these maps are a bit less accurate than attention. That's why people generally prefer attention, but you can also use it as some form of explainability. You can see here, for this prostate cancer detection task, that these networks focus on the glands and not so much on the surrounding stroma, so it does make some sense.
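A minimal sketch of a gradient-based saliency map of the kind described here: backpropagate the predicted class score to the input pixels and look at where the gradient is largest. Grad-CAM and guided backpropagation are refinements of this basic idea; the model and input below are placeholders, not the network from the study.

```python
import torch
import torchvision

# Minimal gradient-based saliency: which input pixels would change the
# network's output most if perturbed?

model = torchvision.models.resnet18(weights=None, num_classes=2)
model.eval()

patch = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in for a tissue patch

logits = model(patch)
score = logits[0, logits.argmax()]        # score of the predicted class
score.backward()                           # gradient of that score w.r.t. the input pixels

# Saliency: per-pixel maximum absolute gradient over the color channels.
saliency = patch.grad.abs().max(dim=1).values.squeeze(0)  # (224, 224) heat map
```

Unlike attention, nothing here is learned for the purpose of explanation; the map is read off an already-trained network, which is why it tends to be noisier.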
So what we then did was actually quite interesting. We combined this streaming approach with the CLAM model, so it's now called SCLAM. Then you actually have CLAM, but you can train it end-to-end, so you have both the attention, and you can train the feature extractor and the classification part of the network at the same time. That was very nice because we showed that if you combine these, so you have the optimal features and you have the attention [00:38:49] mechanism, you can vastly outperform the original CLAM method.
Now, what you have to think here is that we're actually training a convolutional-
Aleksandra: And you showed that on the CAMELYON dataset?
Geert: Yeah. It's also very data efficient, because CAMELYON16 is in total 400 slides. So we're training with only 270 slides, without using any of the annotations, and we're actually getting very close to models that use the full annotations; we're only about 0.01 off. So that was actually quite a nice result. And what you have to keep in mind is that this is at 20X magnification, so we're actually training a ResNet-50, a relatively modern convolutional neural network architecture, end-to-end with images that are 50,000 times 100,000 pixels, without actually needing a supercomputer.
Aleksandra: So you did hack it?
Geert: Yes, we did hack it [00:39:49] a bit. What might be nice for people wanting to explore this is that all of this is open source. You can... it's very easy-
Aleksandra: I'm going to put all those links in the show notes and in the YouTube notes as well, so everybody who is at the level to do it can just go there, which I obviously am not, but I'm not striving to be, so that's okay.
Geert: But it's actually relatively easy to use. You can just take your standard neural network code, and you need six lines of code to add streaming to it; then you can essentially apply this to whole slide images.
The last thing we did with this, which was actually nice to do, was to apply it to a clinically relevant task: prostate cancer recurrence prediction. What we put in were TMA cores. They're not that big, but they were still like 2,500 times 2,500 pixels, 5,000 times 5,000 pixels, something like that, and we trained end-to-end to predict years to recurrence for patients. We added concept-based [00:40:49] explanations to this, so that you can again see the relevant areas in the image that the network uses to come up with this prediction. And we showed, with a multivariate analysis, that this has added prognostic power in addition to the original Gleason grade, et cetera, et cetera. So we also used it for a clinically relevant endpoint.
What's nice is that using this explainability layer, you can show which patterns you see in patients who recur quickly, so within a year, and we also showed them to pathologists. What you see here are things like cribriform growth, sheets of cells, et cetera, and these other patterns are then associated with a longer time to recurrence. You can see that this also makes sense visually, at least for our purposes.
Aleksandra: So this was already kind of known; pathologists knew that this was associated with prognosis, and [00:41:49] your network came up with the same conclusion. Or is there some finer shading to all of this? Like, is there something a pathologist can learn from this?
Geert: Yeah. So this, I think, is a bit too small-scale to draw definite conclusions from. What we see is a bit of a confluence. If you look at Gleason grading in prostate cancer, we know that, for example, cribriform growth is now classified as a pattern 4, where historically it could be both 3 or 4. Actually, there's now a bit of a movement within the uropathology community that maybe cribriform growth should be noted separately as a bad prognostic factor. That is something that we see confirmed in these results: in addition to the Gleason 4 patterns, cribriform growth is a separate indicator of bad prognosis.
So it looks like pathologists can [00:42:49] also use it, at least to guide discovery of prognostically relevant patterns. I'm a bit hesitant to really say that they, at this stage, would learn a lot from it, because it's more that it confirms their intuitions than that we showed them something completely new, but obviously that's something we hope to do in the near future.
Aleksandra: How do I say it? In the same way, would it be possible to maybe, at some point, translate a molecular prediction into something visually recognizable?
Geert: Yeah. So that's something we're now looking at in several projects. Also, other groups have, for example, already done ER/PR prediction from morphology for breast cancer, and that works surprisingly well.
Aleksandra: Yeah. I've heard about IHC prediction from H&E.
Geert: Yeah. So groups have done that as well, and I think in some cases you can actually do that. Where I [00:43:49] don't believe it will work: I think you can, for example, learn the morphological patterns of lymphocytes, so predicting a CD3 stain from H&E, I think, will still mostly be feasible, not perfect, but feasible. But then doing, for example, a CD8 prediction relative to the non-CD8, CD3-positive cells, I think that's impossible. This information is simply not in the morphology. So you have to be a bit careful with the application there, I guess.
Aleksandra: Okay. So it's definitely not one size fits all?
Geert: I don't think so, but I think there is a surprising number of use cases where you can get that to work.
Aleksandra: Yeah. I would say as well that there is a limitation, because there is a reason why these are molecular markers; you don't see those molecules. But there's also a wide spectrum of visual features that we are not really focusing on, but [00:44:49] that can still be interpreted. Whether it's possible to teach it to a person, I don't know, but probably it's possible to extract it from the image with computational pathology.
Geert: Yeah. So from my own group, we've done, for example, MYC translocation prediction for diffuse large B-cell lymphomas, and that goes surprisingly well. We're really at like 80% accuracy. There is some morphological substrate of a MYC translocation in the morphology of the lymphoma, but we're still working on how to make that interpretable, such that you can actually say something meaningful about what it is in the morphology that allows the network to predict that.
Aleksandra: Are you anywhere close to finding it, or is it going to be trial and error?
Geert: That's also why I say that these attention mechanisms are not perfect. We are using attention for this, so we know the spatial regions that are important, [00:45:49] but what exactly is important within those regions? Maybe it's that the nuclei are on average larger than in the other cases. It's currently visually not... So if you would show it blind to a pathologist, I don't think they could see the difference.
Aleksandra: You would show it to a very, not even blind, educated pathologist who would... Well, if you showed this to me and told me exactly where and what I am supposed to look for, maybe in the first slide I would know, but in the third, I would be completely lost.
Geert: Exactly. So there's a-
Aleksandra: But there is a limitation to our visual cognition, right? It's not a question of whether it is there or not; it's whether it's possible to consistently recognize it in the images. Can a human being do that? And I can say I probably can't. I can barely do the percentage estimates that [00:46:49] pathologists are routinely asked to do, and these are correlated to survival and Kaplan-Meier curves are based on them. I'm like, "Don't use mine, because that's not going to be accurate."
Geert: No, it's indeed about being able to visually recognize it at all, but it's also often a quantification task across large areas, right? So also these percentages: yeah, you can do it on 10 high-power fields, but you cannot do it across the entire slide, at least not reliably. So it could also be that it's something you can visually extract, but it's not reproducible enough-
Aleksandra: But not quantify-
Geert: ... to be useful.
Aleksandra: Yeah. And quantification is part of what flows into the whole result of the network?
Geert: Yes.
Aleksandra: Oh my goodness. There's a lot to it. Thanks so much for explaining this. So now, on another note, how are [00:47:49] these approaches making it into the clinic? Is your company working on incorporating this? Because obviously, in the research environment, in academia, you can play, you can do whatever, gather a lot of data, but how do those approaches get translated? Because you have such great approaches that almost match the annotations, but what I'm seeing is that everyone is still doing annotations, because that's the easiest. The pathologist is the ground truth, whether a good one or not, but it is a trained human.
Geert: Yeah, I-
Aleksandra: How does this get translated?
Geert: Yeah. I think right now, for these approaches, it's a bit of a battle. I think the accuracy and the performance are nearly there; I think within this year you will probably have methods that are as good as working with fully supervised, fully annotated [00:48:49] slides. You probably still need more data with the weakly supervised methods than with the fully supervised ones, but I think performance-wise, you can get close enough. Then there is a second question, of what we call explainability, and of acceptance by clinicians and pathologists. For example, for the PANDA challenge, which was a challenge on prostate cancer grading, we ourselves developed an algorithm that was based mostly on fully supervised annotations. That worked really well, and the advantage is that you can also visually show the predictions for the individual glands to a pathologist.
So it has some form of interpretability. But within that challenge, the best algorithms actually didn't do this at all. They just used the weakly supervised approach, where they only use the slide-level labels and not the annotations. The question is, will pathologists accept an algorithm where a slide goes in and a Gleason grade, or whatever prognostic [00:49:49] marker for the patient, comes out without any interpretable intermediate step, right?
Aleksandra: I guess at this point, the answer is no?
Geert: Yeah, exactly. So I think the main reason why not all the companies are switching to these weakly supervised methods is that the acceptance in the clinic of fully supervised methods is higher at this moment. So what we have to solve as computational pathologists, let's say, is this explainability part. How can we give pathologists insight into what these weakly supervised methods are doing and increase the trustworthiness, such that they are accepted in clinical use? But also in the toxicologic pathology community, I think you would have the same questions.
Aleksandra: Yes, definitely. Definitely. So we mentioned a couple of surrogates, like the attention, if it matches what is in the [00:50:49] morphology. And I think there are enough applications where it does, but also plenty of applications where we don't know, or don't want to make a false interpretation. So yeah, a lot of exploration still. So…
Geert: Yep. I think it's also a discussion that we need to have. For example, here I show attention maps. Do you think this is acceptable? What would you like to see? Do you need... So in some projects, we are now looking at report generation, really generating text that describes what an algorithm is seeing. I'm not sure whether the people who watch this podcast have seen the DALL-E network that was developed by OpenAI recently, where you provide a string of text and then it generates an artwork for you. I highly recommend people look that up.
Aleksandra: I haven't, but I'm going to link to information about it in the show notes as well.
Geert: Yeah. So I think, generally, for me the holy [00:51:49] grail of explainability would be if a neural network would just write down in common language-
Aleksandra: What it's doing.
Geert: ... what it's seeing, so that everybody can understand it. I think we'll need some time to get there, but if you look at what the big tech companies are now doing with natural language processing, I think we're getting quite close to having these techniques.
Aleksandra: Okay, thanks so much for explaining this. There are a lot of things still to explore, and the field stays interesting. There is still an exponential growth of solutions trying to hack it, trying to get around the logistical and technical limitations, so it definitely stays cool. Thanks so much, Geert.
Geert: You're welcome. And then I'll save the unsupervised learning for next time.
Aleksandra: Yes. We're going to meet again and then again, to talk about the company.
Geert: That sounds good.
Aleksandra: Okay. Thank you so much.[00:52:49]