OK, I have to consent to being recorded. OK, I consent. Excellent, right. So, just by way of introduction: my name is Fergus Boyles, and I'm a research software engineer in the Oxford Protein Informatics Group here in the Department of Statistics. Prior to that, I was a student in the Department of Statistics, so I've been around for a while. I'm going to talk to you a little bit about the use of machine learning in drug discovery.
I'm going to touch on quite a few topics, both things that I worked on and things that other people have worked on, but the latter half of the talk is really going to focus on examples of research done either by myself or by other members of the Oxford Protein Informatics Group. So this is by no means intended to be an exhaustive discussion of all of the things that people have done using machine learning in drug discovery; that would take an entire course,
and you probably still wouldn't cover all of it. But I hope that this will give you some idea of how computational methods in general, and machine learning in particular, are benefiting the drug discovery process and what the practical implications of that are. So, just a brief overview of what I want to talk about. I'm aware that this is very much a statistics audience.
So I'm going to open with an introduction to the drug discovery process: what it entails, why we care about it, and why we might want to apply computational methods to it at all. Then I'll briefly introduce the concept of computer-aided drug design and discuss what sort of computational methods were historically employed prior to the machine learning hype train, or revolution, depending on your stance on it.
Then I'll give some examples of well-established machine learning techniques in drug discovery: what they're used for, what sort of problems they solve, and why they're beneficial over other methods. And then towards the end, I'd like to spend some time highlighting recent developments in drug discovery that have really benefited from recent advancements in deep learning techniques.
So just before I start, I'm aware that quite a few people in this audience are either members of OPIG or have been through the doctoral training centre in some capacity, so you may have seen the introduction to drug discovery talk anywhere between one and 50 times before. If you want to tune out for the first ten minutes, I won't be offended. So, just to get us started: what is drug discovery? To really answer that question, we first need to understand what a drug is.
So what is a drug? We intuitively think of a drug as a medicine: we take a medicine to cure ourselves. But what actually are drugs, in terms of biological processes? Infection and disease are just the result of the behaviour of macromolecules in the body. Proteins perform pretty much every task in the body. Proteins have specific functions, and if they're carrying out their function correctly, the body is doing OK.
If a protein starts to misbehave, functioning too much or too little, or a foreign body like a virus introduces a foreign protein into the body, bad things can happen. This is how you get diseases. This is how you get symptoms of infection. So the key to trying to treat or manage diseases or infections is really trying to figure out what is causing the problem, and how we can either make that molecule behave properly or stop that molecule doing the thing it's not supposed to do.
And so in pharmaceutical research, we have this concept of a drug target. A drug target is a key molecule, typically a protein or occasionally a nucleic acid, that has been implicated in an infection or disease. As I said, this can be a protein in the body that's misbehaving, or a protein that's part of, for example, the life cycle of a virus.
I'll get to an example of both of these in just a moment. There are all sorts of ways of identifying and validating whether a target is indeed implicated in a condition, which I'm not going to go into today; that's really a topic of research in and of itself. But the key idea is that in order to treat disease, we want to target usually a protein, occasionally a nucleic acid, in the body and alter or inhibit its function.
Now a drug, in pharmacology at least, is any molecule that interacts with a drug target in order to obtain a therapeutic effect. That therapeutic effect could be mediating a condition, managing symptoms, or restoring the function of a protein. It could be treating an infection by disrupting the life cycle of a pathogen. Anything like that. It's really a broad catch-all.
Now, just to distinguish between different types of drugs, because it's an incredibly broad umbrella term, I'd like to distinguish between two fundamental classes of drugs. The first is small molecule drugs, things such as paracetamol, anything that you take in tablet form, for example. These are small chemical compounds that are typically produced by chemical synthesis.
In contrast to this, we have a class of drugs known as biopharmaceuticals, which are an incredibly broad category of drugs that are extracted, synthesised or obtained from biological sources. The obvious topical example of this is a vaccine. These can potentially be very large molecules: an antibody, for example, is an entire protein, much larger than something like a paracetamol molecule. Today, I'm going to focus just on small molecule drug discovery.
But just be aware there's an enormous field of different applications of computational methods in medical research. As an example of what a target is and how a drug functions, I'd like to start with an example of a protein in the human body. This is a protein called thrombin. The grey structure on the right is an experimentally determined structure of the protein. Thrombin is an enzyme that acts as a catalyst in the blood clotting process.
Blood clotting is this entire cascade of biological processes that results in blood cells aggregating, which obviously seals wounds, but when it's misbehaving, you get conditions like blood clots, thrombosis and strokes. So it's something that we need to be very aware of. An example of a drug that targets thrombin for a therapeutic effect is a peptide known as hirudin.
The structure of this is shown here. It's a naturally occurring peptide that's produced by leeches, which, as we know, feed on blood. In order to feed on blood, they need to prevent the blood clotting. Their salivary glands naturally produce a peptide that binds to thrombin and stops the thrombin molecule interacting with other things, because it's already interacting with the hirudin, thereby preventing it from catalysing the blood clotting process.
And this makes hirudin useful as an anticoagulant for treating thrombosis, and indeed several anticoagulant drugs on the market are based on hirudin or chemical derivatives of it. The second example is a pathogen. The example I'm going to use here is the human immunodeficiency virus, HIV. A key protein that plays a role in the HIV life cycle is a protein called HIV-1 protease. Now a protease is, and again, this is an enzyme,
an enzyme that breaks up a large chain of amino acids into distinct subunits. This is important for the life cycle of HIV because the proteins that are involved in the life cycle of HIV are produced as a single amino acid chain. So you have multiple proteins all together. In order for these proteins to be functional, they need to be split up into independent units, and that is the job of the HIV protease.
HIV protease has this sort of groove or channel in the middle, and this is where it sticks to the peptide chain and breaks it up. So the way antiretroviral treatments for HIV work is by inhibiting the function of HIV protease, thus preventing it from breaking up these proteins and therefore disrupting the life cycle of the virus. The way this works is that an inhibitor,
and this is the molecular structure you see on the right, is designed to bind in that groove, in that binding site on the protease. The protease can't do anything to the inhibitor molecule. It can't cleave it like a peptide chain. And so it just stays stuck in there, preventing the HIV protease from sticking to the peptides that it's supposed to be cleaving, and thereby disrupting the life cycle of the virus.
So those are just two examples of very different types of drug targets that we treat using small molecule drugs. OK, so that's what a drug is. So how do we actually develop drugs in practice? If you've been to any drug discovery talks before, you'll have seen a variant on this diagram in one form or another. The first thing to understand about the pharmaceutical development process is that it's a very long-winded, very expensive process.
An enormous amount of time is invested just getting from identifying a target to having a candidate drug that binds to that target. That initial phase, known as drug discovery, can take anywhere from a couple of years up to over 10 years, with an average of around four years across the UK pharmaceutical companies.
But even once you have such a candidate, you then have to go through several stages of preclinical animal models and clinical trials in order to verify that the drug works, that the drug is safe, and that the drug is effective enough to warrant any potential side effects. Each of these steps can take between one and two years and cost millions of pounds. And so, from target identification to actually having an approved drug on the market,
it can take in excess of 10 years and cost well in excess of a billion pounds. It's this early-stage drug discovery process, where you develop drug candidates, that we're really going to focus on today. The drug discovery process, once you have a target identified, is a cyclical process, starting from a collection of compounds that you have access to.
You can buy them, you can make them in the lab, somebody else can make them for you, whatever. You take your library of compounds and screen that entire library, or a section of that library, against your biological target. Compounds that bind to that target are called hits. So first you're just trying to identify hits, and then in subsequent stages you take your initial hits and try to optimise both their affinity for the target,
how strongly they bind to that target, and also their selectivity, so that they don't bind to other targets. Side effects of medicines are often caused by a molecule also interacting in some way with a protein other than the intended target: off-target effects. And so trying to balance affinity and selectivity is a really important part of this process.
Once you have a molecule that you think has satisfactory affinity and selectivity, you then have to go into a further process where you optimise other desirable pharmacological properties, for example, ensuring it's not toxic and ensuring that it doesn't aggregate, all the while retaining the desired affinity and selectivity. The diagram on the right really emphasises that cycle.
You identify some hits, you check for toxicity, you optimise, you optimise again, you check that you can actually make the compound, because it doesn't matter how good an inhibitor it is if you can't synthesise it, and this could take many repetitive cycles. And so it really is a very long, very expensive, multifaceted process.
So, just to give an idea of how this is done in practice: the initial identification stage has traditionally been performed in a process known as high throughput screening, where you have robots in a lab that rapidly test, or assay, very large numbers of chemical compounds against the biological target of interest to see if any of them bind at all.
Although advances in technology and methodology have continually increased the speed and efficiency and reduced the cost of this process, high throughput screening in general is resource-intensive. Even if you can do it very quickly with certain set-ups, you need that set-up in place. You need the resources to do it. You need the expertise to do it. And so it's an incredibly laborious process.
And so you can start to understand why drug discovery is such a slow and expensive task indeed. Something that you may have read headlines about at various points is that there's a well-known productivity problem in the pharmaceutical industry. It's been observed that despite continuous advances in technology and research methodology and increasingly available resources,
the productivity of the pharmaceutical industry has continued to decline. Just to put some solid numbers on that: in 2012, a paper by Scannell et al. showed that ever since 1950, the cost of bringing a new drug to market has doubled roughly every nine years. And indeed, if you look at more recent data, that's continued from 2012 to 2021. So it's really quite terrifying. There are a myriad of reasons for this. In part,
it can be attributed to marketing issues, such as this phenomenon known as 'better than the Beatles'. If you're designing a new drug, it doesn't just have to work. It has to work better than anything else that we have, and it has to work sufficiently better than anything else to be worth the investment, to be worth making, to be worth marketing.
In addition to this, for a very good reason that I'll get on to in a moment, we've been observing increasingly stringent requirements from government regulators to ensure the safety and efficacy of drugs. These are things you fundamentally can't address by just throwing computers at the problem.
However, other problems, such as inefficient resource allocation, just brute-forcing by throwing money at the problem, certainly contribute to this productivity crisis, and there we really can try to optimise the process to bring costs down. Just as an aside on why there are very good reasons for having regulations in place that increase the cost and time taken to develop a drug: the example is a historical drug called thalidomide that you may have heard of.
Now, thalidomide was initially marketed as an over-the-counter sedative in the late 1950s in Europe for things like insomnia, anxiety and suchlike. Initially, it was noted as safe for use in pregnancy. However, growing evidence in the late 1950s and early 1960s linked thalidomide to birth defects in children of mothers who had been taking thalidomide during pregnancy, which led to most countries withdrawing its use in the early 1960s.
However, precisely due to a lack of clear regulation, it remained in use in Spain well into the 1970s and possibly later. It's estimated that anywhere between 10,000 and 20,000 people are now affected by the horrific birth defects that were caused by its misuse. And it's really because there was no regulation or formal requirement for proving the efficacy or safety of drugs in the 1950s.
Now, in the aftermath of the thalidomide tragedy, many countries introduced stricter regulations for drug testing and approval. So, for example, the U.K. Medicines Act of 1968, which required all current and future drugs to be tested for safety and efficacy, was a direct consequence of the thalidomide disaster.
And just to put some numbers on this: in the late 1950s and early 1960s, when this was happening, there were on the order of 30,000 to 40,000 drugs that were legally available in some form in the U.K. By the start of the 1990s, when all of these drugs had finally been tested in accordance with the Medicines Act, only 5,000 were licensed and approved for use.
So it's a really terrifying number of drugs that were just thrown onto the market with no real care for whether they were safe. So there are very good reasons that we can't just try to cut back on the clinical trials phase. We can't save time and we can't save money there. So what can we do? Well, it turns out that very few candidates that enter clinical trials make it to the market, with most failing due to lack of efficacy or safety concerns.
This in itself contributes enormously to costs, because a successful drug has to pay not only for its own development, but for all of the work, the optimisation, the development that went into the drugs that did fail. And so one thing that we can do to try and address this productivity crisis is to try and replace the expensive and laborious steps preceding clinical trials using computational methods. There are really two aspects to this.
The first is reducing the cost of designing the drug candidates by automating processes. The second is improving the quality of the candidates that enter clinical trials. For example, can we predict beforehand that a molecule is going to be toxic?
That immediately allows you to remove things from the clinical trials pool. So this brings us on to the concept of computer-aided drug design, which refers to any of a set of computational methods that are used in the preclinical drug discovery process in order to identify and develop compounds into clinical drug candidates.
Fundamentally, the goal of computer-aided drug design, or CADD, as it's often known, is to predict whether a molecule binds to a biological target and, if so, how strongly. By analogy to the high throughput screening I mentioned previously, the process of applying computational methods to screen a large compound library is known as virtual screening. And just like traditional lab-based drug design, this is an iterative process.
You perform virtual screening, then go and try to optimise your hits from virtual screening, and then go back to a computational method to see if you think it still binds. And so, again, it can carry on for quite a few iterations. In this talk, I'm really going to focus on the virtual screening task for much of the talk.
But, and I'll mention this a few times, computer models have been successfully used for all sorts of tasks in computer-aided drug design, analysing properties such as how a compound is going to be metabolised. Is it going to be toxic? Is it going to aggregate? All sorts of things. I'll give a few examples of this later on.
So just to break down what virtual screening entails, you can typically break it down into two types of approaches. The first of these is ligand-based virtual screening, where you're using methods that are entirely based on the chemical properties of your molecules. Ligand-based virtual screening is the process of saying: OK, I already have some ligands that I know bind my target of interest.
So can I use that information to screen all of my other compounds to see if anything else is also likely to bind my target of interest? Do I have anything that's similar to things that I know interact? So if you have some known ligands for a target, you can directly go and apply ligand-based methods.
In contrast with this, we also have structure-based virtual screening, which instead uses information about the 3D structure of the biological target to predict not only if a molecule will bind, but if so, where and how. How is it going to bind? What interactions does it make, and how strongly does it bind? And so these two methods use very different forms of information. If you have known ligands, you might use a ligand-based method. If you don't have any known ligands,
but you do have a 3D structure of the protein, you might use that. And of course, there are some, quote unquote, hybrid methods that combine these two approaches when you have both of those forms of data available. So, just to give an idea of what ligand-based virtual screening entails.
One of the key concepts in comparing and screening molecules computationally is that we need a way of representing a molecule in a computer and, given such a representation, a way of rationally comparing the similarity of molecules. One example of how this is done is a technique known as molecular fingerprinting. The idea is that we know the structure and the composition of our molecule.
On here I have the example of the 2D structure of a paracetamol molecule. Different molecules are very different sizes and shapes, so it's not necessarily clear how to compare them analytically. Molecular fingerprinting is a concept that looks at the structure of the molecule and what features are present in it, what atoms are next to each other, what groups are next to each other, and converts
molecules of potentially very different size and composition into a fixed-length, finite-size vector, in which each bit encapsulates a certain functionality or part of the chemical structure.
Once you have this sort of vector representation, you're now in a very good position to apply any machine learning method. Back to the point: similarity searching, for example, is to compute these sorts of bit vectors for all of your molecules and then just compare your library of compounds to the fingerprints of your known molecules using a metric such as the Jaccard or Tanimoto coefficient, or some other similarity score.
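As a rough illustration of the similarity search just described, here is a minimal sketch in Python. It assumes fingerprints are already available as sets of "on" bit indices; the compound names and bit patterns are hypothetical toy values, not real fingerprints, and a real workflow would generate fingerprints with a cheminformatics toolkit.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) coefficient: shared on-bits / total distinct on-bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank library compounds by similarity to the query, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy fingerprints: each set lists the bits that are switched on.
query_fp = frozenset({1, 4, 7, 9})
library = {
    "compound_a": frozenset({1, 4, 7, 9}),
    "compound_b": frozenset({1, 4, 8}),
    "compound_c": frozenset({2, 3, 5}),
}

ranked = similarity_search(query_fp, library)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Storing only the on-bit indices keeps the sketch short; a packed bit vector would behave the same way for this metric.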
So this sort of approach is known as similarity searching, and it's just one way of representing a small molecule in a computer. That's how this is done in practice. In contrast to this, we have structure-based virtual screening, where instead we're trying to make use of the 3D structure of the target, so we might try to explore how the molecule might interact with the target. There are sort of two main contrasting computational techniques that might be used for this.
The first of these is a technique known as protein-ligand docking, where you try to rapidly sample possible bound conformations of the ligand, to see how you think it might bind. You might use, for example, a Monte Carlo search algorithm to do this, and then try to rapidly estimate the binding affinity using what's known as a scoring function, which I'll go into in more detail in just a little bit.
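To make the idea of a Monte Carlo pose search concrete, here is a heavily simplified sketch. It is not how any particular docking program works: the "pose" is just a 3D position (real docking also samples orientation and torsion angles), and `toy_score` is a hypothetical stand-in for a scoring function, with lower values meaning a better pose.

```python
import math
import random


def toy_score(pose):
    """Hypothetical scoring function: squared distance from an assumed
    ideal position. Lower is better, like an estimated binding energy."""
    target = (1.0, -2.0, 0.5)
    return sum((p - t) ** 2 for p, t in zip(pose, target))


def monte_carlo_search(score, start, steps=2000, step_size=0.5,
                       temperature=0.5, seed=0):
    """Metropolis Monte Carlo: propose a random move, always accept
    improvements, and accept worse poses with probability exp(-delta / T)."""
    rng = random.Random(seed)
    current = tuple(start)
    current_score = score(current)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = tuple(x + rng.uniform(-step_size, step_size) for x in current)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


start_pose = (0.0, 0.0, 0.0)
best_pose, best_score = monte_carlo_search(toy_score, start_pose)
print(best_pose, best_score)
```

Occasionally accepting a worse pose is what lets the search escape local minima, which matters far more on a real, rugged energy surface than on this smooth toy one.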
And the key idea is you're trying to do this quickly, because you have millions of compounds to screen. Now, in contrast to this sort of rapid-fire approach is what's known as molecular dynamics, which has all sorts of applications.
But in this context, you run simulations of the protein-ligand interactions to try and predict where the molecule wants to sit in the active site of the protein. You set off the simulation and let it decide where it wants to sit, and based off of those simulations you can try to gain an understanding of what the dynamics of binding are. Because this is fundamentally a dynamic biological process, not a static snapshot.
So you really want to understand those binding dynamics and, from that, try to compute, again using force fields, the interaction energy or the binding affinity between the protein and the ligand. A high-affinity ligand is something that has a greater change in free energy upon binding, which makes it more tightly bound. It doesn't want to separate. And that's what you're looking for
in a binder, in a drug. So protein-ligand docking is much faster than molecular dynamics. It can be orders of magnitude faster, depending on how you configure them. But it sacrifices this dynamical information and the sort of detailed, accurate free energy calculations for speed. And so this is always the trade-off that you're making when you're trying to efficiently screen large numbers of compounds.
In practice, protein-ligand docking is the most common structure-based technique used in drug discovery, just because it's efficient. You couldn't use molecular dynamics to screen millions of compounds. That would be crazy. So, more on protein-ligand docking. It's one of the active areas of research, particularly in trying to use machine learning to improve things.
Protein-ligand docking, in addition to its search algorithm, uses, as I said, what is known as a scoring function, which is a quick, dirty, approximate function that tries to estimate the free energy of binding based on a single static snapshot: this is where the protein is, this is where the ligand is. This gives you a quick estimate of what you think the energy is, so you can rapidly assess the poses predicted by the docking algorithm and decide:
do I think this is a reasonable pose? How strongly do I think it binds? And can I rank all of my different ligands by how strongly my scoring function thinks they bind? That lets you prioritise things that you think bind more strongly. There are many, many pieces of docking software that are regularly used for this process. They all have different strengths and weaknesses. I'm not going to name names here in case I upset certain people.
So that was all a lot of theory, but just to give an example of what a protein-ligand docking result might look like in practice, let's go back to our example of thrombin and hirudin. We have the structure of thrombin in grey, and we have an experimentally determined binding pose of the hirudin molecule in cyan, which was determined by X-ray crystallography. Now, just using a docking algorithm to try and sample that binding pose, the best result that was returned by the algorithm is the pose in magenta.
You can see that a lot of the structure aligns very closely. Apart from on the left, where we have one ring that's clearly out of place. This is an example of the sort of quick and dirty docking and scoring process that gives you a rough idea of where the molecule sits. And based on that conformation, your scoring function will give you some estimate of the free energy of binding.
So, just in practice, what a scoring function might look like, because it really is quite a key component of this process. A scoring function is any sort of approximate method that tries to estimate the free energy of binding. Classically in docking, this is done as a sum of physical or empirical energy terms, the key being that they're all easy to compute rapidly.
This might include, for example, terms that represent van der Waals potentials, terms that represent electrostatic potentials, terms that try to quantify the energy of hydrophobic contacts, hydrogen bonding terms, all sorts of things like this that might go on in molecular interactions. A very common thing to do is to define some of these terms, approximate them quickly, and then just use a linear regression to assign weights to each of these terms.
That gives you the best estimate of binding affinity that you can compute rapidly. And it's really a multi-tool of structure-based drug discovery: it's used to determine whether a pose is physically reasonable, it's used to rank ligands by their likelihood of binding, and it's used to try and actually predict the strength of that binding, the binding affinity of that ligand. It really is used for a lot of different tasks. And that brings us on.
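The classical recipe described above, a weighted sum of energy terms with weights fitted by linear regression, can be sketched as follows. Everything here is illustrative: the three term columns (labelled as van der Waals, electrostatic and hydrogen-bonding contributions), their values and the affinities are made-up toy numbers, and the fit is plain ordinary least squares via the normal equations rather than any specific published scoring function.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix is square and non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_weights(terms, affinities):
    """Ordinary least squares: solve (X^T X) w = X^T y for the weights w."""
    k = len(terms[0])
    xtx = [[sum(row[i] * row[j] for row in terms) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(terms, affinities))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


def score(weights, term_values):
    """Scoring function: weighted sum of the precomputed energy terms."""
    return sum(w * t for w, t in zip(weights, term_values))


# Toy training set: one row per complex, columns = [vdw, electrostatic, hbond].
true_weights = [0.5, -1.0, 2.0]  # hypothetical weights used to build the toy data
terms = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [2.0, 1.0, 3.0],
]
affinities = [score(true_weights, row) for row in terms]

weights = fit_weights(terms, affinities)
print(weights)
```

Because the toy affinities are generated without noise, the fit recovers the generating weights; with real measurements the regression would only give a best compromise across the training complexes.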
And so that's a very brief overview of the sort of techniques that are used in computer-aided drug design, particularly in virtual screening. With that in place, I'd like to finally talk about how machine learning methods are being used in drug discovery, particularly for this virtual screening process. For context, statistical modelling and machine learning are well-established tools in drug discovery.
I couldn't give you an exhaustive list of things that people have done in the past 30 years, but just a few examples. Using representations such as the molecular fingerprints we introduced earlier as features for support vector machines has been successfully used for virtual screening,
for example, by Jabat et al. in 2018. An interesting example of substituting secondary assays with computational methods has been the use of decision tree classifiers to try and predict whether or not a molecule passes the blood-brain barrier, which is a very important topic in pharmacology.
And just as a third example of this, there's a very good paper from 2006 on the use of naive Bayes classifiers to try and predict whether a molecule is likely to be toxic or not. So those are just a few examples of the things that people have applied machine learning to historically in this field.
But in recent years, there's been a lot of interest in the use of machine learning in drug discovery, and arguably one of the big reasons for this is the ever-increasing quantity of data that's actually available. Traditional methods, such as using a linear regression to fit a scoring function, simply can't leverage all the data that's available. So, just to give a feel for the sort of data that's available.
Among the publicly available databases is one known as ZINC, which contains 230 million available compounds with 3D conformations ready to use in docking, and a further 750 million compounds that are known to be commercially available. The idea being that you can take the compounds from ZINC, you can screen them, and you know you can go and buy them somewhere. Another example of this is a database known as ChEMBL, which records biological assay data,
so measuring whether things interact. It contains around 17 million biological activities, how strongly things bind, for two million different compounds across around 14,000 targets. That really is an enormous amount of data that you might use to try and fit some predictive model. And thinking about structure-based drug discovery, a database known as PDBbind
is the largest collection of solved structures of protein-ligand complexes. It contains around 18,000 of these complexes, ready to use. So that's a lot of data. But just to give a feel for whether this data really is representative of chemistry, and indeed whether this data can be representative of chemistry.
One thing that's quite interesting to do is to say: OK, I know what properties a drug-like molecule typically exhibits, and that puts constraints on the space of molecules. Based on that, you can use combinatorics to estimate the size of the space of molecules that could possibly exist and be drug-like. A very common estimate of the size of that space is ten to the power of 60 molecules.
That is enormous. That is impossibly enormous. Just to give an idea of how impossibly enormous that is, if you're boring like me, you can sit down and do a back-of-envelope estimate of how many atoms there are in the solar system, and arrive at a figure of around 10 to the 57 atoms.
So there are potentially a thousand times as many drug-like molecules that you could possibly be interested in as there are atoms in the solar system. It is a physical impossibility to make all of these molecules. So there is a very important question: can this data be relied on to be truly representative?
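The back-of-envelope comparison above can be reproduced in a few lines. This sketch rests on two simplifying assumptions that are not from the talk: that essentially all of the solar system's mass is in the Sun, and that every atom is hydrogen, so the atom count is just the solar mass divided by the proton mass. The constants are approximate textbook values.

```python
import math

SOLAR_MASS_KG = 1.989e30    # approximate mass of the Sun
PROTON_MASS_KG = 1.673e-27  # approximate mass of one hydrogen nucleus

# Assumption: the Sun dominates the solar system's mass and is all hydrogen.
atoms_in_solar_system = SOLAR_MASS_KG / PROTON_MASS_KG

DRUG_LIKE_MOLECULES = 1e60  # the commonly quoted estimate from the talk

ratio = DRUG_LIKE_MOLECULES / atoms_in_solar_system

print(f"atoms ~ 10^{math.log10(atoms_in_solar_system):.1f}")
print(f"drug-like molecules per atom ~ {ratio:.0f}")
```

Under these assumptions the atom count comes out on the order of 10 to the 57, and the ratio on the order of a thousand, in line with the figures quoted.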
The answer to that is: it's an open question, and something that needs to be borne in mind. But nevertheless, the availability of this data has really spurred a lot of use of machine learning. Of course, machine learning methods require robust validation. It's not enough to fit a linear regression on one hundred data points and test on another hundred anymore.
Some examples of data sets that have been used for this in drug discovery are, for example, a database known as the Directory of Useful Decoys, which consists of 102 different protein targets, with around 22,000 ligands spread across those targets and around a million of what are known as decoys. And these are molecules that are believed not to bind to those targets.
The idea being that you now have a large data set that simulates the real-world situation of a large compound library with a small number of binders. And you can use this to test your algorithm, to see whether it ranks the binders more highly than the non-binders. One of the obvious issues with this is potential biases in the identified decoys.
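The ranking test just described can be sketched as below. This is a minimal illustration, assuming higher scores mean "more likely to bind"; the scores and labels are made up, and ROC AUC is just one common way to summarise how well binders are ranked above decoys.

```python
def roc_auc(scores, is_binder):
    """Chance that a randomly picked binder outscores a randomly picked decoy."""
    binders = [s for s, b in zip(scores, is_binder) if b]
    decoys = [s for s, b in zip(scores, is_binder) if not b]
    wins = 0.0
    for b_score in binders:
        for d_score in decoys:
            if b_score > d_score:
                wins += 1.0
            elif b_score == d_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(binders) * len(decoys))

# Hypothetical toy library: two binders hidden among six decoys.
scores = [0.9, 0.2, 0.4, 0.45, 0.1, 0.3, 0.5, 0.6]
is_binder = [True, False, False, True, False, False, False, False]

auc = roc_auc(scores, is_binder)
print(f"ROC AUC = {auc:.2f}")
```

An AUC of 0.5 would mean the ordering is no better than random, and 1.0 would mean every binder outranks every decoy.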
And several people, such as Rohrer and Baumann in 2009, have come up with various ways of ensuring that ligands are embedded next to decoys in chemical space, to make them hard to differentiate.
But it's very much an ongoing area of research. And finally, specific to the task of actually developing a good scoring function, Cheng et al. in 2009 started what's known as the Comparative Assessment of Scoring Functions, or CASF, a set of PDBbind-derived complexes with affinity measurements that you can use to directly measure how well your scoring function predicts binding affinity. And it's sort of become a de facto standard in the field, so it's something that will crop up.
OK, to focus in particular on the use of machine learning to develop scoring functions, because it's something I've worked on over the course of my DPhil and is still very much an active area of research. Just to establish why, in particular, this problem has drawn a lot of attention: the classical scoring functions used in docking are often very good at saying whether a predicted binding pose is good, and at identifying binders over non-binders.
But the energies that they estimate often completely fail to correlate with the actual experimentally observed binding affinity. And so their application in actually predicting affinity is incredibly limited.
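To make "fail to correlate" concrete, here is a small sketch of how that kind of check is typically done, using the Pearson correlation between predicted and measured affinities. The numbers are invented for illustration and do not come from any real scoring function or benchmark.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)

# Hypothetical numbers: predicted vs experimentally measured affinity (pK units).
predicted = [6.1, 5.8, 6.3, 6.0, 5.9]
measured = [4.2, 7.9, 5.5, 8.8, 6.1]

r = pearson_r(predicted, measured)
print(f"Pearson r = {r:.2f}")
```

A value near 1 would mean the predictions track the experiments; values near zero or below, as in this toy case, mean they don't.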
Now, in the last decade or so, starting around 2010, many different machine learning approaches, using all sorts of features and algorithms, have been shown to consistently outperform these classical scoring functions at the affinity prediction task on common benchmark sets such as CASF. And just to emphasise, these all relied on engineered features, such as counting how many pairwise interactions there are between atoms in the protein and ligand, or fingerprints describing those protein-ligand interactions, so there's human engineering in those features.
And in addition to this, a lot of these methods appear to be strongly dependent on the data they're trained on and often generalise poorly to unseen targets, which is not ideal given that in the real world we're trying to screen ligands against a potentially novel drug target.
And the scoring functions, although primarily optimised for predicting affinity, have been applied to the virtual screening classification task of identifying binders. But again, they underperform on an unseen, novel target. And finally, quite an important point here is that most of these studies have relied on training and validating using only experimentally determined binding poses of ligands, determined by crystallography.
And only a few have explored how models can be expected to perform on docked poses, even though in reality, in a virtual screening campaign, you don't have crystal structures of all of your protein-ligand complexes. Because if you did, you'd be fine; you wouldn't need to screen them. And this leads on to some of the work I did during my DPhil, and the first thing I'd like to just briefly mention is that one of the things we looked at was aggregating structure-based and ligand-based methods using random forests.
And just a brief illustration, don't worry about the details: the solid lines indicate the correlation obtained by a method combining structure-based and ligand-based information, and the dotted lines indicate the corresponding method using only the structure-based information.
We found that regardless of how you train and validate the model, a model using both structure-based and ligand-based information was consistently superior at predicting protein-ligand binding affinity. However, regardless of the features and algorithm used, the same thing held again: as mentioned previously, the similarity between your training and validation data had a strong influence on your model performance. This is a problem that clearly needs to be addressed.
One of the common criticisms of machine learning is that it has a somewhat earned reputation as a black box. It's not entirely true, though. An advantage of the random forest algorithm, for example, is the ability to actually look at the importance of each feature in the model.
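As a rough illustration of what "looking at the importance of each feature" means, here is a sketch using permutation importance. To be clear about the swap: random forests expose their own built-in importance scores, whereas this is a model-agnostic stand-in that needs only the standard library. The `predict` function is a hypothetical, already-trained toy model, and the three features are placeholders, not the features used in the work described.

```python
import random

def predict(row):
    # Hypothetical "trained model": leans heavily on feature 0,
    # a little on feature 1, and ignores feature 2 entirely.
    return 2.0 * row[0] + 0.5 * row[1]

def mse(rows, targets):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, n_repeats=10, seed=0):
    """Average increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(rows, targets)
    importances = []
    for j in range(len(rows[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increase += mse(shuffled, targets) - baseline
        importances.append(increase / n_repeats)
    return importances

# Made-up data: 30 examples with 3 features each.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(30)]
targets = [predict(r) for r in rows]

importances = permutation_importance(rows, targets)
for name, value in zip(["feature 0", "feature 1", "feature 2"], importances):
    print(f"{name}: {value:.3f}")
```

The feature the model ignores comes out with zero importance, and the one it leans on most comes out on top, which is the kind of readout that lets you see what a model is actually relying on.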
And indeed, in our work, what we found when we inspected this was that, again, regardless of how you're training the model, both ligand-based and structure-based information was consistently found to be important in making these predictions. So on the right, the red and yellow bars are the importance of structure-based features, and the blue bars are the importance of ligand-based features. No matter how you train the model, you consistently see this combination of features being important, suggesting that they're capturing useful orthogonal information.
Another thing that we looked at was this problem of, well, how do we do in the real world on docked poses rather than crystal poses, allowing for the fact that the poses might not necessarily be that accurate? And again, don't worry about the details, but on the right, the solid lines correspond to a model that was trained and validated using experimentally determined poses.
The dotted line is the same model, trained and validated using docked poses, some of which were really good, some of which were not so good. And again, what jumped out to us was, firstly, regardless of how you trained and validated the model, the model using crystal poses was always performing better than a model using docked poses, sometimes by very little, sometimes by quite a lot.
So it clearly has an impact on the model: using crystal poses gives you an overly optimistic estimate of how you're going to do in the real world. But in addition to this, what we also found was that the relative drop-off in performance of a hybrid method that was using both structure-based and ligand-based information was much smaller when using docked poses than that of the model using only structure-based information, which intuitively makes sense: information about the ligand is independent of the pose.
So it appears that it helps to actually compensate for errors introduced by using these imperfect poses. And indeed, we had a look at what happened when we provided multiple training examples of the same ligand in different poses: does seeing different poses help or hurt the algorithm? And what we actually found was, one, the performance of the model, regardless of how it was trained, consistently dropped when it was given multiple examples of poses for a ligand.
But also, when this happened, the ligand-based features in the model became far more dominant in its ability to make predictions. So this corroborates the idea that when you have noise introduced through docking errors, leveraging ligand-based methods that you know work really can help to recover your performance. And, you know, this illustrates the advantage of using more interpretable algorithms that really allow you to drill into the model and see what's going on.
Why is your performance being affected? So what we've observed, and others have found, is that your performance often depends greatly on the target of interest, and that generalising to novel data can be incredibly challenging, even if your model appears to perform well, which is really quite damning when you think about the real-world application to a novel drug target.
What we also found was that the performance gain on standard benchmarks like CASF when you add more training data could often be attributed not to having more data, but to the data you add being similar in some way to the data in that benchmark, say, different ligands binding against the same protein. And as soon as you remove that similar data, even if you have more data, you're back to where you were with the smaller data set.
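The kind of check described above, removing training data that resembles the benchmark, can be sketched like this. Matching on an exact protein identifier is a deliberate simplification for illustration; real analyses would more likely use sequence or structural similarity thresholds. All identifiers here are made up.

```python
def remove_benchmark_proteins(training_set, benchmark_set):
    """Drop training complexes whose protein also appears in the benchmark."""
    benchmark_proteins = {c["protein"] for c in benchmark_set}
    return [c for c in training_set if c["protein"] not in benchmark_proteins]

# Hypothetical records: the protein and ligand IDs are placeholders.
benchmark = [{"protein": "P1", "ligand": "L1"}]
training = [
    {"protein": "P1", "ligand": "L2"},  # same protein as the benchmark, different ligand
    {"protein": "P2", "ligand": "L3"},
    {"protein": "P3", "ligand": "L4"},
]

filtered = remove_benchmark_proteins(training, benchmark)
print(len(training), "->", len(filtered), "training complexes")
```

Retraining on the filtered set and comparing benchmark performance is what tells you how much of the gain came from similarity rather than from having more data.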
So it's just sort of an artificial performance gain that's masking what's really going on. And I'd just like to sort of leave this there, but this is a really glaring problem, and one that clearly can't be addressed by simply getting ever better on a standard benchmark. And it's still very much an active area of research. So, I'm conscious that we started a little bit late.
So I'd like to just quickly introduce some recent developments in applying deep learning to drug discovery, both, again, to this virtual screening task, but also some really interesting ideas about molecule generation that have been enabled by deep learning techniques. So clearly we're not all the way there: even machine learning scoring functions depend on well-engineered features, and even good features can introduce human biases.
What we'd ideally like to be able to do is find some way of taking a raw representation of the data, without human bias, and getting the model to learn for itself what a molecule looks like, what an interaction looks like, what a bad interaction looks like. And this is the actual application of deep learning: you know, you let the model engineer features for itself in a hierarchical manner.
There have been all sorts of applications of deep learning to tasks in drug discovery: solubility prediction, toxicity prediction, predicting reaction outcomes for synthesis planning (again, going back to the idea that it doesn't matter how good the ligand is if you can't synthesise it to test it), molecular design using reinforcement learning, so iteratively modifying the molecule to make it better, and also improving docking and virtual screening results.
And in the last few minutes, I'm just going to talk about some recent work from the Oxford Protein Informatics Group that's touched on applying deep learning to drug discovery. The first work that came out of the group was a piece of work by a former member, Fergus Imrie, in 2018, where they used a convolutional neural network.
The idea is that you can take a 3D structure of a protein-ligand complex and split it up into different atom types, for example, you know, where the aliphatic carbons are in the protein and the ligand, and generate from that sort of voxel maps of the density of these atoms in that structure. And then you can treat these voxel maps as analogous to colour channels in an RGB image. So in an RGB image, you have three channels, one for red, one for green, one for blue, that build up the whole image.
Here you can have these maps of where different atom types are, that together represent a full protein-ligand complex. And with this sort of 3D representation, you know, you now have a representation of the data that feeds quite naturally into the convolutional neural networks that have been applied successfully in computer vision tasks: image recognition, video processing, things like this. And this was shown by David Koes in 2017 to be quite effective for virtual screening.
The piece of work that Fergus Imrie did was taking these and applying architectural advancements from computer vision, in this case using densely connected layers in the network, to see if this improved your ability to screen compounds, as it had been shown to improve the ability to classify images.
And indeed, they found that by introducing these densely connected blocks, exactly as you would in computer vision, you immediately got an improvement in your virtual screening performance, suggesting that this task does work as a computer vision task.
However, some early analysis of these sorts of CNNs revealed that oftentimes the CNN was actually just using the channels that represent the ligand to make its predictions; you know, it's implicitly learning biases about ligands, even though it's not been given engineered information about them. And it wasn't actually using the structure of the protein.
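To make the channel idea concrete, here is a sketch of how atoms might be turned into one voxel grid per atom type, with separate protein and ligand channels. It is a simplification under stated assumptions: it counts atoms per voxel rather than computing a smoothed density, and the channel names, grid size and coordinates are all hypothetical.

```python
def voxelise(atoms, channels, grid_size=8, resolution=1.0):
    """Count atoms per voxel, with one 3D grid (channel) per atom type.

    atoms: list of (atom_type, x, y, z) with coordinates in the grid's frame.
    Returns a nested list indexed as grid[channel][ix][iy][iz].
    """
    grid = [
        [[[0.0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
        for _ in channels
    ]
    channel_index = {name: i for i, name in enumerate(channels)}
    for atom_type, x, y, z in atoms:
        ix, iy, iz = (int(c // resolution) for c in (x, y, z))
        in_bounds = all(0 <= i < grid_size for i in (ix, iy, iz))
        if atom_type in channel_index and in_bounds:
            grid[channel_index[atom_type]][ix][iy][iz] += 1.0
    return grid

# Hypothetical channels and atoms, analogous to the R, G and B of an image.
channels = ["protein_C", "ligand_C", "ligand_N"]
atoms = [
    ("protein_C", 1.2, 3.4, 5.6),
    ("ligand_C", 1.7, 3.1, 5.9),
    ("ligand_N", 9.0, 0.0, 0.0),  # falls outside the 8 x 8 x 8 box and is skipped
]

grid = voxelise(atoms, channels)
print(grid[0][1][3][5], grid[1][1][3][5])
```

Because protein and ligand atoms live in separate channels, a network can in principle ignore the protein channels altogether, which is exactly the failure mode described above.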
Some work by a current member of OPIG, Jack Scantlebury, however, showed that by augmenting their training data, taking their known binders, repositioning the ligand in obviously unphysical poses and labelling these as non-binders, you force the network to classify them as incorrect and thereby force it to use the structure of the protein to differentiate between these two different poses of the bound ligand.
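The augmentation idea can be sketched roughly as follows. How the unphysical poses were actually generated in that work isn't covered here, so the random rigid-body move below is an assumption for illustration, and a real pipeline would also need to check that each new pose genuinely is unphysical.

```python
import math
import random

def unphysical_pose(ligand_coords, rng, max_shift=10.0):
    """Rigidly rotate (about the z axis, for brevity) and translate a ligand."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return [
        (cos_a * x - sin_a * y + shift[0], sin_a * x + cos_a * y + shift[1], z + shift[2])
        for x, y, z in ligand_coords
    ]

def augment(binders, n_decoy_poses=2, seed=0):
    """Keep each known binder (label 1) and add repositioned copies (label 0)."""
    rng = random.Random(seed)
    examples = []
    for protein, ligand in binders:
        examples.append((protein, ligand, 1))
        for _ in range(n_decoy_poses):
            examples.append((protein, unphysical_pose(ligand, rng), 0))
    return examples

# Hypothetical two-atom "ligand" bound to a placeholder protein.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
examples = augment([("protein_A", ligand)])
print([label for _, _, label in examples])
```

The ligand itself is identical in the positive and negative examples; only its position relative to the protein changes, so the ligand alone can no longer give the answer away.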
So what they found was that if you don't do this augmentation, on the right, you basically get the same result whether you use the structure of the protein or not. However, in this figure in the middle, what they find is that if you do this data augmentation, you force the model to actually use the information about the structure, and your model generalises a lot better.
And so what this looks like: in the third image on the bottom row here, we have a ligand sitting in the active site of a target, and the coloured parts of this image correspond to parts of the structure that were masked during the test process, so that the model wasn't able to use them. The areas in green contributed favourably to the prediction, and the areas in red unfavourably.
What we see is that, for the network trained with augmentation, the favourable hydrogen bonds indicated by the yellow dots fall in the green areas, contributing positively. So it's learning to make the prediction for the right reason. However, in the other images, where you're not using this form of data augmentation, you're still getting correct predictions, but you're making these predictions for the wrong reasons.
You're not actually learning that those interactions are there. So clearly, this is a key component of training these models. And again, conscious that we started a little bit late, I'm going to gloss over this a little bit at the end, but something that I've been working on recently with a Part C student, Oliver Turnbull, in the department is this idea of, well, OK, we've seen that this sort of data augmentation helps your model generalise for virtual screening classification.
Can we leverage that to perform better at a regression task? Now, in regression, it's not clear how you would label a non-physical pose, because you can't just label it as binder or non-binder. You have to give it a binding affinity value, and it doesn't make sense to assign that to a non-physical pose.
So how you do that data augmentation is not clear, but what you can do is use transfer learning: take the model that was trained for the screening classification task, that's acquired that generalisability, and then fine-tune the final layer using a regression dataset such as PDBbind, to train that final layer for the regression task and see if you actually retain the benefit of that data augmentation. And so this is something that Oliver looked at recently.
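In spirit, fine-tuning only the final layer looks something like the sketch below. Everything here is a toy stand-in: `frozen_features` plays the role of the earlier layers of the pretrained classification network, with weights that are never updated, and a new linear output layer is trained on made-up affinity values by plain gradient descent. None of the numbers or names come from the actual work.

```python
def frozen_features(x):
    # Stand-in for the pretrained network's earlier layers: fixed weights,
    # never updated during fine-tuning (hypothetical values).
    return [max(0.0, 0.5 * x[0] - 0.2 * x[1]), max(0.0, 0.3 * x[1] + 0.1)]

def mean_squared_error(feats, targets, w, b):
    preds = [sum(wi * fi for wi, fi in zip(w, f)) + b for f in feats]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def fine_tune_final_layer(inputs, affinities, lr=0.05, epochs=2000):
    """Train only a new linear output layer on top of the frozen features."""
    feats = [frozen_features(x) for x in inputs]
    n = len(feats)
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for f, y in zip(feats, affinities):
            err = sum(wi * fi for wi, fi in zip(w, f)) + b - y
            for j, fj in enumerate(f):
                grad_w[j] += 2.0 * err * fj / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Made-up regression data: inputs and binding affinities (pK units).
inputs = [[0.2, 1.0], [1.5, 0.3], [2.0, 2.0], [0.8, 0.1], [1.1, 1.7]]
affinities = [5.1, 6.4, 7.0, 5.9, 6.2]

feats = [frozen_features(x) for x in inputs]
error_before = mean_squared_error(feats, affinities, [0.0, 0.0], 0.0)
w, b = fine_tune_final_layer(inputs, affinities)
error_after = mean_squared_error(feats, affinities, w, b)
print(f"error before: {error_before:.2f}, after: {error_after:.2f}")
```

The point of the design is that whatever the earlier layers learned during classification, including anything gained from the augmentation, is carried over untouched; only the mapping from those features to an affinity value is learned from the regression data.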
And so what we had a look at is, if you perform the same masking process using a network that was fine-tuned from Jack Scantlebury's network, you get something like on the left, where we have a ligand that we know binds with a certain binding affinity. And again, we can mask atoms, and atoms that appear as green here are those that contributed favourably to the final affinity prediction, and things that are in red are those that contributed unfavourably.
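The masking process can be sketched as below: remove one atom at a time, rescore, and see how much the prediction changes. The scoring function here is a hypothetical toy standing in for the trained network, and the atom records are invented for illustration.

```python
def masking_attribution(atoms, score_fn):
    """Per-atom contribution: score of the full complex minus score with that atom masked."""
    full_score = score_fn(atoms)
    return [full_score - score_fn(atoms[:i] + atoms[i + 1:]) for i in range(len(atoms))]

def toy_score(atoms):
    # Hypothetical stand-in for the network: rewards hydrogen-bonding atoms
    # and penalises clashing ones.
    return sum(1.0 for a in atoms if a["hbond"]) - 0.5 * sum(1.0 for a in atoms if a["clash"])

# Made-up ligand atoms.
atoms = [
    {"name": "O1", "hbond": True, "clash": False},
    {"name": "C7", "hbond": False, "clash": True},
    {"name": "C2", "hbond": False, "clash": False},
]

attributions = masking_attribution(atoms, toy_score)
for atom, value in zip(atoms, attributions):
    colour = "green" if value > 0 else "red" if value < 0 else "neutral"
    print(atom["name"], value, colour)
```

A positive value means the prediction drops when the atom is masked, so the atom was contributing favourably; a negative value means the opposite.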
What we see on the left is that when we fine-tune a model that's had that data augmentation applied, the model's correctly learning where important hydrogen bonds are. It's clearly rewarding those interactions being present and penalising parts of the molecule that don't contribute to those sorts of interactions. In the other one, we just trained the same neural network from scratch purely for affinity prediction, with no data augmentation.
What we see there, even though we got a correct prediction for this compound, is that actually there's no real rhyme or reason to which atoms or parts of the protein the network thought were important for predicting binding affinity. For example, up here, what should be an important bond is marked as completely unfavourable, and down here, what should be an important patch on the protein surface, again, is not contributing strongly.
So clearly, this network can actually retain some of that information from the virtual screening data augmentation process when you fine-tune it for this regression process. This seems to be a really promising lead for actually improving your ability to predict binding affinity in this virtual screening setting. So I was going to talk a little bit about another line of work, but I want to make sure we stop in reasonable time.
So I'm going to gloss over that. But if you want to ask about it, feel free. Just to sort of emphasise what we've learnt so far from all the experiments that people have done on this topic: machine learning methods at this stage are ubiquitous in drug discovery, and they often outperform traditional methods, either in accuracy or in accomplishing the same task faster and cheaper.
For example, virtual screening versus high throughput screening, computational synthesis planning versus having a chemist plan out the synthesis of every single compound.
And although they have a reputation as uninterpretable black boxes, actually using appropriately chosen algorithms or sensible approaches to exploring your data can make your model quite interpretable, giving you insight into how predictions are being made, and whether the model is actually learning the underlying biophysics or simply spurious correlations in the data.
And on that note, deep learning, provided you perform careful data augmentation and training of the model, can enable virtual screening with a model that learns directly from the underlying data what favourable interactions look like, without inheriting human biases from engineered features, regardless of how sensible they might be.
And something I'd like to mention, even though we don't have time to talk about it, is generative models such as graph gated neural networks, which are used to build up graphs. You can represent a molecule as a graph and use something like a graph gated neural network to rapidly elaborate from that graph to generate possible new molecules, which gives you a new way of exploring chemical space.
Coming back to this idea that chemical space is impossibly vast, it gives you another effective tool for exploring chemical space in a way that a chemist might not be able to on paper.
All of this work also raises some important questions and challenges. As I alluded to when I spoke about the size of chemical space, there is this question of: is our available data sufficient for our purposes, and sufficiently representative of chemical space, to actually allow us to train truly generalisable models that can perform these sorts of tasks without inheriting human biases?
Another important question that sort of underpins all of this is: can we currently rely on protein-ligand docking to generate binding poses that are accurate and useful enough to enable all of the machine learning models to work effectively? A common maxim in machine learning, of course, is garbage in, garbage out. This most certainly applies here.
Another very important question, which I alluded to previously, is molecular dynamics. All of these methods for virtual screening and predicting binding affinity fundamentally operate from a static snapshot of the protein-ligand complex, a single docked pose. But in reality, molecular interaction is a very dynamic biological process. You know, things are wobbling around. And so does it really make sense to expect to be able to solve this problem using a single static snapshot?
Or do we need to explore these dynamic processes more? And finally, we've seen a lot of promising work with deep learning methods being able to not only screen molecules, but indeed generate new molecules. So can we expect deep learning methods to fully remove the need for slow and costly human involvement in things like design and synthesis planning? Or do we still need the expert human on hand in order to guide this process?
I'd just like to wrap up there. So I'd like to thank the entirety of the Protein Informatics Group, which you can see all looking very professional on the right-hand side, but in particular former and present members Fergus Imrie, Tom Hadfield, Jack Scantlebury and Oliver Turnbull, whose work underpins a lot of the things I've spoken about today, particularly on the deep learning side of things. And thanks to Garrett and Beverley for inviting me to speak today.
And thanks to all of you for listening to me. So I'd like to leave it there. And we have some time available, so I'd like to answer any questions you might have. Thank you.
OK, thank you very much, Fergus. That was wonderful, and I invite everyone to use your yellow or whatever preferred colour and symbols to indicate your appreciation. I will stop recording now.
