Veridical Data Science for biomedical discovery: detecting epistatic interactions with epiTree

00:07

This is today's instalment of our Department of Statistics distinguished seminar series. I'm very glad to introduce to you today a most distinguished speaker. Bin Yu is Chancellor's Distinguished Professor and Class of 1936 Second Chair in the Departments of Statistics and EECS at UC Berkeley. She has won many accolades and prizes, amongst which she is a member of the US National Academy of Sciences.

00:38

She is also a member of the American Academy of Arts and Sciences, a past president of the Institute of Mathematical Statistics, and a Guggenheim Fellow. And fortunately for the UK, she is also on the Scientific Advisory Committee of the Turing, our Alan Turing Institute for Data Science and AI.

00:56

She was formally trained as a statistician, but her research now extends way beyond the realm of statistics. As you will see in her talk, her work has leveraged new computational developments to solve important scientific problems by combining novel statistical machine learning approaches with the domain expertise of her many collaborators in neuroscience, genomics and precision medicine. Over to you. Thank you. Thank you for the very kind introduction.

01:27

It's a pleasure to be here. So, as was mentioned, I will share some of our work. It's this framework which we call PCS, which has really developed over the last 10 years and started with a paper by myself on stability. And what I will share now is this framework of veridical data science.

02:00

And actually, it's not just for biomedical research, but the particular example I use as a case study comes from an attempt to look at cardiovascular diseases, looking for gene interactions, or epistatic interactions. So if some of you are wondering what I mean by veridical: I didn't know either. This term was suggested by our colleague Gelman from Columbia. I had to look it up; it actually means truthful. So it means coinciding with reality.

02:32

And we kind of liked the name and we took it, because my paper had a title like "Three principles of data science", which is super long. And, you know, machine learning people would have something short and sweet. So the paper is now called "Veridical data science". So a lot of the problems my group has been working on are biomedical.

02:55

You know, I did my had this huge, ah, four fold advance recently and you will see well get organised to really look at the electronic health data and we'll have data break. So have another inside track you're already putting praising the file. You see hospitals had a meeting recently. Now two years ago and everybody bias's all of that is part of precision medicine. However, I would make the case that AI is like nuclear energy, both promising and dangerous.

03:28

It's part of life and holds a lot of possibilities for us. But I think if we're not rigorous and careful, it can also bring a lot of harm. Data science is often under the hood of many AI devices. Right, there's a lot more to AI than just data science, but it's like the heart or the brain of a lot of AI devices. And data science, like machine learning, sits at the interface of computer science, statistics and maths.

04:02

And that goes with domain knowledge. This is one diagram describing data science. Many of us kind of agree with this: data science tries to combine data with domain knowledge to make decisions and generate knowledge in the context of a particular problem. To formally define veridical data science, let's say that veridical data science extracts reliable and reproducible information from data.

04:30

But I also want to emphasise that we need to really develop an enriched technical language to communicate and evaluate empirical evidence in the context of human decisions and domain knowledge. We need new terms, but the new terms should correspond to reality in different problems, so we can use the language to communicate while really talking about reality, not making something up. And the goal is really to realise the promise.

04:55

And to mitigate the dangers of AI, or data science, or machine learning, or statistics. So my view of data science: I think it's best if we think about data science as a whole process, not disconnected steps, and think of it almost like a hardware system. Then the concepts I describe will be very easy to understand. Of course, there's no data science without people. So here's a photo of some of my group members. You start with a domain question and data collection.

05:29

Well, or data selection of other kinds, because there are a lot of public repositories that we can access. But then you have less information on the collection, and a lot of decisions are made here: how you collect and how you clean. And then we might also have to clean the data after we gather it, either from collectors or from databases, and then there is data exploration. And a lot of the emphasis has been put on modelling and algorithms, which are very, very important.

06:01

But all the other steps are equally important. Even data cleaning: most of us know that data cleaning can take up to 70, 80 percent of our time, but we rarely talk about it. Problem formulation: why is this problem translated into testing whether a coefficient in a linear regression is zero? That's a judgement call. The domain problem doesn't come with that name; you make a decision. And then you do post hoc analysis.

06:27

You communicate and you update domain knowledge, and many of these steps are non-linear. Maybe I do data cleaning while I formulate the problem, and I go back. It's very non-linear, back and forth. And what's missing in this system, if we borrow ideas from quality control, which has been very successful, is that we should standardise the process so that the results become more reliable and transparent.

06:56

So for the rest of the talk, I will introduce the PCS framework for veridical data science and then go to a case study in biomedical research, where we develop epiTree for epistatic interactions. The PCS framework was developed really through my work with my collaborators, and the paper came out last year with my former student Karl Kumbier, now at UCSF. It really tries to follow up on Breiman's two cultures thinking and integrate the two cultures instead of

07:33

separating them, and to streamline and unify many good practices by many groups, including my group, putting them in a conceptually and philosophically coherent framework that other people can adopt and take advantage of. So, predictability, which has been at the heart of machine learning. Of course, in statistics we thought about prediction too, but it was never central; I think a lot of people don't think about prediction. And computability.

08:06

Of course we did computation, but machine learning put it at the forefront. And stability: it's really an expansion of the concept of uncertainty from sample-to-sample variability to a much bigger scope, so that we can deal with data perturbation, data cleaning perturbation, and problem formulation perturbation. And it specialises to classical statistics: if we actually have a very well justified probability framework, something like randomised trials, then we can go back,

08:38

but even there you will have to clean the data. So it's not completely statistical inference; you really need to worry about the important steps that were taken. So this is intended to unify, streamline and expand on ideas and best practices from machine learning and statistics. On stability: if you look at my group's projects, we did worry about stability, but it was not as consistent. Once we formulated a framework, things became, I would say, more coherent and consistent across different subgroups in my group.

09:12

And nothing of what you see is, as I said, entirely new; it is really building on a lot of past work. Data science to me is both science and engineering. Predictability: in Popper's philosophy of science it is a really important step for falsification, and it also tries to represent the replication idea from scientific research. Computability includes scalability, convergence,

09:39

all of that from computer science, but computability also includes data-inspired simulations, which are, I think, underused in statistics. We should use the data to come up with things close to reality and then simulate from those models. And as for stability, I embarked on realising and promoting stability when I had the opportunity, almost 10 years ago, to give the Tukey lecture of the Bernoulli Society. So, to come back:

10:10

What I was seeing was the instability of the Lasso in a neuroscience problem, and Tukey created robust statistics. So stability is a generalisation of robustness. Robustness has a particular meaning, so I use stability. And also, in applied maths and dynamical systems, people use stability. So really, you have a perturbation, you justify it through documentation, and then you look at the result and want the stability measure to be within a tolerable threshold.

10:38

So it's about interpretability in that way: without stability, you shouldn't interpret your result. And reproducibility. And now it has been expanded to scientific recommendations. So I've been working with scientists to reduce the scope of the experimental space for follow-up experiments, and possibly, in causal inference, for intervention design. To be a little more specific, in 2013 I defined stability through this paragraph: reproducibility is imperative for any scientific discovery.

11:16

More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to reasonable perturbations to the data and to the model used. And of course "reasonable" is the heavy-duty word there, which documentation has to back up. Coming back to this system view of data science, predictability is really a reality check.

11:46

We shouldn't try to interpret anything unless we know key structures of the data are being captured. There are many talks where people talk about hypothesis testing without addressing why you can interpret the model. And stability is similar: you try to shake the system, take different parts out and put in another part that you could just as well use, and check that the system doesn't break. It's really very common sense.

12:14

Those principles are very common sense: reality check and robustness. So PCS consists of two parts: the PCS workflow and also the PCS inference. The workflow really tries to apply the same principles at every step of the data science lifecycle, recognising that we make so many judgement calls, human judgement calls, in the process. So let's make them transparent, and maybe perturb them a bit to make sure the system still works.

12:50

So P, at the problem formulation stage, really means future, new data. So we should always keep in mind that when we develop a model, it's not just about the data we have; it's about some future applications and future diagnoses. Keep that in mind and try to bring that in. Computation is everywhere, right? You have the data, you use a computer. And stability even includes language stability. I had the opportunity to talk to a cancer expert from Lawrence Berkeley Lab.

13:24

She has this concept, in biological terms, of a matrix in her microenvironment theory of cancer, which actually took 30 years for people to accept: that cancer is not just something genetic, it's really about the microenvironment. And I know we use matrix for a table of data, and I tried to explain to her that her matrix was different from my matrix.

13:50

I don't know whether it came across, but of course it was a social occasion, so I was not working that hard. But that's something we have to make sure of: in the context of both cancer research and data analysis, we have to make sure matrix means the same thing to both of us when we speak about it. And now we come to the PCS inference. It really tries to expand

14:12

the status quo inference and use perturbation as a basic concept instead of a probability distribution, and make it transparent and pretty algorithmic. Again, it specialises to statistical inference: if the probabilistic model is valid, then we go back to classical inference, provided data cleaning is not an issue. So, data perturbations: that's where the whole of PCS really started, with this very in-depth collaboration I had with Jack Gallant's lab.

14:46

So about 10 tools, six took a whole year off from a statistics position and sat in Jack Gallant's neuroscience lab to get into neuroscience, and we were working on the movie reconstructions together. And after that, we wanted to interpret the model. This is basic science: we have to really see which features drive a particular fMRI voxel's responses. And then I was like, well, they recorded data for an hour and fifteen minutes.

15:16

I just thought it was weird. Why an hour and fifteen minutes, not two hours, whatever? And then I asked: if I take ten minutes of the data away, basically some data perturbation, do I get the same result? And then, at the last step, with the model being the Lasso, after a couple of features things were just not stable. So that's basically the first step towards what I said: this is disconcerting. So we developed ESCV.
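The check described here (remove a stretch of the recording and see whether the selected features change) can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the covariance-based ranking in `top_features`, and the names `selection_stability`, `block` and `n_perturb` are all hypothetical, and the ranking stands in for the Lasso fit used in the actual study.

```python
import random

def top_features(rows, k):
    """Rank features by absolute covariance with the response; return the top k indices."""
    n = len(rows)
    p = len(rows[0][0])
    y_mean = sum(y for _, y in rows) / n
    scores = []
    for j in range(p):
        x_mean = sum(x[j] for x, _ in rows) / n
        cov = sum((x[j] - x_mean) * (y - y_mean) for x, y in rows) / n
        scores.append((abs(cov), j))
    return {j for _, j in sorted(scores, reverse=True)[:k]}

def selection_stability(rows, k, block, n_perturb, seed=0):
    """For each feature, the fraction of perturbed datasets in which it is selected.

    Each perturbation removes one contiguous block of observations,
    mimicking "take ten minutes of the data away".
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_perturb):
        start = rng.randrange(0, len(rows) - block + 1)
        kept = rows[:start] + rows[start + block:]
        for j in top_features(kept, k):
            counts[j] = counts.get(j, 0) + 1
    return {j: c / n_perturb for j, c in counts.items()}

# Hypothetical toy data: 75 time points, 5 features, only features 0 and 1 matter.
data_rng = random.Random(1)
rows = []
for _ in range(75):
    x = [data_rng.gauss(0, 1) for _ in range(5)]
    y = 3.0 * x[0] + 2.0 * x[1] + data_rng.gauss(0, 0.1)
    rows.append((x, y))

stability = selection_stability(rows, k=2, block=10, n_perturb=50)
print(stability)
```

A feature whose selection frequency stays close to 1 across perturbations is a candidate for interpretation; one that comes and goes is not.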

15:41

ESCV tries to put stability on top of cross-validation, and we got much smaller models for the voxels, in case you want to intervene and put some different stimuli there. And then came the opportunity to give the Tukey lecture, so I started thinking about this. Actually, we have so many different forms of data perturbation, and we usually do only one. And if we do all of them, do we get the same result?

16:12

And the new form of data perturbation we want to include in the PCS framework is synthetic data generated through a mechanistic PDE model; that becomes an admissible form of data perturbation. Within a probabilistic framework, of course, you can also still have a stochastic differential equation. But you could also say the data come from a different kind of model. You can have a stochastic model based on one proper PDE.

16:40

You might have an alternative PDE. Then you can bring that in without using a bigger PDE model that includes both. And then there is the choice of data modality, right? You can have video data and audio data. In PIAAC, a world organisation sends people to different countries to look at the skills of each country's workforce. So the interviewer can go to someone's house and administer the test by audio or by video. Would you get the same data? And differential privacy is a form of stability.

17:15

And the most noted form, if you pay attention to deep learning, which is hot now, is adversarial attacks: you can completely change a medical diagnosis by doing some adversarial attacks on the medical images. So all of this can be viewed as different forms of perturbation, and we can unify them. And then, depending on the purpose, we can choose different forms. And of course, robust statistics was Tukey's attempt to look at different models, to look at the centre of a distribution.

17:46

You want the centre statistic to be stable for long-tailed distributions as well as the Gaussian. How about different optima? With non-convex optimisation you probably end up with different modes. And sensitivity analysis in Bayesian modelling is also a form of perturbation. People are aware of that, but we don't really do it very much. And human judgement calls: researcher-to-researcher perturbation, right? What if we take four or three faculty colleagues from Oxford and give them the same problem?

18:18

Do you guys get the same results, at least qualitatively? We don't even try. So now, with all the software availability, I try to promote that. And the climate scientists are already doing that: you always see something like nine different climate models, and for the global temperature prediction you have a whole interval. So they kind of already do perturbation analysis. And I just want to add that we need to make all our human judgement calls

18:46

transparent. There is no magic: you document them, you make an argument, and this is documented. So this is just a list of things we do, things we choose. All of that has to be documented. So you have models, which are mental constructs, and you have reality. They don't have to connect. Just because you write down X, suddenly X means a gene? So documentation is a bridging step.

19:18

So you have to make a bridge with quantitative and qualitative narratives in, say, an R Markdown or Jupyter notebook, and just explicitly write down the judgement calls, so people can judge whether they trust your judgement calls or not. Otherwise, we're not on solid ground. So people often ask me how to choose perturbations in the PCS framework. If you did all of the perturbations, you would never go home; you can just keep perturbing everything.

19:51

So it's not feasible. But if you pledge to the stability principle, then you wouldn't want to do that. Naturally, you want to think harder about which perturbations to use, because with enough perturbations you won't have anything consistent. So this just requires you to document, make the case, and choose your perturbations carefully. And that's part of your evidence. So let's come to the part which I will follow up in the epiTree project, which is inference.

20:28

Right. You can do prediction and stability analysis, but people then want to have a measure of the strength of the evidence, a role traditionally played by the p-value. In all the work I have done, we statisticians never had the decision power. We provide evidence to help experts make decisions. Maybe the FDA has that, but the FDA probably also has medical experts agree on certain procedures.

20:56

So the key for most of us is really to provide data evidence in a transparent manner so that experts can understand it and make decisions, maybe with us. And the p-value definitely has problems, as we all know. So if I take a very critical look at the probabilistic statements in statistical inference, you see that

21:23

I definitely did this for many, many years, if not decades: go to my statistics classes, start with a random variable and just say this random variable represents my data, without a second thought. But if you take a step back, which later I did, saying the data is a realisation of a random process is itself an assumption, even when we just talk about random variables.

21:48

If you only care about the data you have, you don't really need a random variable, because the data is whatever it is. A random variable is really meaningful when you want to think about two realisations of the same random variable and say they come from the same source. It's a way to tie two things together, and that's implicitly assuming stability. Otherwise, you don't need a random variable; you only have the data to describe, to tabulate.

22:12

You don't need a random sample. So when we use a random variable, we are already bringing in some assumptions. But when this assumption is not substantiated, all the probabilistic statements become questionable, and often the model structure is not right. Now, we only deal with assuming the model structure is correct and then give quantitative measures of uncertainty for the parameters, but seldom provide a measure of model bias.

22:41

And I definitely think we need to move away from the term "true model", because you tell students about a true model and then say it is not true. It's better to use "approximate" or "postulated" model, just for consistency with what we really do. So PCS inference really tries to mitigate some of the problems I was just mentioning. We used to do diagnostics, right? For regression, any classical book on applied regression analysis has a chapter on diagnostics.

23:18

But with high-dimensional data it becomes very difficult, so we don't do it and jump to inference right away. So I think it is good to bring in predictability, or prediction, which machine learning made very popular: just use prediction as model checking. And then stability: we want to expand sample-to-sample variability to include, say, formulating the problem of which genes are important in two different ways.

23:48

The Lasso and random forests each have an importance measure, and I can put them together: the rankings become comparable, so we can go beyond the same class of models to talk about some measure of uncertainty. And computation: of course, this requires computation. One frontier of PCS research is actually how to make stability analysis streamlined so that it is efficient. And the goal is really that you can move away from always thinking about

24:20

probabilistic statements when we don't have a justified probabilistic model. When we do, of course, let's use it. But often we just use the probabilistic model as a surrogate, to get something going. And then I think it's better not to use the probabilistic frame, and to use perturbations, which are transparent.

24:39

So to be specific, you can formulate the problem in multiple ways, and then you have a target of interest which is comparable across the different formulations: say, the ranking of genes from the Lasso and the ranking of genes from random forests, then they become comparable. And then we do a sample split into training and test sets, which takes a lot of thinking if you're not in the i.i.d. case. And then we screen the models based on predictability.

25:09

The ones that don't pass the screening are dropped; for the screening we can maybe use cross-validation, or set a validation part aside within the training data. Then, for those that pass the screening, you start to worry about PCS inference: you use documentation to argue for appropriate data perturbations and other perturbations, and then formulate a perturbation interval for reporting, which can be useful.
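The screening-then-perturbation steps just described can be sketched as follows. This is a minimal illustration, not the procedure from the paper: the accuracy values, the estimates, the 0.7 threshold and the function names are hypothetical, and the interval is a plain nearest-rank quantile range over the pooled estimates.

```python
def screen_by_predictability(val_accuracy, threshold):
    """Keep only the models whose held-out accuracy reaches the threshold."""
    return [name for name, acc in val_accuracy.items() if acc >= threshold]

def perturbation_interval(estimates, lower=0.05, upper=0.95):
    """Nearest-rank quantile range of a target estimate across perturbations."""
    ordered = sorted(estimates)
    n = len(ordered)
    lo_idx = min(n - 1, max(0, round(lower * (n - 1))))
    hi_idx = min(n - 1, max(0, round(upper * (n - 1))))
    return ordered[lo_idx], ordered[hi_idx]

# Hypothetical held-out accuracies for three problem formulations.
val_accuracy = {"lasso": 0.71, "random_forest": 0.78, "constant": 0.50}
passed = screen_by_predictability(val_accuracy, threshold=0.7)

# Hypothetical estimates of one target (say, the rank of a gene) obtained
# by refitting each model on perturbed versions of the training data.
estimates_by_model = {
    "lasso": [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
    "random_forest": [1, 3, 5, 7, 9, 11, 13, 15, 17, 19],
    "constant": [100, 100, 100],
}

# Pool estimates only from the models that survived screening.
pooled = [e for name in passed for e in estimates_by_model[name]]
interval = perturbation_interval(pooled)
print(passed, interval)
```

The point of the design is that the model which fails the reality check ("constant" here) never contributes to the reported interval, however stable its estimates are.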

25:35

Or a PCS p-value, as a form of evidence based on perturbation or stability analysis after model checking. Any questions? Yeah, I have a question. Very, very nice, but, um, from what I understand, a lot of your perturbation idea is validated using predictions: you check whether the predictions are correct or not. And that allows you to sort of get away from the model assumptions in a sense, if I understand it right.

26:10

But if you're interested in inference, if you don't only want to make good predictions but actually want to interpret some of the inference that you're doing, because you want to take a decision on something specific, what kind of stability checking are you going to do? Can you still use prediction there, or is there something else? Can you just do stability checks without more

26:37

modelling behind all this? Does that make sense? So the next part of my talk will answer your question to a certain extent. That's our first case where we actually go there, towards decision making. But we still use prediction as the screening, and then we have models, and then we compare against a null distribution under perturbation, as a reference distribution, and we come up with PCS p-values. So maybe if I go to the next part, that will give some answers.

27:09

So PCS inference is still pretty much the newest part. In the original paper we discussed this, and we had some ranking comparisons, but we didn't go formally to p-values or confidence intervals. So this project, which I will share, is trying to go there in the context of epistasis discovery. So please do ask questions as I go. Yes. Thanks. So the second part is on epistatic interaction, basically non-linear interaction between genes.

27:45

And this is part of a multi-discipline, multi-institute project called Multiscale Deep Learning and Single-Cell Models for Cardiovascular Health. So the Biohub, which is next to UCSF, is a West Coast kind of Broad Institute. It started a couple of years ago, and they called for an Inter-Campus Award, which needs people from all three campuses: Stanford, UCSF and Berkeley. And we teamed up with cardiologists from Stanford and UCSF.

28:23

And my colleague in statistics, Ben Brown, and I are the data scientists, and we have many postdocs who have been working on this. The first step of this multi-discipline, multi-institute project is really to locate non-linear interactions. And we developed this method called epiTree. The paper is out as a preprint and actually has a different name, "Learning epistatic polygenic phenotypes with Boolean interactions",

28:51

and it is being submitted to a human genetics journal, so it's kind of tailored to that. I want to acknowledge the two young people leading it, two bright young scientists: Merle Behr, who was my postdoc and has now returned to Germany, and Karl Kumbier, my former student now at UCSF. And the two senior people are James Priest, who is a cardiologist from Stanford, and myself.

29:18

So epistasis is really a name for non-linear interaction, and it might have been given by Fisher in a paper in 1919. And this is a classical, textbook case of epistasis: you have Drosophila, the fruit fly, with brown eyes and with scarlet eyes. Brown actually lacks the red pigment in the Drosophila, and scarlet lacks the brown pigment; hence the names.

29:48

Now, with the pure lines, if you cross them and get F1, you get red-brown eyes, so you know the genes kind of work together. But if you do the second-generation crossing, you actually get a lot more: you get brown, scarlet, and you get white. So this is an example of non-linear interaction: white is something you had never seen before.

30:12

Something new occurred when you do the second generation, and the genes interact. And Fisher basically formulated this with a statistical model with a multiplicative interaction. Talking about problem formulation: he described this non-linear interaction pretty vaguely, but then it got translated into a logit model with a linear term and, statistically, a polynomial term.

30:37

The problem is that this depends on the scale, which is now logit, and P is the penetrance, which is the probability of getting, say, a particular phenotype given genes A and B. It depends on the scale, right? If you have something multiplicative and you take the log scale, it becomes additive. So interaction is not well defined. And there's also a mathematical theorem saying that any function, if you find the right scale, can become additive. So the scale issue was bypassed: people just took logit.
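The scale dependence can be written out in one line. The notation is an assumption for illustration, not from the talk: $P_{ab}$ is the penetrance for genotypes $a$ and $b$, and $\alpha_a$, $\beta_b$ are hypothetical per-gene factors. A purely multiplicative penetrance has no interaction term on the log scale, but the same model does have one on the logit scale, because the last term below does not split into a function of $a$ plus a function of $b$.

```latex
% Assumed notation: P_{ab} = penetrance, \alpha_a and \beta_b = per-gene factors
P_{ab} = \alpha_a \, \beta_b
\quad\Longrightarrow\quad
\log P_{ab} = \log \alpha_a + \log \beta_b ,
\qquad
\operatorname{logit} P_{ab}
  = \log \alpha_a + \log \beta_b - \log\bigl(1 - \alpha_a \beta_b\bigr).
```

So whether two genes "interact" in the statistical sense depends on the scale chosen for the penetrance.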

31:08

And there is also evidence showing that multiplicative on the logit scale might not be the right thing biologically. And when we have something like 10 thousand variants, as you see in our data, then it is computationally intractable: the number of polynomial terms grows quickly. So you have to cut through this dimensionality, both for computational and also for statistical reasons, because often the genes are not important marginally, alone, but they are important when they interact.

31:42

So we decided that we want to prove the concept of the methodology before we go to cardiovascular disease, because finding epistasis for any human trait has been very challenging. We have access to the UK Biobank data, so we decided to go for something relatively easy, a human trait, which is red hair, which is self-reported. And we have about 500 thousand individuals. And we end up with, I think, about 30 thousand of them: fifteen thousand of them are self-reported red-haired people.

32:18

And then we just matched another random sample of the same size to make the thirty thousand. People think red hair is genetic and believe it is controlled by epistasis. And it's pretty common, so we don't have a problem getting the data. So we want to design something which is flexible, non-linear in nature, and choose a scale which makes more sense to us. And then we want to detect interactions of order higher than two.

32:53

So the data came with something like 10 million SNPs, or variants, and there's a very common data reduction scheme, a pipeline developed based on GTEx data from different tissues, called PrediXcan, which imputes gene expression from the variants. So that's what we used. For cardiovascular disease, actually, it's not a good idea, because a lot of the available tissues don't really have the right signal.

33:25

But for pigmentation, it's OK. So that's the first step we did: impute the gene expression, as dimensionality reduction. And then we use something called iterative random forests, which I will introduce; basically, we look at genes that fall on the same path. And we have an efficient way of searching for them. Then these genes get selected as a kind of interaction, which is also model selection.

33:54

And then we use the selected genes to map back to the variants, so we reduce the dimensionality of the SNPs, and then we do the same thing. So there are two pipelines; I will mainly concentrate on the gene expression pipeline. So for the positive control, the red hair phenotype from UK Biobank, we end up with 30 thousand subjects, as I mentioned, balanced: fifteen thousand cases with red hair and fifteen thousand with other self-reported hair colours.

34:24

Pipeline one uses iterative random forests, iRF, which I will introduce, and it selects stable, predictive interactions of order two or higher. And each interaction gets a score, called the stability score, and that's how we rank them. And we just cut it at 0.5. So iterative random forests can be viewed as a special case of PCS as well. The paper was published in 2018, and the two bright young people who did

35:01

the work are Sumanta Basu, now an assistant professor at Cornell, and Karl Kumbier again, at UCSF; the two senior leaders are Ben Brown and myself. The project started probably around 2016, or 2015. Ben has been working on genomics for a long time; I was not. And Ben had been using random forests and really liked them because they gave good predictive results. But they were very hard to interpret, because if you change the data a bit there is a stability issue:

35:35

the genes on the same path would change a lot. Using genes on the same path as an interaction is not our idea; many people have done that. It says that if two genes fall on the same path of a decision tree in a random forest, we kind of make this leap and say they might interact biologically. Being on the same path is a mathematical interaction, a regression interaction, but we believe they might interact biologically.
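To make "genes on the same path" concrete, here is a small sketch that walks one decision tree and collects the set of split features on each root-to-leaf path, then counts how often each pair of genes shares a path. The nested-dict tree format, the gene names and the function name are hypothetical choices for illustration; a real random forest library would expose its trees differently.

```python
from itertools import combinations

def path_feature_sets(node, on_path=frozenset()):
    """Yield the set of split features along each root-to-leaf path."""
    if "leaf" in node:
        yield on_path
        return
    here = on_path | {node["feature"]}
    yield from path_feature_sets(node["left"], here)
    yield from path_feature_sets(node["right"], here)

# Hypothetical tree: geneA at the root, then geneB, then geneC.
tree = {
    "feature": "geneA",
    "left": {"leaf": 0},
    "right": {
        "feature": "geneB",
        "left": {"leaf": 0},
        "right": {
            "feature": "geneC",
            "left": {"leaf": 0},
            "right": {"leaf": 1},
        },
    },
}

paths = list(path_feature_sets(tree))

# Count, for every pair of genes, the number of paths on which both are split.
pair_counts = {}
for path in paths:
    for pair in combinations(sorted(path), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1

print(pair_counts)
```

Pairs that co-occur on many paths, across many trees and many perturbed datasets, are the candidate interactions; co-occurrence in a single tree is exactly the unstable quantity the speaker warns about.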

36:07

So what we did was add stability to random forests, and if you follow this protocol, you actually improve predictive accuracy. We use the importance index to do weighted sampling of features for the next iteration of random forest. And we also use random intersection trees, by Shah and Meinshausen, to find intersection paths. And then we have an outer-loop bagging to assess stability. So you start with uniform weights on the features.

36:35

And then you reweight according to importance. So you iterate, usually three to five times; this is enough, from my experience. And so some genes get emphasised and most genes get shrunk. But we don't delete any genes. Of course, if a gene's weight is very small, you never see it. But this way, you allow something not so important at first to enter, because you are constantly reweighting. The thing is, the importance measure, which we're still trying to understand, is not just a main effect.
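The reweighting step can be illustrated with importance-weighted sampling of candidate split features. This is a sketch under assumptions, not the iRF implementation: the gene names, the importance values and `sample_split_candidates` are hypothetical, and the importances are simply given rather than computed from a fitted forest.

```python
import random

def sample_split_candidates(weights, m, rng):
    """Draw up to m distinct features, each draw proportional to current weight."""
    names = list(weights)
    w = [weights[name] for name in names]
    chosen = []
    while len(chosen) < m and any(x > 0 for x in w):
        pick = rng.choices(range(len(names)), weights=w, k=1)[0]
        chosen.append(names[pick])
        w[pick] = 0.0  # sample without replacement within one split
    return chosen

genes = ["geneA", "geneB", "geneC", "geneD"]

# Iteration 1: uniform weights, as in an ordinary random forest.
uniform = {g: 1.0 / len(genes) for g in genes}

# Later iterations: weights come from the previous forest's importances.
# These numbers are made up; no gene is deleted, some just become rare.
importance = {"geneA": 0.60, "geneB": 0.30, "geneC": 0.08, "geneD": 0.02}

rng = random.Random(0)
first_draw = sample_split_candidates(uniform, m=2, rng=rng)

# How often does each gene appear as a split candidate under the new weights?
freq = {g: 0 for g in genes}
for _ in range(2000):
    for g in sample_split_candidates(importance, m=2, rng=rng):
        freq[g] += 1

print(first_draw, freq)
```

Because weights are never set to zero by the reweighting itself, a gene with a small weight can still be drawn occasionally and re-enter in a later iteration, which is the behaviour described above.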

37:11

It has more to it than that, because there's a lot of correlation among genes. So using the weighting is not the same as fitting an additive model and looking at main effects. When we build all these trees, we just use the same random forest kind of algorithm, changing only the weights.
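The reweighting loop can be sketched in a few lines. This is a toy stand-in and not the iRF implementation: each "tree" is a single-split stump on a 0/1 feature, importance is the summed Gini decrease, and all names and default values (`n_trees`, `mtry`, the bit-pattern data) are assumptions made for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_decrease(X, y, j):
    """Impurity decrease from splitting on the 0/1 feature j."""
    left = [yi for xi, yi in zip(X, y) if xi[j] == 0]
    right = [yi for xi, yi in zip(X, y) if xi[j] == 1]
    n = len(y)
    return gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)


def iterative_reweighting(X, y, n_iter=3, n_trees=50, mtry=2, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    weights = [1.0 / p] * p  # start with uniform feature weights
    for _ in range(n_iter):
        importance = [0.0] * p
        for _ in range(n_trees):
            # Bootstrap the rows, then draw candidate features by current weight.
            rows = [rng.randrange(len(y)) for _ in range(len(y))]
            Xb, yb = [X[i] for i in rows], [y[i] for i in rows]
            candidates = set(rng.choices(range(p), weights=weights, k=mtry))
            best = max(candidates, key=lambda j: gini_decrease(Xb, yb, j))
            importance[best] += gini_decrease(Xb, yb, best)
        total = sum(importance)
        # Reweight by importance; no feature is removed from the pool.
        weights = [imp / total for imp in importance]
    return weights


# Toy data: 64 rows, 5 binary features (the bits of the row index); y equals feature 0.
X = [[(i >> j) & 1 for j in range(5)] for i in range(64)]
y = [row[0] for row in X]
weights = iterative_reweighting(X, y)
```

After a few iterations nearly all of the sampling weight sits on the signal feature, while the noise features keep small weights rather than being dropped.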

37:31

And we have to collect the different paths. That's why we use a generalisation of random intersection trees, which came from market basket analysis: you have sets coded as zero-one, like men and women going shopping, and you want to see the shared items purchased by men or by women. So we turn this into a zero-one problem by taking each path and turning it into a zero-one vector. If gene X1 is split on, it gets a one. So you look at the path and put ones for the genes that get split on.

38:05

Everything else gets a zero. So you turn it into the same problem that Shah and Meinshausen were looking at. They have a randomised algorithm to collect the shared parts of paths, and they have some results showing that if the shared paths are sparse, you have a good probability of getting them. And then, after you have all these collections, each collection gets a score by bootstrap.
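The path encoding and the idea of intersecting randomly chosen paths can be illustrated as below. This is a simplification: the published random intersection trees algorithm grows trees of intersections, whereas this sketch intersects a short chain of randomly sampled paths. The function names, the parameters `n_draws` and `depth`, and the gene labels are all hypothetical.

```python
import random


def path_to_indicator(path_features, all_features):
    """0/1 vector: 1 if the feature was split on along the path, else 0."""
    on_path = set(path_features)
    return [1 if f in on_path else 0 for f in all_features]


def random_intersections(paths, n_draws=20, depth=3, seed=0):
    """Intersect `depth` randomly chosen paths, `n_draws` times, and collect
    the non-empty feature sets that survive."""
    rng = random.Random(seed)
    survivors = set()
    for _ in range(n_draws):
        chosen = rng.sample(paths, depth)  # distinct paths, without replacement
        shared = set(chosen[0])
        for other in chosen[1:]:
            shared &= set(other)
        if shared:
            survivors.add(frozenset(shared))
    return survivors


features = ["A", "B", "C", "D", "E", "F"]
# Four decision paths; each splits on A and B plus one path-specific gene.
paths = [["A", "B", "C"], ["B", "A", "D"], ["A", "E", "B"], ["F", "A", "B"]]
indicator = path_to_indicator(paths[0], features)
found = random_intersections(paths)
```

Because the path-specific genes differ, any three distinct paths share only A and B, which is the sparse shared set the intersections recover.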

38:35

So we use random forests for the P part, prediction; random intersection trees for computation; and stability, which is added on top of the trees, to get interactions selected. For the original paper we used Drosophila data for the prediction. There we discovered about 20 pairwise interactions, and 80 percent of them had actually already been verified: there were physical experiments, starting from the 90s, verifying that the things we found actually interact biologically.

39:11

So with the red hair data it looked like this; these are ROC curves. We have four curves here. The best is the green one, which is iterative random forest at the SNP level. The next one is iterative random forest at the gene level, and the others are lasso at the gene level and ranger, which is a version of random forest. You can see that stability actually outperforms random forest here, which was not the case for the Drosophila data.

39:44

There, on the data we worked on, it didn't improve; the prediction performance was similar. So it says that here random forest by itself doesn't have enough regularisation, and stability added more regularisation and helped prediction. This is on held-out test data. Also, the benchmark is lasso; lasso has been used for red hair in previous work. So we do quite a bit better. And then we look at the genes discovered by iRF.

40:17

And then you can go and look at the GO terms, which James did, and find that they seem to make sense. This is all heuristic. James also looked into protein-protein interaction enrichment, and found that the genes seem to make sense. But this is all heuristic, just a sanity check. So this is a screening. Going back to epistasis: now I have these interactions. The output is that A and B interact, or A, B and C interact.

40:52

Now we want to look at how we are going to test these epistatic interactions. In terms of interaction form, you can go multiplicative. You can directly fit CART, a decision tree, which is an interpretable Boolean form of interaction. Or you can just do random forests, which is not very interpretable, but you have a nonlinear surface. In terms of scale, you can do logit, as Fisher did, or you can do penetrance, which is pretty straightforward.

41:21

So we felt that CART is interpretable, as in our previous work on interpreting interactions in biology. And we felt that the penetrance scale makes more sense. So we just went for penetrance and not logit, because there is no really good biological argument for going logit; it seems to be a tradition. So what is penetrance? Penetrance is the probability of red hair given genes A and B. So it's just a probability in binary classification.

41:57

But in genetics, it's called penetrance. Fisher would take the logit of the penetrance and write it as main effects and an interaction effect. We don't do the logit part; we just directly model the probability of red hair given genes A and B with a CART-like decision tree form.
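Written out, the two scales look as follows. The notation is introduced here for illustration and is not from the slides: Y = 1 denotes red hair, a and b are the values of genes A and B, and the beta coefficients are the usual logistic regression parameters.

```latex
% Penetrance: the conditional probability of the phenotype
\pi(a, b) = P(Y = 1 \mid A = a,\ B = b)

% Logit scale: main effects plus a multiplicative interaction effect
\log \frac{\pi(a, b)}{1 - \pi(a, b)}
  = \beta_0 + \beta_A\, a + \beta_B\, b + \beta_{AB}\, a b
```

On the logit scale, "no interaction" means the last coefficient is zero. On the penetrance scale, the same question is asked of the probability itself, without the logit transform.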

42:20

Thanks for the question. Just to give you some sense that the CART model is reasonable, and to justify working on the penetrance scale: on the right-hand side are the smoothed proportions of red hair against two genes. One is ASIP, which is a well-known red hair gene, and the other is TUBB3. You see the stripes because this is imputed data, imputed from discrete data. The stripes are there because SNPs take discrete values, zero, one.

42:51

On the left-hand side, you have four model-fitted surfaces. The first is just linear; the gene expression is continuous, so it is smooth. The other one adds the multiplicative interaction term, and you can see they look very similar. And for CART, we fit CART to A and CART to B and make them additive. And then the epistasis model, which is not additive, fits the CART interaction term.

43:24

And actually, if we use our PCS p-value, it turned out to be significant; I'll explain how we do that. You can see that the stripes are captured better by the CART model. And it is somewhat interpretable. So this is the example of ASIP and [INAUDIBLE], with the model fitted on the training data. It is very interpretable: if [INAUDIBLE] is less than minus 0.5, you don't have red hair. And then that's what you have, right?

43:58

So these are the probabilities at the end nodes. And then the CART here has [INAUDIBLE] level two. That's the null, the no-epistasis model. And this model just fits the interactions together, so we can see the interaction. The first split is on A. And then, depending on how strong A's expression is, you're going to split differently: on B, or you split on A again. So this is a nonlinear interaction, right?
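The difference between the no-epistasis model (a tree on A plus a tree on B) and the epistasis model (one tree on both genes) can be made concrete with a toy example. All thresholds and leaf probabilities below are made-up placeholders, chosen only so that the additive model has zero interaction contrast on the penetrance scale and the joint tree does not.

```python
def stump(x, threshold, low, high):
    """One-split tree on a single gene: `low` below the threshold, else `high`."""
    return low if x < threshold else high


def null_penetrance(a, b):
    """No-epistasis model: a tree on A plus a tree on B (additive in penetrance)."""
    return stump(a, -0.5, 0.01, 0.10) + stump(b, 0.0, 0.01, 0.05)


def epistasis_penetrance(a, b):
    """Epistasis model: one tree on (A, B); what happens to B depends on the A branch."""
    if a < -0.5:
        return 0.02  # low A: B is never consulted
    return stump(b, 0.0, 0.05, 0.40)  # high A: B matters a lot


def interaction_contrast(model, a_lo, a_hi, b_lo, b_hi):
    """Difference of differences; zero for any model additive in A and B."""
    return (model(a_hi, b_hi) - model(a_hi, b_lo)
            - model(a_lo, b_hi) + model(a_lo, b_lo))


null_c = interaction_contrast(null_penetrance, -1.0, 1.0, -1.0, 1.0)
epi_c = interaction_contrast(epistasis_penetrance, -1.0, 1.0, -1.0, 1.0)
```

The contrast is zero (up to floating-point error) for the additive model and 0.35 for the joint tree, which is the sense in which only the second one encodes an interaction.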

44:38

So now the inference; I think maybe this answers the question. Now I have all these interactions found through iterative random forests. I'll just take the first line and decode what I plotted there. Concentrate on the first two bars, orange and black. For this interaction, this is the prediction error on the test set. Orange is the null model, which is bigger than the epistasis model, which is black, and the vertical black line is a reference prediction error.

45:17

That is the error if you use all the genes found by iRF, not just the two, and then use RF to do the prediction. Of course you do worse, because you only use two of them. The PCS framework says that we use this as a screening. So unless the black bar is shorter than the orange bar, I'm not going to compute any p-value. If the black bar is longer than the orange bar, I just call the p-value one. So we're done; we're not going to do any further evidence seeking between the null model and the epistasis model.

45:56

The epistasis model needs to do better: its prediction error bar needs to be shorter, otherwise we're done. So that's why you don't see any further bars for the ones where it's the other way around. When the orange bar is longer than the black bar, which says the epistasis model is better, you go on to the calculation of the PCS p-value. And what you see in the legend on the right gives you the PCS p-value.

46:28

I'll explain how we did the calculation. So this slide shows all the interactions being screened out because of prediction, and for the ones that pass the prediction screening, we go on to calculate the PCS p-value. At a high level, as I said, if epistasis gets worse prediction, we just set the p-value to one. Otherwise, at a high level (I will give you details on the next slide), the p-value is calculated based on a refined prediction comparison between the two models.

47:03

This takes into account test data variability. So we sample from the test data, and as a result we have more reasonable p-values than if you just use the multiplicative logistic chi-squared test with its approximation, which gives very, very small ones, sometimes even ten to the minus 17. I illustrate this with only two genes, but it seems to work for higher-order interactions. So here is some detail. After passing the screening, we just sample. We have the models already fitted.

47:40

That fitting uses the training data. From the null model, the additive model, you can use the genes to predict the probabilities. So you have the test set, and you sample from the null distribution, using the probabilities you get by passing the genes through the null model. That's the null sample. And then for the alternative, we also take a random sample from the test data, and then calculate a test statistic, which is basically a likelihood ratio.

48:24

It is null over alternative, and then we just calculate. For each bootstrap sample we have a null sample and a bootstrapped data sample, and we compare which has stronger evidence, and then we just average over the bootstrap samples. You can do a normal approximation, so the number of bootstrap samples doesn't have to be so big; otherwise it is very heavy computation. Some comments are in order: the PCS p-value is conservative.
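The screening step and the bootstrap comparison can be sketched as follows. This is one reading of the description in the talk and not the published epiTree procedure: it assumes log loss as the screening error, reuses the same bootstrap indices for the null sample and the data sample, and takes the p-value to be the fraction of bootstrap rounds in which the null sample looks at least as favourable to epistasis as the real data. The function name and its parameters are hypothetical, and predicted probabilities are assumed to lie strictly between 0 and 1.

```python
import math
import random


def log_lik(y, probs):
    """Bernoulli log-likelihood of labels y under predicted probabilities."""
    return sum(math.log(p if yi == 1 else 1.0 - p) for yi, p in zip(y, probs))


def pcs_p_value_sketch(y, p_null, p_alt, n_boot=200, seed=0):
    """y: test labels; p_null / p_alt: test-set penetrances predicted by the
    null (additive) and epistasis models, both fitted on training data."""
    # Prediction screening: the epistasis model must predict better on the test set.
    if -log_lik(y, p_alt) >= -log_lik(y, p_null):
        return 1.0
    rng = random.Random(seed)
    n = len(y)
    hits = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap the test set
        pn = [p_null[i] for i in idx]
        pa = [p_alt[i] for i in idx]
        y_data = [y[i] for i in idx]  # bootstrapped data sample
        y_null = [1 if rng.random() < p else 0 for p in pn]  # sample from the null model
        # Log-likelihood ratio, null over alternative (smaller favours epistasis).
        t_data = log_lik(y_data, pn) - log_lik(y_data, pa)
        t_null = log_lik(y_null, pn) - log_lik(y_null, pa)
        hits += t_null <= t_data
    return hits / n_boot


# Toy test set in which the epistasis model is clearly right.
x = [0] * 100 + [1] * 100
y = list(x)
p_null = [0.5] * 200
p_good = [0.9 if xi == 1 else 0.1 for xi in x]
p_bad = [0.1 if xi == 1 else 0.9 for xi in x]
p_strong = pcs_p_value_sketch(y, p_null, p_good)
p_screened = pcs_p_value_sketch(y, p_null, p_bad)
```

With the well-fitting epistasis predictions the returned value is close to zero; with predictions that are worse than the null, the screening step returns one without any resampling.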

48:56

We see that the logistic examples have smaller p-values. And for some stylised models, we can do some calculation, which is still ongoing, showing that the PCS null distribution seems to be a fattened version of the sharp null distribution you get if you come in with a precise model. And we see that its p-value is smaller than the PCS p-value. So it's kind of a more robust form of traditional hypothesis testing. But here we didn't deal with data cleaning perturbation.

49:30

We could put that in as one form of perturbation; we are only addressing perturbation at the modelling stage. So, I'm kind of running out of time, so I'll just say that the interactions we find at the pairwise level identify the two important red hair genes: MC1R, [INAUDIBLE], and ASIP, which I think is on chromosome 20, and MC1R is on chromosome

49:59

Sixteen. We find a lot of genes which are near one of the two genes, because they are next to each other and have similar functions. We also have, you know, a package, superheat, by my student Barter, that we can use to look at this. So on the right is the superheat plot; these are features based on the different interactions.

50:21

If you look at the columns, they are different individuals. You almost see three distinct groups. For the red hair people, the first block, it lights up across the different interactions of different genes. The middle group lights up only on [INAUDIBLE]. The last has yet another pattern. So this is a way to look at who the red-haired people really are, and the people who are not; some actually have a somewhat similar signature.

50:53

So to summarise, we actually find order-three interactions, which haven't been seen before. We also have some new discoveries, but these need to be validated; this is more suggestive. And we recovered the genes people already know. So to summarise: we have this whole pipeline of imputation, iRF for nonlinear interaction selection, and the PCS p-value to make the decisions on each of the discovered sets of interacting genes.

51:26

And then we make a decision, and you can do a similar thing for the SNPs. To summarise, we propose this veridical data science framework through the three principles and documentation, and my group has done eight different studies. So we're pretty confident about this as a conceptual framework. There's still a lot of detail to be worked out. Three or four of them were actually the motivating examples I raised. [INAUDIBLE]

52:00

And the most recent work we did is with an ER doctor from UCSF. It really uses PCS to stress-test an already existing clinical decision rule, to evaluate it. And actually the decision rule passed. The decision was how to treat a kid in the paediatric emergency room with an abdominal trauma injury: whether to send this kid for a CT scan or not. So we basically used PCS as a stress test to evaluate this, and it passed pretty well, also on an external study.

52:36

So PCS is useful both for developing decision rules and for evaluating existing ones. Domain knowledge is extremely important; you could see my judgement calls about why the CART decision tree makes sense in genomics. And we hope to generate hypotheses for external validation. So back to where I started, which was the Biohub project on heart health. This is a lot more challenging, right? We actually need new methods to really do it.

53:08

With the data we have, we have MRI, and we extracted different one-dimensional features, and there is just no predictability. And this is a rare disease: one in 500 people have this disease, in particular HCM, hypertrophic cardiomyopathy. It's like a thickening of your heart, the left ventricle wall. We don't see predictability and stability for those variables. So we need to get better data, or get better phenotypes, and also better HCM diagnosis.

53:47

We have the older people from the UK Biobank. We need younger patients' data for the genetics to really show. So right now we are really needing some help; we are kind of in a dark age for this project, which is a lot more difficult. It's well known that human disease analysis is difficult, but we hope we can access Stanford data; our collaborators there actually started collecting their own data.

54:16

So eventually we may have some good data, but right now, in the UK Biobank, the signal is very, very weak, or it's not there. Again, I want to thank my group. My group really concentrates on problem solving. But we recently returned to theory, for deep learning theory, [INAUDIBLE] with other people in this foundations of deep learning theory group. I also have a data science grant, [INAUDIBLE], with many people.

54:42

And we are finishing a first paper looking at the model selection property of random forests, by Merle and also my student Yu Wang. We're also finishing a book with my former student and current postdoc Rebecca Barter, with MIT Press, using the PCS framework and dealing with the entire data science life cycle. We try to help make the bridge from reality to symbols, and we use maths, but we do not intend it to be a maths book.

55:16

So thank you. I hope some of the ideas will be useful for your projects. All the papers and code for the last two are available. For iRF we also have a Spark parallel computation version, [INAUDIBLE], and we have all the code available. Thank you. Thank you. [INAUDIBLE] I think we have a few more minutes for questions, and I'll start off with Judith. I'm sorry, I have [INAUDIBLE].

55:56

Do you have a framework so that you can prove something theoretically when you're using your sort of workflow, in a sense, for this PCS type of inference? Yeah, we have a little bit of derivation and simulations for a very simple regression model and for the epiTree model. But for that, you have to assume a stochastic model, and then we can probably derive something, without the data cleaning.

56:26

You can't really do analysis on data cleaning. But for the model part, if you take the i.i.d. case, that's what we hope to do. So to prove anything, you would need to refer to some kind of probabilistic model of the data? Yeah, yeah. So we can probably do this. We're doing simulations under different misspecified models to show that our PCS p-value is more robust. I think that's possible. And we did quite a bit of simulation for a very simple case, a simple linear regression.

56:55

That is with only one covariate. And we show from simulation that when the null model is wrong, say with the variance wrong, not homoscedastic, we are more robust; we give more reliable results than the classical way. So that's actually ongoing; we do want to write that paper, to do analysis for stylised models so we can get some, you know, analytical insight into what's going on.

57:26

So I've been taking the approach that I want to see that things are useful first, before I add theory. I've kind of convinced myself that it's useful, and now I'm getting ready to take different stylised models and do some theoretical work. So hopefully we'll finish this year. Okay. The next question will be [INAUDIBLE]. Yeah, hi, thanks for the talk. Is there any connexion between

58:00

the methods you're using and the kind of things that are in the paper by Dukes and Vansteelandt, which kind of attempts to parametrise the model in such a way that even if your model is not quite correctly specified, it will still give useful answers, and so that you can kind of build machine learning in: you put on top whatever kind of machine learning algorithm you like for learning the nuisance parameters. I don't know if you're familiar with it?

58:30

I don't know that paper, but from your description I would think that at a high level they probably try to do the same thing, to get robustness. In the detailed implementation, they are probably more prescriptive than we are right now. We are pretty agnostic and just let people choose their models. For example, in our framework I think Bayesian models and non-Bayesian models can be integrated together

58:56

if you have the same target of interest, so that they are comparable. I think they are probably more like what people do when you do a confidence interval, say, in regression, and you have a misspecified model: you don't use the Fisher information, you have a sandwich covariance matrix. My guess is they're probably working in the spirit of that work. So, I mean, obviously it is focussed on robustness. But the point is that they don't just fit the model.
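For reference, the sandwich covariance mentioned here has the following standard form. The notation is introduced for illustration: \ell_i is the log-likelihood contribution of observation i under the working model and \hat\theta is the fitted parameter.

```latex
A = -\frac{1}{n} \sum_{i=1}^{n} \nabla^2_\theta\, \ell_i(\hat\theta),
\qquad
B = \frac{1}{n} \sum_{i=1}^{n}
    \nabla_\theta \ell_i(\hat\theta)\, \nabla_\theta \ell_i(\hat\theta)^{\top}

\widehat{\operatorname{Var}}(\hat\theta) = \frac{1}{n}\, A^{-1} B\, A^{-1}
```

When the working model is correctly specified, A and B estimate the same Fisher information and the expression reduces to the usual inverse-information variance; under misspecification they differ, and the sandwich form remains valid.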

59:29

They fit it in such a way that it goes through estimating equations, not so much through the likelihood, and the estimating equation is chosen in such a way that even if the model is not correctly specified, you still get a result that you can interpret. And that allows them to sort of play around with the nuisance model, so they can have something really, really complicated for that.

59:50

And even if it's not doing exactly what they want, they know that the sort of metric they're interested in will still be interpretable in a nice way. So, yeah, I'm sure there's a high-level connexion, but specifically, I think we're not very prescriptive yet. We're leaving a lot of room for the user to make choices. So I would say that that's the particular way they made a choice to bring in stability. And if I can think of this as a way to,

01:00:26

you know, introduce a model perturbation, I can probably put it under this broad framework. But for them, it's pretty specific. Kind of related to what they did, there is something called double robustness, double machine learning, in causal inference. There, people have two different models, and then they have a two-step method, which will adapt to either. But people also show that if neither is true, then the thing can be worse. So I think it definitely sounds

01:00:54

related, but we are pretty conceptual right now; we don't really prescribe exactly. We made different choices in the different contexts of the problems, mostly a lot related to random forests. [INAUDIBLE] But I think the paper you mention is probably very interesting and worth taking a look at. Email me at Berkeley. Thanks. We're running out of time, but I wonder whether there might be some more questions.

01:01:28

Yeah, feel free to send me an e-mail. I'd be happy to continue the discussion. I'm waiting to see whether there are any more questions. Cool. I guess if there are no more questions, [INAUDIBLE].

Transcript source: Provided by creator in RSS feed.
