(Not) Aggregating Data: The Corcoran Memorial Lecture

00:00

...for our Corcoran Memorial Lecture. I wanted to start by explaining what this series of memorial lectures is. This is this year's Corcoran Memorial Lecture. It is named in memory of Stephen Corcoran, who was a graduate student in the Department of Statistics until his death in 1996. He had been an undergraduate at Oxford and got first class honours in mathematics in 1991.

00:31

Then he went and studied for a diploma in mathematical statistics at Cambridge and returned to study for his DPhil in statistics at Oxford. Every other year we have a Corcoran Memorial Prize, but without fail, annually, we have a Corcoran Memorial Lecture. This year is not a prize year, so we had free rein to choose our topic and our speaker. And so we asked Kerrie Mengersen to come and speak, and we're very pleased that she agreed to do so.

01:07

She is a Distinguished Professor of Statistics at Queensland University of Technology, which is in Brisbane, Australia. She is the deputy director of the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers. She's a fellow of the Australian Academy of Science and the Australian Academy of Social Sciences. She's also a member of various national and international statistical societies.

01:34

She has wide-ranging research interests in both mathematical statistics and its applications in areas including health, environment and industry. In particular, she focuses on Bayesian methods. She has very kindly agreed to present to us today, and we have this morning-scheduled lecture because she is speaking to us from Australia in these COVID-restricted times. So I'm just about to hand over to her. We will have a question and answer session at the end.

02:08

So if you have questions, please type them into the chat and I will read them out to her, or a selection of them, because we probably won't be able to answer all of them at the end. And otherwise, thank you very much for joining us. I will hand over to Kerrie for the lecture.

Thank you very much, Crystal, and thank you very much for the invitation to present this lecture. And thank you to the people who have given up some morning time to attend.

02:42

As you heard, I'm in Australia. It would be great to be there, but this is the way it is at the moment. So I'd like to just share my screen. Can you see my screen? Yes. That's great. OK. Thank you. So I'd like to talk to you about not aggregating data, and I'm going to do this by first presenting a couple of case studies where the imperative for this

03:22

area of work has come about. And then I'd like to talk about some of the work in progress in this area. So this is new work that we're doing, and if people are involved in this or have some suggestions or comments, then I'd be very excited to hear about them.

03:40

It's certainly not complete at the moment. So, the two case studies. The first one is the Australian Cancer Atlas. The Australian Cancer Atlas is an interactive online atlas that brings together cancer data, GIS location details, digital earth technology and statistical models to provide an online interactive map of about 20 cancers in Australia at the small area (SA2) level. There are about 2,000 of these areas across Australia.

04:21

The underpinning technology, or modelling, in this atlas is a Bayesian statistical model. So we have here y_i, the number of cancer cases in a small area, and E_i, the expected number of cancer cases in that area, age and sex matched. This is, of course, a Poisson model. So we model our number of cancer cases as Poisson, and we have each of the mu_i being the relative risk of the kind of cancer in that particular area.
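For reference, the Poisson model described in this passage can be written out as below. This is a sketch of the standard disease-mapping form; the symbols y_i, E_i and mu_i follow the talk, while the crude ratio on the last line is the usual unsmoothed estimate and is added here only for orientation.

```latex
\begin{aligned}
y_i \mid \mu_i &\sim \mathrm{Poisson}(E_i\,\mu_i), \qquad i = 1, \dots, n, \\
\mathbb{E}[y_i \mid \mu_i] &= E_i\,\mu_i, \\
\widehat{\mathrm{SIR}}_i &= y_i / E_i \quad \text{(crude, unsmoothed estimate of } \mu_i\text{)}.
\end{aligned}
```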

04:55

So mu_i is then modelled, on the log scale, in terms of some potential covariates x_i and also a spatial term S_i. This spatial term is going to incorporate the information of the neighbours of the area, given that we are in geographical space. We have a number of options for that spatial term. For example, the Besag-York-Mollié (BYM) model, a model proposed by Besag et al. in 1974, splits that S_i, the residual, into two parts.

05:33

One is the spatial component u_i and one is an unstructured component v_i. The spatial component u_i depends on the neighbours, u_{~i}. That random effect is going to be normally distributed, centred on the average of the random effects in the neighbourhood of u_i. Now, an alternative to that is from Leroux and his colleagues in 2000. Here we model S_i as u_i, and u_i has two components.

06:09

It's almost like a mixture. It has a component that has something to do with the spatial neighbourhood and also a component that is the overall variance. So we have this mixture model that is governed by a parameter rho, which says how much weight we want to put on the spatial neighbourhood and how much we want to regress towards the overall mean.
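A minimal sketch of the Leroux idea in code, assuming the standard formulation in which the precision matrix of the spatial effects is Q = rho (D - W) + (1 - rho) I, with W a 0/1 adjacency matrix and D the diagonal matrix of neighbour counts. The function names and the three-area example are hypothetical and are not taken from the atlas itself.

```python
def leroux_precision(adjacency, rho):
    """Precision matrix (up to a scale factor) of a Leroux CAR prior.

    Q = rho * (D - W) + (1 - rho) * I, so rho = 0 gives independent
    effects and rho = 1 gives the intrinsic CAR (neighbour-only) prior.
    """
    n = len(adjacency)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        n_i = sum(adjacency[i])  # number of neighbours of area i
        for j in range(n):
            if i == j:
                q[i][j] = rho * n_i + (1.0 - rho)
            else:
                q[i][j] = -rho * adjacency[i][j]
    return q


def conditional_mean_and_variance(adjacency, u, i, rho, sigma2=1.0):
    """Mean and variance of u_i given all the other effects."""
    n_i = sum(adjacency[i])
    denom = rho * n_i + (1.0 - rho)
    neighbour_sum = sum(u[j] for j in range(len(u)) if adjacency[i][j])
    return rho * neighbour_sum / denom, sigma2 / denom


# Hypothetical example: three areas in a line, 0 - 1 - 2.
W = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
u = [1.0, 0.0, 3.0]
mean_spatial, var_spatial = conditional_mean_and_variance(W, u, 1, rho=1.0)
mean_indep, var_indep = conditional_mean_and_variance(W, u, 1, rho=0.0)
```

With rho = 1 the middle area is pulled to the average of its two neighbours; with rho = 0 it is shrunk to the overall mean of zero, which is the weighting between neighbourhood and overall mean that the speaker describes.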

06:41

Now, this Leroux model then has only this one parameter, rho, and turns out to be quite a good model for the kind of spatial analysis that we want to do in the Cancer Atlas. So the Cancer Atlas then, as I said, is this online model, and I'll just show you some of the features of it. You can pull down any of the cancers you like.

07:05

You can get information on those cancers, in particular the visualisations of the probability of being above the national average or below the national average. We're talking here about uncertainty in estimates, and we're also talking about probabilities. We're also able to compare areas and to provide the information in the form of the modelled estimates at the small area level. We're going to extend the Cancer Atlas to space-time.
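The probability of an area being above the national average is, in a Bayesian model, typically computed as the share of posterior draws of the relative risk that exceed one. The sketch below assumes that is what is being visualised; the function name and the draws are made up for illustration.

```python
def exceedance_probability(draws, threshold=1.0):
    """Share of posterior draws of the relative risk above the threshold."""
    if not draws:
        raise ValueError("need at least one posterior draw")
    return sum(1 for d in draws if d > threshold) / len(draws)


# Hypothetical posterior draws of the relative risk for one small area.
draws = [0.9, 1.1, 1.3, 1.2, 0.95, 1.05, 1.4, 1.15]
p_above = exceedance_probability(draws)  # 6 of the 8 draws exceed 1
```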

07:41

And already it's shown quite a lot of information in terms of, for example, spatial inequalities. This chart here shows us the difference between major cities, regional and remote areas, for females, for different kinds of cancers. You can see here some really startling differences in the standardised incidence ratio on the left axis there. This is all compared to the national average, which is one, and there are some stark differences for people in remote areas.

08:16

You can also get survival information from this. We've been able to provide relative survival curves and relative survival estimates at the small area level. We can then look at localised and advanced relative survival curves and the differences between the different areas, so remote areas and cities. And you can see here that there is this kind of real difference that this atlas is able to reveal.

08:47

And we can ask questions like how many premature deaths could be prevented if there were no spatial inequalities. By being able to quantify these terms, this has really led to a real media release, a real public acknowledgement of spatial differences in cancer across our country. It's also led to a change in government policy in terms of the travel subsidy for people in the bush to get treatment in the city. So we think this is successful.

09:25

And what's interesting for us is that this is a product that has underpinning it a spatial statistical model, a Bayesian model. Now, more recently we've also been looking at spatial Bayesian empirical likelihood, so that we can avoid some of the parametric assumptions in our modelling.

09:44

The model here looks the same as before. But instead of assuming a normal distribution for the log relative risk, we're going to have estimating equations and use these estimating equations to derive our parameter estimates. So we have the typical estimating equations here. We're going to have the constraints on the mean and the variance. We also have priors on this, and we can obtain the estimates.
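To make the estimating-equation idea concrete, here is a minimal sketch of ordinary (non-spatial, non-Bayesian) empirical likelihood for a single mean constraint. It is not the spatial model from the talk: the data, the function name and the use of bisection to find the Lagrange multiplier are all illustrative assumptions.

```python
import math


def empirical_likelihood_mean(x, mu, iterations=200):
    """Log empirical likelihood ratio and weights under sum w_i (x_i - mu) = 0."""
    g = [xi - mu for xi in x]  # estimating function at each data point
    n = len(g)
    if min(g) >= 0.0 or max(g) <= 0.0:
        return float("-inf"), []  # mu lies outside the range of the data
    # The multiplier must keep every 1 + lam * g_i strictly positive.
    lo = -1.0 / max(g) + 1e-10
    hi = -1.0 / min(g) - 1e-10
    for _ in range(iterations):
        lam = 0.5 * (lo + hi)
        score = sum(gi / (1.0 + lam * gi) for gi in g)
        if score > 0.0:  # the score decreases in lam, so the root is larger
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    weights = [1.0 / (n * (1.0 + lam * gi)) for gi in g]
    log_ratio = sum(math.log(n * w) for w in weights)
    return log_ratio, weights


data = [1.0, 2.0, 4.0, 5.0]
at_mean, w_mean = empirical_likelihood_mean(data, 3.0)  # mu at the sample mean
off_mean, w_off = empirical_likelihood_mean(data, 2.0)  # mu away from it
```

The weights replace the parametric likelihood: they are as close to uniform as possible while still satisfying the constraint, and the log ratio falls as the hypothesised mean moves away from the data.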

10:14

The new part here is including the spatial random effect in these empirical likelihood models. The kinds of priors that we can have for our random effects are, for example, an independent Gaussian prior (excuse the spelling of Gaussian). We have conditional autoregressive priors like the BYM and Leroux, as before.

10:37

We also have, for example, a generalised Moran basis prior that has been used in previous work on spatial empirical likelihood, in the paper by Porter et al. There's not a lot of work on spatial empirical likelihood, and this was a contribution to that effort. Now the simulation results, when we look at the empirical likelihood approach compared with the parametric approach.

11:05

You can see here a simulation study where we have high autocorrelation, low autocorrelation and also outliers. We have that for a small number of areas and also a larger number of areas. It turns out, as we might expect, that the empirical likelihood approach really shows its colours either when we have a small number of areas, so that the normality assumptions may not hold, or when we have these outliers.

11:32

When we had this 20 percent of outliers, our Bayesian empirical likelihood approaches seemed to outperform the parametric approaches. So this has promise for our atlas. Now, that's fine in terms of being able to model. But one of our concerns still in this is the issue that we needed to aggregate the data. Now, Australia is made up of a number of states.

11:59

Each of those states holds its data, and we needed to have agreements with each of the states in order to bring those data together and then create the models. This has to happen regularly, and if we want to change anything or obtain new data, we need to go through those agreements again. There are very real issues of privacy, and there are also issues of obtaining permission for these data. The methods that we have at the moment don't address those issues.

12:35

They certainly address the privacy issue in terms of providing the information at the small area level. But we still need to aggregate the data into one database. Our second case study here is the Virtual Reef Diver project. The Great Barrier Reef is one of the world's treasures. It's 2,300 kilometres long and there is monitoring at certain sites along the reef. But there is a lot of area where there's no monitoring.

13:06

So what can we do in those areas to help us develop accurate and useful predictive models across the whole of the reef? Well, there are underwater drones, there is satellite imagery, and there are numbers of divers who are out there collecting information. So we built an online interactive reef. It's called Virtual Reef Diver. We are inviting organisations and citizens to contribute and geotag their photos, their images of the reef.

13:43

And we are also asking citizens to go into the images and annotate them. So if you are interested in contributing to the health of the Great Barrier Reef, please visit virtualreef.org.au. What we have then are these underwater images of the reef, which you can download and then also classify. We ask citizens to annotate the circles here as to whether these are coral, algae, sand and so on.

14:20

And we can use that information to improve our statistical models. There's a huge impact from this work. This project was selected as one of the three shortlisted for the Eureka Prize here in Australia last year, which was very exciting for us. So we have citizen scientists who can annotate the images, and we can use those annotations in their own right. But we can also use them to test automated image classification, so machine learning and statistical methods.

14:55

You can see there the list of all kinds of statistical methods that we've looked at for classifying images in the environmental space, and also in the medical space. Now, the last of these is matrix factorisation, and I'd like to concentrate on that for a moment. In particular, we've been looking at kernelised sparse Bayesian matrix factorisation. The aim here is to extract low-rank or sparse structures,

15:25

for example, the classes in our image, and we're going to use matrix factorisation techniques to do it. So we have data, which is Y, an M by N matrix, like an image. This can be the values at each of the pixels in the image. And we want to recover the low-rank matrix X, where we have Y = X + E. This X is then going to be decomposed into two factors,

15:51

U and V. These factors U and V are going to be much smaller than the original matrix X. So if we have X being M by N, then we have U being M by R and V being N by R. When we put this together, the R here is much smaller than M or N, and this induces sparsity in this low-rank approach. Now, we can also use our priors to induce sparsity.
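As a small illustration of why the factorisation is economical, the sketch below forms X = U V^T from two thin factors and compares the number of stored values. It assumes the product form X = U V^T, which is the usual convention for an M by R and an N by R factor; the function names and numbers are hypothetical.

```python
def low_rank_product(U, V):
    """Return X = U V^T for U (M x R) and V (N x R), as nested lists."""
    rank = len(U[0])
    if any(len(row) != rank for row in U + V):
        raise ValueError("every row of U and V must have R entries")
    return [[sum(u[r] * v[r] for r in range(rank)) for v in V] for u in U]


def storage_cost(m, n, rank):
    """Values stored for the full matrix versus its two factors."""
    return m * n, rank * (m + n)


# A rank-one example: U is 3 x 1 and V is 2 x 1, so X is 3 x 2.
U = [[1], [2], [3]]
V = [[4], [5]]
X = low_rank_product(U, V)

full, factored = storage_cost(1000, 1000, 10)
```

For a 1000 by 1000 image with R = 10, the factors hold 20,000 values instead of 1,000,000.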

16:35

So we have Gaussian priors for the columns of U and V, as you can see here. One of the features of this approach is that we have the same parameter, gamma_j, which appears in the variance for both U and V. We can then control that gamma_j to induce the sparsity. The prior on the gamma_j is a gamma distribution, and the hyperparameters of that prior distribution are going to be small.
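The structure described so far can be summarised as follows. This is a sketch under assumptions: the slide is not available, so treating gamma_j as a precision shared by the j-th columns of U and V, and the hyperparameter names a and b, are choices made here for illustration.

```latex
\begin{aligned}
Y &= X + E, \qquad X = U V^{\top}, \qquad
U \in \mathbb{R}^{M \times R},\; V \in \mathbb{R}^{N \times R},\; R \ll \min(M, N), \\
u_{\cdot j} \mid \gamma_j &\sim \mathcal{N}\!\left(0,\; \gamma_j^{-1} I_M\right), \qquad
v_{\cdot j} \mid \gamma_j \sim \mathcal{N}\!\left(0,\; \gamma_j^{-1} I_N\right), \qquad j = 1, \dots, R, \\
\gamma_j &\sim \mathrm{Gamma}(a, b), \qquad a,\, b \text{ small}.
\end{aligned}
```

Under this reading, a large gamma_j shrinks the j-th columns of both factors towards zero together, which is how a shared parameter can switch off unneeded components.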

17:08

This is the way that we can really induce the sparsity in U and V. We couple U with a kernel matrix K, and that gives us a latent matrix G. This G then has a prior, as you can see here. We also have Jeffreys priors on the other parameters here. We can do the same kind of prior for V as well. So just to complete the model, we then have a residual term, and we have a prior on that as well.

17:42

But we're focussing really on that U and V here. Just to show you a graphical model of what we're doing: Y is now going to be governed by g_m and h_n, where g_m is a combination of K_U and u_m, and h_n is a combination of K_V and v_n. And then there are the hyperparameters on those as well. So we're going to induce the sparsity through these priors. We have conditional and joint distributions.

18:19

We have a conditional distribution for our observational model, which is going to have the G and the H terms in it. And we then have a joint distribution, which, of course, is going to be the product of each of these individual distributions. We can then use variational Bayes to undertake the analysis here.

18:45

Now, the choice of kernel. There are many different kinds of kernels, and for images we want one that incorporates some similarity of information between patches. What we want to be able to do is patch group matrix factorisation. We want areas that are similar to be in the same group, to stay together, effectively. So what we do is look at the Euclidean distance between a pair of patches.

19:15

This is the d. And we're going to define the similarity between the pair of patches in terms of that distance d. So we have a K, as you can see here, being a function of that Euclidean distance between the pair of patches. A pixel and its nearest neighbours are then going to be modelled as a column vector.
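A minimal sketch of a distance-based patch similarity. The talk only says that K is a function of the Euclidean distance d between two patches; the Gaussian form exp(-d^2 / h^2), the bandwidth h and the function names below are assumptions chosen for illustration.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two patches stored as flat lists of pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def patch_similarity(p, q, bandwidth=1.0):
    """Assumed Gaussian kernel: similarity decays with squared distance."""
    d = euclidean_distance(p, q)
    return math.exp(-(d * d) / (bandwidth ** 2))


def kernel_matrix(patches, bandwidth=1.0):
    """Pairwise similarities between all patches."""
    return [[patch_similarity(p, q, bandwidth) for q in patches] for p in patches]


# Three hypothetical 2 x 2 patches, flattened to column vectors.
patches = [[0, 0, 0, 0], [0, 0, 0, 1], [5, 5, 5, 5]]
K = kernel_matrix(patches)
```

Patches 0 and 1 differ in one pixel and get a high similarity, while patch 2 is far from both and gets a similarity close to zero.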

19:40

And we can construct this M by N patch group matrix Y by grouping the other patches with similar local spatial structures to the underlying one within a local window. So we're bringing together the common information in what we call a patch group. And since the patches in each group have so much in common, we can induce the low-rank sparsity that we wish. The overall algorithm then is to cluster the patches with similar spatial structure to form a patch matrix.

20:16

And then we can apply our matrix factorisation approach in succession on each of these patch matrices. And then we can aggregate the patches to reconstruct the whole image. So this is one approach to being able to deal with a large image in order to be able to understand the low rank of the structures in that image. So when we look at virtual reef diver, for example, we might want image restoration, we might want classification of components like coral and so on in the image.
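The split-and-reassemble part of this algorithm can be sketched as below. The clustering and matrix factorisation steps are deliberately left out (the patches are passed through unchanged), so this only illustrates how overlapping patches are extracted and then averaged back into an image. The stride of one and the averaging rule are assumptions, not details given in the talk.

```python
def extract_patches(image, size):
    """All size x size patches (stride 1), each with its top-left position."""
    rows, cols = len(image), len(image[0])
    patches = []
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            patch = [image[r + dr][c:c + size] for dr in range(size)]
            patches.append(((r, c), patch))
    return patches


def aggregate_patches(patches, rows, cols):
    """Rebuild the image by averaging every patch value that covers a pixel."""
    total = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for (r, c), patch in patches:
        for dr, row in enumerate(patch):
            for dc, value in enumerate(row):
                total[r + dr][c + dc] += value
                count[r + dr][c + dc] += 1
    return [[total[i][j] / count[i][j] for j in range(cols)] for i in range(rows)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
patches = extract_patches(image, size=2)
# In the full method each group of similar patches would be denoised by
# matrix factorisation here; this sketch passes the patches through as-is.
restored = aggregate_patches(patches, rows=3, cols=4)
```

Because nothing is changed between the two steps, the round trip returns the original image, which is a useful check before a real denoising step is inserted in the middle.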

20:53

And you can see here the utility of this approach. We have an original image. We've added noise to that image, and then we can recover, or resurrect, that image using this method. So the method allows for integration of side information through the patches, and it infers parameters and latent variables, including the reduced rank, using variational Bayesian inference.

21:23

And we get this low rank through sparsity induced by an enforced constraint on those latent factor matrices. We can show that this improves on some of the state-of-the-art approaches for image restoration tasks. So that's great, except we still have the issue of not wanting to aggregate the data. In this case, we have two issues. One is that these images are quite large.

21:54

If we're going to create a single database to analyse the images, that's going to become very unwieldy very quickly. Even analysing a single image is difficult, because we've already seen how we break it into patches and then need to do the analysis on the patches. The second aspect is that we have citizens contributing information. Now, again, because of the provenance and the data ownership laws in the UK, in Europe and in the US.

22:30

And coming into Australia, we're going to be really restricted soon about how we can actually deal with individual people's data. So we want to think about ways that we can avoid creating one database to rule them all. So what are our options? Well, our single database approach says that we put all the data in a single database and then we do the analysis on that.

23:00

We can also look at distributed computing. Here we have a number of databases and a centralised computer, a centralised source, that distributes the commands to the different databases and then receives the information back. This kind of distributed computing is very useful.
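A toy version of this pattern: each database computes a small summary locally, and the central node combines the summaries without ever seeing the raw records. The node names, the data and the choice of mean and variance as the target are hypothetical; the point is only that some statistics can be recovered exactly from local sufficient statistics.

```python
import statistics


def local_summary(values):
    """Run at each database: count, sum and sum of squares only."""
    return len(values), sum(values), sum(v * v for v in values)


def combine_summaries(summaries):
    """Run at the central node: pooled mean and sample variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical data held separately at three nodes.
nodes = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0], "C": [6.0]}
summaries = [local_summary(values) for values in nodes.values()]
pooled_mean, pooled_variance = combine_summaries(summaries)
```

Only three numbers leave each node, yet the central node obtains the same mean and variance it would have computed from the single aggregated database.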

23:31

We have horizontal scalability, which means that we can add more databases or more nodes to this distributed system and obtain greater capacity by doing so. Of course, that comes at a cost, because now we have computational interchange, and that overhead may be quite large in some cases. So there is a sort of trade-off. There's a utility in having vertical scalability,

24:02

in other words, a larger database, and then after a point we get more utility by having this horizontal scalability. There's a huge amount of work on distributed computing. There are whole areas in computer science and in statistics that are focussed on distributed computing. We know that we have greater fault tolerance in this case: if one node falls over, we have the other nodes. There's low latency in this.

24:33

And we can also use methods like sharding and so on, and I'll show an example of that in a minute. This distributed computing system has motivated approaches like MapReduce, Apache Spark, Hadoop and so on. So those are two of the approaches. Another one is decentralised computing. Whereas our distributed computing had a central node that governed the information going out and back,

25:02

decentralised computing doesn't do that. It says the information is processed in the cloud and there is no one actor that owns that information. This has led to new work on distributed apps and has also led to edge computing, and many of you will be familiar with this.

25:23

But I'll just explain it quickly. Edge computing means that we don't process the information in the cloud, filtered through these remote data centres, but instead do the analysis on the individual nodes, and the results are then aggregated as needed. This has led to areas like, for example, federated learning and federated analysis.

25:53

There are lots of advantages in this, because this is very much the area that we would like to get into with the two case studies that I showed. In the first one, the Australian Cancer Atlas, we actually have these different states that would like to retain their data, and we would like to be able to analyse the data in situ and then bring it together.

26:18

We would also like, in the Virtual Reef Diver case, to be able to analyse the images in a more decentralised manner. And when we come to integrating personal information, we would like to come up with systems where we can retain the provenance and the individual privacy of citizens, for example, and of subjects in a broader sense.

26:47

So if we go back to our Australian Cancer Atlas and look at a single database, we ask the question of how we retain privacy when we have the data at this small area level. One option is that we actually analyse it at that small area level, and then we're still in the business of a single database. So we've been looking at, for example, summary data analysis that we might perform on the actual modelled estimates at the small area level.

27:18

A natural way that we might think about this is through a hierarchical meta-analysis, where we have our modelled log SIR for each of the SA2s, the small areas, and we might be interested in remoteness: cities, regional and remote areas. We then have an associated standard deviation for that log SIR. That's actually part of the information that's provided by the underlying Bayesian spatial model that I described before.

27:49

So we can set up a hierarchical meta-analysis in the following way. The y_ij, for a particular area i within remoteness region j (the cities, regional and remote areas), is normally distributed with mean mu_ij. We imagine that there's some measurement error or some uncertainty around the y_ij that's described by the variance sigma^2_ij. The mu_ij is then the true log SIR for a particular area in a particular remoteness region.

28:27

Those mu_ij are going to have an overall value of theta_j, and those theta_j for each of the remoteness regions, if we wish, can also be considered to be draws from an overall log SIR, which is described by theta_0. So we have our individual estimates of the log SIR for each of the areas, and they are drawn from an overall distribution of log SIRs for that particular remoteness region.

29:03

And then the means of those remoteness regions come from some overall log SIR that governs the whole structure. The sigma^2_ij can then be related to our observed standard deviations s_ij through the usual chi-squared association. We can also add covariates to this, as you can see in the middle panel here.
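Written out, the three-level structure described here might look as follows. The symbols y_ij, mu_ij, sigma^2_ij, theta_j, theta_0 and s_ij come from the talk; the between-area and between-region variances (tau_j^2 and tau_0^2) and the degrees of freedom nu in the chi-squared relation are not stated and are introduced here as assumptions.

```latex
\begin{aligned}
y_{ij} \mid \mu_{ij} &\sim \mathcal{N}\!\left(\mu_{ij},\, \sigma_{ij}^2\right)
  && \text{modelled log SIR for area } i \text{ in remoteness region } j, \\
\mu_{ij} \mid \theta_j &\sim \mathcal{N}\!\left(\theta_j,\, \tau_j^2\right)
  && \text{true log SIR for the area}, \\
\theta_j \mid \theta_0 &\sim \mathcal{N}\!\left(\theta_0,\, \tau_0^2\right)
  && \text{mean log SIR for the remoteness region}, \\
\frac{\nu\, s_{ij}^2}{\sigma_{ij}^2} &\sim \chi^2_{\nu}
  && \text{linking the observed } s_{ij} \text{ to } \sigma_{ij}^2 .
\end{aligned}
```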

29:42

This is a common approach in meta-analysis, to add covariates. And we can then look at some new work, which is our spatial Bayesian empirical likelihood approach in a meta-analysis context. That's in the right panel there. So this is one approach that we might use for modelling the summary data, and we overcome privacy issues here.

30:10

Areas, or different states, could then also be more encouraged to provide information at this small area level, and we could perform analyses like this. We can show that if we do this, we get quite similar results in these overall estimates for the remoteness regions: major cities, regional and remote. We get here posterior means, 95 percent credible intervals and the probability that one region is greater than another region.

30:43

Those are the natural outputs from our Bayesian models. We can also see that if we go to the individual SA2s and look at those estimates, we are getting differences between when we analyse the data at the individual level and when we analyse it at the SA2 level, this aggregate level, as anticipated.

31:06

We can also see that our empirical likelihood approach is going to give us quite a larger interval, or spread, compared to our parametric approach, as anticipated, because the data are quite moderate and don't have the kinds of outliers that we were talking about before. So centralised summary analysis has some benefits and some drawbacks. The benefits are that we can preserve privacy, because we have this aggregate approach.

31:38

It's a simple model. It's computationally straightforward and fast. But the inferential capability is challenging, because we're modelling at the small area level, and therefore we have to be careful to control for biases in this case. We also have some questions about what we can say about covariates. For example, is the spatial distribution of our response consistent with the spatial distribution patterns of our covariates?

32:10

We have to be careful about what kinds of inferences we can draw, and those inferences will change as we change the scale of the data. We can still scale this to different states and countries in a hierarchical way, and you can imagine how this would go: we add a hierarchy to those models that we've already described. We come now to distributed computing. In distributed computing, we're starting to think about how we don't have to create a single database.

32:40

I've just put this slide up because this was a BIRS workshop in Canada. I wasn't at this workshop, unfortunately, but I just wanted to show you here some of the topics that were covered even in 2008. This was about developments in statistical theory and methods based on distributed computing. As I said, this is a huge area in computer science, but there's a substantial amount of work in the statistical community as well.

33:10

You can see here approaches that focus on statistical methods with computational efficiency, with statistical properties, with guarantees about the estimates. These things are really important for these kinds of approaches: robustness, the divide-and-conquer algorithms that really underpin a lot of the work in this area, and rates of convergence for our estimators. There is mathematical theory for deep convolutional neural networks and distributed learning.

33:40

There is a lot of work on distributed neural networks and, as I said, divide-and-conquer methods, for example for correlated data and spatial kriging, which is of interest for us in our spatial analysis. So you can see here that there are these kinds of issues in distributed computing that really focus on the statistical underpinnings of these particular methods.

34:08

And again, just pointing out that one of the main approaches here is one that many of us are very familiar with, the divide-and-conquer algorithms. Just as an example of some of this work, we have distributed Bayesian inference, for example Wang and Dunson in this 2015 work. Here we're looking at decomposing the global posterior.

34:38

We take the global posterior distribution and break it up into a product of subsample posteriors. That's the second equation that you can see there, which is just conditional on the z_j instead of all of the data Z, so that we have this prior being raised to the power 1/K. And we have this term which is a normalising constant. But really what we're doing here is that we've got these subsamples.
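The decomposition being described is commonly written as below, assuming the K subsets are conditionally independent given the parameter. The symbol C_j for the normalising constant is an assumption; the talk names the constant but the transcript does not preserve its symbol.

```latex
\begin{aligned}
p(\theta \mid Z) &\propto p(\theta) \prod_{j=1}^{K} p(z_j \mid \theta)
  = \prod_{j=1}^{K} \left[\, p(\theta)^{1/K}\, p(z_j \mid \theta) \right], \\
p_j(\theta \mid z_j) &= \frac{1}{C_j}\; p(\theta)^{1/K}\, p(z_j \mid \theta),
\qquad\text{so that}\qquad
p(\theta \mid Z) \propto \prod_{j=1}^{K} p_j(\theta \mid z_j).
\end{aligned}
```

Raising the prior to the power 1/K ensures that it is counted once in total, rather than K times, when the subsample posteriors are multiplied back together.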

35:10

We can estimate the posterior distributions from those subsamples, bring those together and weight them, so it's like a weighted combination of the likelihoods. This approach really motivates the MapReduce approach. We run separate Markov chains in parallel on different machines based on the local data. We transmit these local posterior draws back to a central node.

35:40

And then we combine those draws to form an approximation, or a surrogate likelihood, that approximates our true global likelihood. This kind of approach, as I said, is familiar to us. There's been a huge amount of work over the last 10 years in this area. So if we come back to Virtual Reef Diver, we can think about how this kind of approach might work in our case. An example I'd like to talk through with you is where we were looking at crowdsourcing by Mechanical Turk.

36:15

We wanted lots of citizens to classify our images, and so we used Amazon Mechanical Turk to recruit people. This is an online system where people, often for pay, will undertake tasks. Here we asked them to classify our images. Each person was assigned up to 40 images. They were asked to classify points, and we paid them for this. We got thousands of people who classified these images.

36:50

And, of course, we then needed to do something with the data. One of the issues is that we need to take into account the ability of the citizens who are analysing these data, and when we're looking at their ability we have to take into account the difficulty of the images that we asked them to annotate. A Rasch model is a common sort of model to use in this case.

37:18

This is the item response model, and a three-parameter logistic model is useful for us. So we expanded this model to say, all right, we'll use the three-parameter logistic model, we'll add spatially dependent item difficulties to this model, and then we'll also expand it.

Yes, you're back. You left us for a little while, and it's no longer screen sharing. Right, OK. Great. Thank you. OK, so. Thank you. All right.

38:50

So we want to be able to analyse the data from the citizens, and we can use this three-parameter logistic model that has a spatial term in it. The spatial term is going to be based on the images, and says that images in a particular area are going to be similar in terms of their difficulty. This kind of model then allows us to identify the latent ability of each of the users, adjust for the difficulty of the images, and also allow for a discrimination parameter.

39:24

That's how quickly the probabilities change. It also accounts for guessing, or pseudo-guessing, in this case. What we want to do here is to identify the abilities of each of the citizens and adjust for that. If we do that, I'll just show you here that we can estimate the ability, and this plot here shows that we did quite well at that. So we have some test data, and we can assess their ability.
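The three-parameter logistic response probability can be sketched as follows. The core formula is the standard 3PL item response function; adding a spatial effect to a baseline difficulty is an assumed way of expressing the spatially dependent difficulties mentioned in the talk, and all names and values are hypothetical.

```python
import math


def three_pl_probability(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct classification under a 3PL model."""
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic


def image_difficulty(baseline, spatial_effect):
    """Assumed additive form: images in the same area share a spatial effect."""
    return baseline + spatial_effect


# A user whose ability exactly matches the image difficulty.
difficulty = image_difficulty(baseline=0.5, spatial_effect=0.3)
p_matched = three_pl_probability(0.8, difficulty, discrimination=1.5, guessing=0.2)

# A much weaker user falls back to roughly the guessing rate.
p_weak = three_pl_probability(-20.0, difficulty, discrimination=1.5, guessing=0.2)
```

When ability equals difficulty the logistic part is one half, so the probability sits midway between the guessing rate and one; the discrimination parameter controls how quickly it moves away from that point.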

40:00

In a small sample of the cases, we looked at their estimated ability, and we can pretty much get that right with this approach. So this means that we can take only the people who are doing a good job, who have high ability, and we can also identify training needs or personalise the training for our citizens. So the option here, when we have all of this information, is again a divide-and-recombine approach.

40:26

If we were to use a common divide-and-recombine approach, we'd split the data into multiple shards or subsets. We'd fit the model to independent subsets on independent machines, and we'd combine those posterior estimates into a global estimate using some sort of consensus Monte Carlo, with weighted averages of the posterior MCMC chains, as I described.
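The recombination step can be sketched for a single scalar parameter as follows. This is a minimal sketch under stated assumptions: the function name and toy draws are made up, and the weights are taken to be the inverse of each shard's sample variance, which is one commonly used choice in consensus Monte Carlo rather than something specified in the talk.

```python
import statistics


def consensus_combine(shard_draws):
    """Combine per-shard posterior draws of one scalar parameter.

    shard_draws[s][g] is MCMC draw g from shard s. Draw g of the combined
    chain is the weighted average of draw g across shards, with each shard
    weighted by the inverse of its sample variance (an assumed choice).
    """
    weights = [1.0 / statistics.variance(draws) for draws in shard_draws]
    total_weight = sum(weights)
    n_draws = min(len(draws) for draws in shard_draws)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total_weight
        for g in range(n_draws)
    ]


# Toy draws from two shards (made-up numbers with equal spread, so equal weights).
combined = consensus_combine([[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]])
```

Because the two toy shards have the same variance, each combined draw is just the midpoint of the corresponding shard draws. With unequal variances the more precise shard would dominate.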

40:52

So then an alternative is to divide the users into ten equal groups with respect to the number of classifications that they did. This works very well in our case, because what we end up with is about 10 shards, with about half a million classifications per shard. Then we fit the shards in parallel and combine them using some stratified sampling approaches. We found that this kind of alternative to the usual divide and recombine works really well here.

41:22

So, the benefits and drawbacks of this distributed computing approach. On the benefits side, we have efficient computing, and we avoid repetition of tasks because we have this distributed setup. We have a wealth of tools to do this now. We have privacy preservation in some sense, because we have the distributed analysis, and we can also increase inferential power by combining the datasets. On the other hand, we do have the potential for data leaks.

41:55

We can have possibly biased and variable inferences if we use just the naive approaches. It's usually limited to point estimates; it's more difficult to get uncertainty estimates, although that's changing. And we still have some issues about time and degeneracy: we have to be careful of degeneracy of the weights that we have on the partial likelihoods, or the local likelihoods. There's a lot of work on avoiding or addressing those issues.

42:30

So I'd like to come now to the approach that we're investigating currently, which is around federated learning. This is the next step out. It is a really decentralised approach, where we want to analyse the data without the data leaving the source. This can help us to avoid the ethical, legal, political, administrative and computational barriers to combining data from multiple sources. It increases control for the owners of the data.

43:03

There is a potential to improve data quality and timeliness, because the data will be managed by the data owners. And we also have increased inferential capability by bringing together small or dispersed datasets. The way that federated analysis works is that we have groups: the parties, the manager, and the communication and computation framework. Then we have the components. We want to partition our data. We have a model.

43:36

We have a privacy mechanism and the communication architecture: how are we going to get all these actors to talk to each other? And then we also have the modelling approaches, and this is where we can really play a role, together with the computer scientists and the information technology people, in making this federated approach. Modelling approaches range from deep neural networks and decision trees to regressions, support vector machines and so on.

44:08

You'll see the image of the federated analysis. Below that, I have a figure of the types of datasets that we might have. We might have data that are split, or partitioned, horizontally. So each of the local nodes has all of the variables, the Ys and the Xs that are required for the analysis; it's just that they have subsets of the whole dataset. The other option is that we have data that are split vertically.

44:43

So one node holds one of the Xs, another node holds another X, and another one holds the Y, and then we want to bring all of these data together. So there are different ways that the data are partitioned. Now, there are a lot of approaches for federated analysis, but this is really just in the last few years. There are some commentaries that are useful to get an overview of this.
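The difference between horizontal and vertical partitioning can be made concrete with a small sketch. The records, column names and helper functions below are made up for illustration; in particular, the shared `id` key used to line up vertically partitioned rows is an assumption.

```python
def partition_horizontally(rows, n_nodes):
    """Each node gets a subset of the rows but every column (all Xs and the Y)."""
    return [rows[i::n_nodes] for i in range(n_nodes)]


def partition_vertically(rows, column_groups, key="id"):
    """Each node gets every row but only some columns, plus a shared record key."""
    return [
        [{key: row[key], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]


# Made-up records with two covariates (x1, x2) and one response (y).
records = [
    {"id": 1, "x1": 0.5, "x2": 1.2, "y": 0},
    {"id": 2, "x1": 1.5, "x2": 0.7, "y": 1},
    {"id": 3, "x1": 2.5, "x2": 0.1, "y": 1},
    {"id": 4, "x1": 3.5, "x2": 2.4, "y": 0},
]

horizontal_nodes = partition_horizontally(records, n_nodes=2)
vertical_nodes = partition_vertically(records, column_groups=[["x1"], ["x2"], ["y"]])
```

In the horizontal case every node could fit the model on its own subset, whereas in the vertical case no single node holds both the covariates and the response, which is why the two settings call for different federated algorithms.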

45:06

There's the idea of having a repository with data sharing agreements, and these repositories are very common now. There's a whole lot of work on developing these. But the question is then about doing the analysis once we have these repositories. We also have Bayesian networks as a way of being able to include data from different areas, and this is used, for example, in this example for survival prediction.

45:34

There are neural networks too, a lot of work on neural networks, and then sequential and hierarchical Bayesian models. Now we're starting to get into correlated data: time series data and also spatial data. And there's also some very nice work on communication-efficient approaches; I'll talk about those briefly in a minute. So for the federated sharing of genomic datasets we have these; genetic datasets in this case are an example of this.

46:05

This is federated sharing in an internal registry sense here. So this is really going to be controlled: each group controls its own data. They can dynamically add or drop instances from the group, and they can make agreements about the group. So there's quite a lot of autonomy in the way that these groups might engage with the overall registry. This example is from Jordan and colleagues.

46:43

This work started around 2015 or 2016, when the arXiv paper appeared, and then it was published in 2018, and there have been extensions of the work since. Basically, the way that this works is, again, a surrogate likelihood approach. We start off by initialising theta-nought. Then we're going to transmit that value, the current theta, to the local machines.

47:11

We're going to now compute the local gradient of the likelihood at each of the machines, and we're going to transmit that from machine M_j to machine M_1. Then we're going to calculate a global gradient on machine M_1, and we're going to form the surrogate function there. And then we're going to do one of two things.

47:36

We either update the thetas directly, or we use some sort of one-step quadratic approximation to get our new values of the thetas. Then we do the whole thing again: we transmit those, update the gradients, bring those gradients back in, combine the gradients, compute the likelihood and so on. So it's a very computationally efficient way of using surrogate likelihoods through the gradient.
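The loop described above can be sketched in a few lines for the one-step quadratic variant. This is a toy sketch under stated assumptions, not the authors' implementation: it uses a one-parameter logistic regression without an intercept, made-up data on three "machines", and hypothetical function names. Only gradients travel between machines; the curvature comes from machine 1's own data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(shard, theta):
    """Gradient of the average negative log-likelihood on one machine."""
    return sum((sigmoid(theta * x) - y) * x for x, y in shard) / len(shard)


def local_curvature(shard, theta):
    """Second derivative of the average negative log-likelihood on one machine."""
    return sum(
        sigmoid(theta * x) * (1.0 - sigmoid(theta * x)) * x * x for x, _ in shard
    ) / len(shard)


def global_gradient(shards, theta):
    """Machine 1 combines the transmitted local gradients, weighted by shard size."""
    n_total = sum(len(shard) for shard in shards)
    return sum(len(shard) * local_gradient(shard, theta) for shard in shards) / n_total


def one_step_surrogate_updates(shards, theta=0.0, n_rounds=30):
    for _ in range(n_rounds):
        # Every machine evaluates its gradient at the current theta and sends it in.
        grad = global_gradient(shards, theta)
        # Machine 1 takes a one-step quadratic update using only its own curvature.
        theta = theta - grad / local_curvature(shards[0], theta)
    return theta


# Made-up (x, y) pairs held on three machines; shards[0] plays the role of machine 1.
shards = [
    [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)],
    [(-1.5, 0), (-1.0, 1), (-0.3, 0), (0.4, 1), (1.2, 1), (1.8, 1)],
    [(-1.8, 0), (-0.8, 0), (-0.2, 0), (0.3, 1), (0.9, 0), (1.6, 1)],
]
theta_hat = one_step_surrogate_updates(shards)
```

The design point is the communication cost: each round moves one number per machine (one vector in higher dimensions) rather than the data or a full set of posterior draws, and the iteration settles where the global gradient is zero.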

48:08

So there are Bayesian approaches to this. I won't go through them, but it's really a similar sort of thing, where now we're not going to be transmitting the posterior draws anymore, which is computationally intensive; we're going to be transmitting these gradients. And for logistic regression of images, in this paper by Jordan et al., we can see from the plot here that this approach really gives us a much better classification

48:42

rate compared to the usual approach. I'm going to skip this example, other than to say this is a spatial modelling approach, which is very useful, and one of the keys to future approaches is that it actually takes into account the strongly correlated data and the spatial structure that we have here. Am I going over time, Crystal? Oh, yeah. If you could finish relatively soon, then we'll have time for some questions, because we've got a few in the chat.

49:18

So, coming back to our Atlas of Cancer: we can combine cancer datasets across states and countries. That's what we would like to do. And so we partnered with a group, the Netherlands cancer agency. They've been doing some excellent work. About this time last year, I was in the Netherlands to talk to them about this, and they've done some excellent work in the meantime on a collaboration called EUROCARE.

49:46

So this is open source federated analysis, which really demonstrates that this can be done and achieved. If any of you are interested in this, I would recommend checking this site out. The site is vantage6, which is an open source, privacy-preserving federated learning infrastructure. They've got some great examples of combining data from the Netherlands and Taiwan, combining data from different agencies, and then combining data from the Netherlands and Italy.

50:20

And so there are different kinds of federated approaches. You can see on the right of the screen an approach where it talks about how these individual nodes work, if people are interested. You can have a federated generalised linear model approach, where you calculate the MLE for beta via Fisher scoring. What you do is calculate the individual terms at each of the nodes. You then aggregate them at the global node.

50:59

You get a beta out of that, and then the beta goes back to the individual nodes, and so on. So the original version of EUROCARE was where you had each of the individual countries all providing information to a single source, and then you got your estimates. But the federated version is that the data stay where they are, and this approach works as I described before.
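The federated Fisher scoring loop described above can be sketched for a logistic regression with an intercept and one covariate. This is a minimal sketch under stated assumptions: the data, node layout and function names are made up, and it does not use the vantage6 API. Each node returns only its score vector and information matrix; the global node sums them, solves for the update, and sends the new beta back.

```python
import math


def local_summaries(node_rows, beta):
    """Run at a data node: score vector and Fisher information at the current beta."""
    score = [0.0, 0.0]
    info = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in node_rows:
        design = (1.0, x)  # intercept and one covariate
        mu = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))
        weight = mu * (1.0 - mu)
        for i in range(2):
            score[i] += (y - mu) * design[i]
            for j in range(2):
                info[i][j] += weight * design[i] * design[j]
    return score, info


def federated_fisher_scoring(nodes, n_rounds=25):
    """Run at the global node: aggregate node summaries and update beta."""
    beta = [0.0, 0.0]
    for _ in range(n_rounds):
        summaries = [local_summaries(rows, beta) for rows in nodes]
        score = [sum(s[i] for s, _ in summaries) for i in range(2)]
        info = [[sum(m[i][j] for _, m in summaries) for j in range(2)] for i in range(2)]
        # Solve the 2x2 system info * step = score by Cramer's rule.
        det = info[0][0] * info[1][1] - info[0][1] * info[1][0]
        step = [
            (info[1][1] * score[0] - info[0][1] * score[1]) / det,
            (info[0][0] * score[1] - info[1][0] * score[0]) / det,
        ]
        beta = [beta[0] + step[0], beta[1] + step[1]]
    return beta


# Made-up (x, y) pairs held at three nodes, e.g. three registries.
node_a = [(-2.0, 0), (-1.0, 0), (-0.5, 1), (0.5, 0), (1.0, 1), (2.0, 1)]
node_b = [(-1.5, 0), (-1.0, 1), (-0.3, 0), (0.4, 1), (1.2, 1), (1.8, 1)]
node_c = [(-1.8, 0), (-0.8, 0), (-0.2, 0), (0.3, 1), (0.9, 0), (1.6, 1)]

federated_beta = federated_fisher_scoring([node_a, node_b, node_c])
pooled_beta = federated_fisher_scoring([node_a + node_b + node_c])
```

Because the score and information are sums over records, summing the node-level pieces gives the same update as pooling all the data in one place, so the federated and pooled estimates agree up to floating-point rounding.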

51:28

So you get these individual values and then you get your estimates, and they show that the estimates are relatively the same. For the Virtual Reef Diver, then, we've done a similar thing with federated analysis, here with federated matrix factorisation. We've taken the matrix factorisation approach that I showed before, we've included an adaptive learning rate, and we've used stochastic gradient descent as the approach here.

51:56

And so we've effectively decomposed the stochastic gradient descent approach so that we can use it in this federated approach. What we get is a dynamic method. It has privacy protection, it has learning efficiency, it significantly reduces the training time, and it maintains high predictive accuracy. I won't go through the details of this, but you're welcome to talk to me afterwards, or you'll see it from the slides.

52:28

We do show that this has some strong advantages in the federated approach. So, in summary, then, our data are changing. We have size, privacy, provenance, quality and diversity issues. Federated analysis and federated learning offer some solutions, but we do need statistical methods, combined with their implementation. We've talked about two case studies and then looked at a number of approaches.

52:56

As I said, this is work in progress, and I'll be very happy for any comments and suggestions. I'll hand it back to you, Crystal. Thank you. Thank you very much. Unfortunately, we can't give you a round of applause, but, for the people here, I'll say thank you very much indeed. So we have some questions already in the chat. One of them is about what assumptions you've made, when you're analysing cancer data, about the surveillance.

53:27

So are you getting a fixed proportion of cases, or does that vary across areas? And then there was a bit of a follow-up along the same lines: how do you account for different data collection methods in the different datasets that you're bringing together? In the cancer data, we're fortunate in that this is a census, basically, because cancer is a notifiable disease, and we have a count. We have registries. Now, the registries aren't perfect, but they're very good.

53:59

And so we take these data as given. You're right, though, in terms of bringing together datasets in this federated approach that may be of different quality and also different representation. With those kinds of issues, as statisticians, we have some good ways of being able to deal with them. We have sampling approaches, and we have some sort of uncertainty that we might associate with different kinds of issues around sampling.

54:34

And so we can bring those to bear on this federated approach. I think it's really important for us, the statisticians, to be part of this move to these sorts of more decentralised analysis approaches. And also relating to federated learning and how things are moving: would you be able to do model checking or model learning in this context? How does that affect these approaches? Yeah.

55:07

I think this is a good question. It's the same as the whole model uncertainty issue in the distributed computing case. There are people at Oxford who have been working in this area a lot. In some ways, I think the federated approach actually might help us to understand more of that model robustness.

55:31

So if you think about this, we can actually think about what the surrogate likelihoods are. When we create these surrogate likelihoods from the local likelihoods, perhaps we can learn more about how robust our model is, taking into account the assumptions about the different nodes, and then ask: what are our likelihoods telling us?

55:53

It's almost like we have replicates of the experiment, if you want, depending on whether we have the vertical or the horizontal partitioning. But we do have an opportunity, I think, to learn more about our models, perhaps, in this situation when we set it up. OK. And then more of a political context question, which is: what would achieving spatial equality mean in the context of a cancer atlas?

56:21

So, in other words, is it that the worst performing areas are being raised to the level of the better performing areas? So, yeah. Well, that's what you would hope, that you would have that. And that's why we were looking at, for example, how many deaths we might avoid, or other kinds of measures like that: how much morbidity would we avoid if we were able to resolve that spatial inequality?

56:51

So typically, even in a hospital situation, you might say what we would like to do is to get to the 80 percent mark. So we would like everybody to be at least at that 80 percent mark. You know, it's one of those. I look forward to the day when everybody earns above the average wage. But that's pretty much what we're trying to aim for here. OK. And then there was a more technical question, which was about a specific parameter.

57:16

What is the impact of the choice of K sub u? For instance, why is this taken to the power of a fourth? Yes, so the power of a fourth was because of the distance metric that we used, and I'm happy to go through the details, so we can talk about that. That distance metric was the way that we created some of these shards. And then the K sub u was actually being developed on each of the shards.

57:51

So the parameter that was really inducing the sparsity of those vectors was the gamma. That was in the variances for both the U and the V terms. Thanks. And then I have a question about the people that you've got evaluating your images. When you're recruiting non-experts to do that, is it possible to somehow... I mean, if you have only a limited number for each person, it probably doesn't matter so much.

58:29

But if you were going forward with people over a longer time, is it possible to provide feedback to help improve the performance of individuals? I mean, I suppose you could smooth everybody in the same direction, which wouldn't necessarily help. But I just wondered whether or not you can have a sort of feedback to improve performance. Yes, definitely. So there are a couple of things. One is that there's a learning parameter in that equation that we have in the 3PL model.

59:01

And so people will learn as they go along as well. But we can certainly have feedback. In fact, what we're doing at the moment is creating an AI-to-VR package, so we create virtual reality and people can then annotate in it. That's AI as in artificial intelligence, so we can have questions and reminders and training actually appear in the virtual reality world. It's a pretty exciting moment, but that's a topic for another time.

59:37

OK. So one of the bits of feedback we got was what an inspiring talk this has been to finish the week with, which is perhaps a bit premature for many of our weeks; it depends on where you are and how close you are to the end of your week. But it's been great to hear from you. It's really interesting to have such specific examples that are both important and also very different, so you can see how it's being used in a number of contexts.

01:00:06

And that, I think, really helps us envision, you know, the wide range of other opportunities where this can be brought together. We're also in a particular context within the U.K. of just having gone through Brexit. And so that does change some of our data sharing relationships with other countries.

01:00:24

And so it would be important for us to think about these federated approaches and how they might be used to avoid running into complications with data sharing and with where different datasets are stored, so that we can still get the benefits of all the data that are out there without having to put them all in the same place. Yeah, that's true. I think it's a challenge that all of us are going to need to face and work on.

01:00:52

Right. We can certainly learn from each other. Well, thank you very much again for spending part of your evening with us. Thank you very much for the opportunity. It's a great honour to be able to speak, and especially for such a great cause and a great memory. So it's fantastic that you remember people in this way, and I'm proud and pleased to be part of it. Thank you. Thank you very much. And we still owe you a dinner.

01:01:27

So at some point in the future, you'll have to come and speak again if you're in the Oxford area. Oh, OK. It's a deal. Thank you. In a more traditional manner. OK. Thanks very much, Crystal. And thank you very much, everybody, for attending today.

Transcript source: Provided by creator in RSS feed: download file
