Strachey Lecture: Privacy-preserving analytics in, or out of, the cloud

00:03

Okay. Good afternoon, ladies and gentlemen. Let's begin. My name's Mike Wooldridge. I am Head of the Department of Computer Science and it's my very great pleasure to welcome you to the Hilary term Strachey Lecture. As always, when we begin the Strachey Lectures, I'd like to acknowledge the financial support of Oxford Asset Management, who sponsor these lectures.

00:26

And since their sponsorship we have been able to step up a gear in terms of the kind of lectures that we're able to offer. Getting into this lecture theatre is not in fact free. We have to pay for it. So being able to use fantastic facilities like this is only possible because of Oxford Asset Management's support. Nevertheless, we did get them in a feature film. I hope you've all watched AlphaGo, the movie.

00:53

If you haven't watched AlphaGo, the movie, you should go away and watch it. It's on Netflix, and the Strachey Lecture from exactly two years ago actually appears at the beginning of the movie. Jon, I'm sorry, I can't promise you're going to be in a feature film today. But nevertheless, it's my enormous pleasure to welcome Jon Crowcroft to give this lecture. Jon is the Marconi Professor of Communications Systems at the University of Cambridge, where he's a fellow of Wolfson College.

01:20

And it's difficult to summarise Jon's work briefly because he is just simply so tremendously active. But in terms of the UK, he is, I think it's safe to say, the UK's leading academic in the area of networks. He's been setting that agenda now for at least three decades, and has a tremendous track record of some of the early work in networks. So when we think about digital immigrants and digital natives, I'm a digital immigrant because I grew up in a world without the Internet.

01:52

My kids are digital natives because they grew up and it was just all around them. But Jon was a true digital pioneer. I mean, he was one of the people that was creating the protocols that just make the whole thing work. So he's a fellow of the Royal Society. I'm a Facebook friend of his, and one of my daily pleasures is seeing updates from Jon. They are the most entertaining updates that you can imagine on Facebook. And what can I tell you from following them?

02:19

From Facebook, I can tell you that he likes music and pubs, so you should really move to Oxford because you'd fit in really, really well. Jon, it's my great pleasure to welcome Jon Crowcroft to give the lecture. Thank you, Jon.

Okay. So settle down. I have 92 slides and we have about 50 minutes. And my mistake: I thought I had 2 hours, and then I thought I had an hour and a half. And on the train I realised I had an hour and 50 minutes. So keep up.

02:49

There will be a test at the end. So this is work in two projects, both between Cambridge and Imperial. One also has partners at Nottingham University and the other one has partners at the Turing Institute, where I spend half my time, in London, where they have pubs and music too. Although I'd say, yeah, Oxford's definitely ahead of Cambridge in both those stakes. So this is an area we've been mucking around with for a while, which is something that I think impinges on all of us.

03:20

So out there in the cloud, there are two broad classes of data. There's all the stuff that we voluntarily stick on Facebook or Twitter or Instagram, whatever, all of that. You know, it magically emerges from our mobile devices in our pocket and so on.

03:38

And then there's stuff held by people that have large bodies of curated data, like the Health Service, who may be working with DeepMind, or a financial institution trying to do fraud detection or, you know, figure out the next Facebook and so on.

03:55

So there's this large dataset over here, which is kind of curated and comes from some large organisation, and then there's this highly decentralised, sort of crowdsourced bunch of data over here that's then centralised by some of these agencies.

04:07

So what we've been staring at is the issue of privacy and the fact that all this data is being put somewhere for doing analytics on it, to, you know, maybe monetise it (maybe you are the product, for Zuckerberg), but maybe also to work out something that's a better diagnostic tool, predictive tool and so on, on health care data. So that could be making money as well, or something for social good.

04:31

Maybe figuring out, you know, what's the best investment in the future of universities in terms of training and education. Maybe it would be to keep our pension the way it is. Oops, political. Okay, so that's the last thing I'll say for now on that topic. So there are two pieces to this talk and I'll probably run out of time just at the end of the first piece. Apologies for that. The slides will be available, I think, if you're interested.

04:56

In fact, they're linked off my home page as well. So the second bit will be in there too. So the first bit is about privacy-preserving analytics in a centralised cloud. And this comes out of the work with Turing and, principally, with Peter Pietzuch's group in large-scale distributed systems at Imperial College London, and folks in Cambridge.

05:14

And what we're interested in is where we have this large amount of curated data. So it could be partners that we have worked with, like NHS Scotland: 1.6 million patient records with upwards of 10,000 variables kept about every patient. And this stuff is naturally, in some senses, centralised across data centres in hospitals. There are people in Oxford working on this stuff, doing some computer science here, and other folks doing cool things with this.

05:39

And then the other sort of large amount of centralised data is financial. HSBC is a partner in the Turing Institute. I think about 20% of the world's transactions go through their systems, and a large fraction of UK transactions, and they're interested in, you know, predicting what's going to happen in tomorrow's book for trading, but also detecting fraud, because they're required to look at it and so on.

05:57

So there are motives for putting their private data centres' data into the cloud, which are effectively a cost-saving exercise. And in fact there are things they may need to do that they can't afford. I was surprised that HSBC said that actually some of the things they're going to be required to do for detection are actually beyond their capability,

06:13

not just in terms of number of people, but, you know, in computation. But if you work out the cost of renting those resources on servers, then it comes out somewhat better because of a number of huge scale-up properties, like amortising power supplies and cooling, efficiency, operational cost reduction by having multiple customers and so on, and also statistical multiplexing of the resource.

06:36

And there's a lot of reasons, so there are other motives for putting things into the public cloud. Okay. So there are issues here about legality, which I won't go into; there's not really time. If you want the experts in some of this, they're actually pretty near here; a bunch of really good people have written about this stuff. But there are rules about where you do cloud processing on PII, personally identifiable information, in particular in health care.

07:01

Financial is very strictly controlled in the US as well as in the EU, and the UK is in line with that and so on. But there are some practical things. Even if you stay within a national jurisdictional boundary, you want to keep your data encrypted in storage, and you want it encrypted when you transfer it. And post-Snowden, that's been fairly standard. If you buy storage on Amazon or Google Cloud or whatever, then it's encrypted at rest, and people default to encrypting in transfer.

07:28

But you'd also like to go a bit further. You'd like it to be encrypted during processing, because there are threats. I'm going to talk about those threats because that's what we're trying to mitigate. Okay. So what on earth would the threat be when you're processing, when data is coming off the disk and going to the CPU? What could possibly go wrong? There are also a whole bunch of other things I really don't have time to go into about, you know, key management across multiple organisations.

07:56

A horrible, huge, massive problem; always has been. I don't see any stop to that. But the bottom line on this slide is sort of the word enclave, and secure enclaves are something you may have come across. It showed up, I think, in the FBI versus Apple fight over a terrorist's iPhone. And Apple was like, oh, we can't actually decrypt this phone for you. Sorry, it actually isn't doable by us, at least not in an affordable way.

08:23

In fact, it turns out there were some workarounds for that, obscurely. It doesn't matter about the details, but the reason is just to do with the kind of technology they use for where keys are kept. But actually, even if you had been processing things on the processor that iPhones use, which is kind of an ARM variant, there's technology for running a TrustZone, as they call it.

08:43

And that's a cool thing. And Intel have a similar thing, which is what I'm going to talk about, called SGX, which is the Software Guard Extensions of the Intel processor, which you could use to guard what's going on, in some senses, during processing, okay, up to some limit. AMD have another technology which is halfway between what Intel and ARM do. CHERI is a Cambridge local specific thing, where we built some hardware which does a simpler and better thing.

09:09

But I don't have time for that. GDPR is the legal background. I don't have time to go into the legal background, but this is why you care. You need to make your best effort at keeping people's data secure. If it's health care or it's financial and it has PII in it, then you are into very serious fines if you get things wrong. I mean, we're not just talking about, you know, a slap on the wrist and $50,000; we're looking at 5% of GDP a year while you're still doing the wrong thing.

09:33

Not funny. 5% of your gross profit for your company. Sorry. Okay. So the project we have at Imperial and Turing is called Maru, for obscure reasons. And what we're interested in is trying to see if we can do analytics in SGX, in this extension to the Intel processor. And so we're going to dive into a bit of detail here about how we're doing that. So I'm going to talk a bit about trustworthy data processing in an untrusted cloud.

09:59

That's kind of the starting point here. So we're looking at people who have a lot of curated data that is high value; the bad guys out there will want to try and attack it, and it's in central locations for good reasons. You might want to do machine learning over this data because you might want to, you know, create a Bayesian model of the data. You might want to do some interesting image processing over all the retinal scans, all the images; there's a whole bunch of things you might want to do.

10:24

And then having trained up those systems, you might give them to GPs. So when you walk in with some extra symptom, the GP runs the thing on you. No privacy problem at that point. You've got a direct relationship and they say, oh, you need to go to the hospital right away for an eye operation because blah. You might even have a model, which we have done, which will predict that if you don't get in by Tuesday, Wednesday will be a day too late,

10:45

which is, you know, the sort of thing that is high value to people. Okay. So we're going to have a look at what the underlying problem space is, then a bit of an overview of SGX and so on, and then how we map a machine learning or data analytics platform onto SGX. And at the end of this, I'll try and remember to say what the shortcoming of all this work is. So, trustworthy data processing.

11:10

So the cloud has taken off because a lot of people kind of trusted the cloud provider, and the cloud provider didn't trust the people, the users. So the traditional model is you've got a sort of trusted operating system and hardware, and the cloud providers go out of their way. They have very, very good processes.

11:26

If you go visit Google or Microsoft and go to an Azure site, their processes for managing are such that you don't get, you know, an administrator login, or get to sudo anything; that just doesn't happen. They have really, really good physical access control. They're really pretty good about that stuff. And what they're trying to do is, you know, to isolate users from each other. So they use virtual machines.

11:47

And this is where I came in. Way back when, in Cambridge, we were building a hypervisor called Xen, which is widely used in Amazon and other places, for running multiple guest operating systems, virtual machines, and you get protection between those, kind of. Or do you? Users trust their application. But why should they trust the cloud provider?

12:08

So historically, back in the day, when these hypervisors first came into use, there would be a CERT, you know, literally an alert, for a vulnerability about once a week on the hypervisor. That means that a vulnerability exists such that some guest could run an app which could go and look at all the memory in all the other operating systems via a vulnerability in the hypervisor. So at this point, you fix the bug and you reboot 1 billion virtual machines.

12:34

At which point a lot of customers get a bit annoyed. Right? But if their data was sensitive, that vulnerability exploit could have been the bad guy reading all of the MPs' health records when they presented with weird symptom X, and then publishing them in some scurrilous newspaper, or worse, you know, attacking the entire financial system tomorrow by fiddling with some of the numbers. So this is an issue. So a solution for this might be to run a trusted execution environment.

13:05

So to have some kind of way of supporting isolation in a way that the hypervisor and operating systems do not have the privileges to read across these application domains. So, you know, long story short, Intel built something to do this. Way back when, ARM built this. It's a little easier on ARM. People familiar with, you know, RISC processors and ARM will know it's a highly regular architecture. And adding some new thing to it in a coherent way?

13:33

It's doable. And ARM also have a pretty nice formal model of their systems. When they add a feature, they can figure out the consequences. For Intel, it's incredibly complex. They've got this thing called a microarchitecture, which leads to all kinds of problems.

13:46

But the idea here is that essentially you reverse this whole structure, where users run their application process and it talks to the operating system, which has more privilege, which talks to device drivers and storage and so on. And if you run a hypervisor, that has even more privilege because it can see all these OSes. You flip that around and say, no, the application can enter an execution domain, if you like, which is sandboxed by the hardware.

14:13

Okay. So that's what SGX, or an enclave, is sort of supposed to do: a trusted execution environment. There are lots of different ways of coming at this, and there are several other pieces you need. But the idea is that this potentially saves you from vulnerabilities in the hypervisor or the OS, or even in library code, perhaps in the runtime, breaking things for you. Potentially. So that's shipped on various recent Intel processors.

14:38

And as I say, there's an equivalent in ARM processors, an equivalent in AMD and so on. In fact, the RISC-V project also has a design for an equivalent. So there's a kind of marketplace in these things, which we'll see has a bit of a problem. Okay. So SGX is this trusted execution environment. And so now you're not trusting the OS anymore. Your code starts off by entering an enclave somehow. And this can provide confidentiality and integrity.

15:07

The integrity there checks: are you really running on SGX? Is this really, I mean, actually running on this processor? And if you're talking to another processor, is the other processor able to tell you are who you say you are as well? So there's a whole bunch of other technology in here, but basically you have enclave code and data and thread support in this particular hardware support for this sort of world.

15:33

So it's an extension to what is already a very complicated instruction set architecture. Has anyone here ever read an instruction set architecture book? I think the last one I read all of and understood was the PDP-11. The ARM is just about doable. If you were teaching computer science, you know, processor architecture 101, you might teach from Hennessy and Patterson's fantastic book, which would be about 32 lectures to get about halfway through.

16:00

And that's on the MIPS processor, which is quite simple compared with any of this. But anyway, so Intel, God bless them, have added these confidentiality and integrity checks for going in and out. But also crucially, as I mentioned, the bottom line here is you want your data to be encrypted in storage on disk or SSD, and a lot of the time in transfer over networks, over links.

16:21

And now this supports encrypted memory. So if you're sitting there and you've read about recent events in this area, going, but what about the cache? No, this is about encrypted RAM, so it's a memory controller which can do encryption and decryption of fetches from RAM. Okay. So that's the sort of first piece of SGX: there's some extra protection, and there's some magic associated with each and every separate Intel processor shipped,

16:50

which supplies the keys for doing this. Okay. And I really don't have time to go through code examples. So this is just, you know, a piece of code that runs in the enclave, that gets a message which the user shipped, say off disk or off network. Then that code can safely decrypt, do some processing in the middle, encrypt the output, copy the message to the output, the result buffer, and so on.
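A minimal sketch of that decrypt, process, encrypt pattern, in Python rather than real enclave code. Everything named here (`seal`, `unseal`, `enclave_process`) is hypothetical, and the cipher is a toy SHA-256 keystream with an HMAC tag standing in for the hardware-backed crypto; it is not the SGX SDK and should not be used to protect real data.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt then MAC. Layout: nonce (16) || ciphertext || tag (32).

    A real design would use separate encryption and MAC keys.
    """
    nonce = os.urandom(16)
    stream = _keystream(key, nonce, len(plaintext))
    ciphertext = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def unseal(key: bytes, blob: bytes) -> bytes:
    """Check the tag first, then decrypt."""
    nonce, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))


def enclave_process(key: bytes, encrypted_input: bytes) -> bytes:
    """Stands in for the code that would run inside the enclave."""
    message = unseal(key, encrypted_input)  # decrypt inside the enclave
    result = message.upper()                # "some processing in the middle"
    return seal(key, result)                # encrypt before the result leaves


key = os.urandom(32)
blob_in = seal(key, b"patient record 42")   # what arrives off disk or network
blob_out = enclave_process(key, blob_in)    # the outside only sees ciphertext
print(unseal(key, blob_out))                # b'PATIENT RECORD 42'
```

The point of the shape is that plaintext exists only inside `enclave_process`; the caller handles sealed blobs on both sides.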

17:16

And there's an interface to this. There's a sort of enclave enter and exit, but there are also ingress and egress calls into the system. So when you construct the enclave in the first place, you need to know your code got there safely, and it's done sort of a page at a time: you move the code in there, and then at the end of that there's an enclave measurement process where the CPU can calculate a measurement hash and then just say,

17:45

have we got the right code there? Are we talking to the right person? And the second piece of this, apart from local attestation, is remote attestation. So we need to talk about two different enclaves talking to each other, and remote attestation. So these are the pieces I just skipped over because they're fairly standard crypto protocols for doing that kind of thing.
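As a rough illustration of the measurement idea, here is a sketch that hashes code a page at a time and compares the result with an expected value. The 4096-byte page size, the function names and the way the offset is folded into the hash are all assumptions made for the sketch; this is loosely in the spirit of an enclave measurement, not Intel's actual algorithm.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def measure(code: bytes, page_size: int = PAGE_SIZE) -> str:
    """Fold each page, together with its offset, into one running hash."""
    digest = hashlib.sha256()
    for offset in range(0, len(code), page_size):
        # Pad the final page so every page hashed has the same length.
        page = code[offset:offset + page_size].ljust(page_size, b"\x00")
        digest.update(offset.to_bytes(8, "big"))
        digest.update(page)
    return digest.hexdigest()


def matches_expected(loaded_code: bytes, expected: str) -> bool:
    """The 'have we got the right code there?' check."""
    return measure(loaded_code) == expected


good = b"\x90" * 10000                           # stand-in for enclave code
expected = measure(good)                         # published by the code's owner
tampered = good[:5000] + b"\xcc" + good[5001:]   # one byte changed in page 1

print(matches_expected(good, expected))          # True
print(matches_expected(tampered, expected))      # False
```

In attestation, a value like `expected` is what the other party checks before it agrees to hand over keys or data.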

18:06

Think in terms of secure email, if anyone's ever used secure email: there's non-repudiation of who sent this to you and who you are to receive it and so on. So that's kind of what you're getting from that. Okay. Hardly anyone ever uses it. Very strange. But anyway. Okay, so that's what that's all about. There are some interesting limitations which very, very much matter. The current one, at the top, is the amount of encrypted memory you get.

18:35

The current SGX Intel processor is extremely limited. You really don't get very much. By my standards, with the PDP-11 (a couple of people in the room here might know), the first sort of LSI-11 had about 56 kilobytes of usable memory, and we used to use that for multiple users running Version 6 Unix, for login over Cambridge Rings. It was fine with kilobytes of memory. So in this terribly impoverished world in terms of encrypted memory, you're going to get about 90 meg of encrypted memory.

19:06

And of course everyone's probably used to writing an analytics program where you glibly throw in some small core application and you go, oh, it doesn't matter, I've got four gig on my laptop, and if that's not enough, I can put in 16 gig, and I can run it in the cloud and it offers terabytes or whatever. And there are also some overheads going in and out of the enclave that are non-trivial, particularly if you exceed the memory and you start paging.

19:27

There's a massive overhead because you have to handle the pages in software to move them out. And so it's pretty, pretty scary. Okay. So there's a bit in the middle there which we'll come back to: side-channel attacks are possible. Intel never said they weren't. But in general, if you're careful about how you do things, they may be quite hard to use, or they might have been until recently. Okay. So that's not what we've done; we didn't do that.

19:51

That's all Intel, and other folks have done similar things. I mentioned ARM. But what we've done is we wanted to put arbitrary applications into this. You could take your application and add calls to enter and exit, edit the code, put it in, compile it and run it, and you'd be using the enclave in some way. But we said, how about we do arbitrary application support, you know, so we may be running a JVM or .NET or some other runtime support and so on.

20:13

And then we also need to talk to the outside world. So we have to do things like loading. And then we need to talk to file systems, deal with signals, do networking and processes. Those bits of code had better do the ingress and egress and the enclave security, because they're talking out of the enclave at that point.

20:32

So they have to do the right thing in terms of crypto. So Peter Pietzuch's group at Imperial have built this Linux kernel library to support arbitrary Linux applications. Anything that lives in a fairly standard Alpine Linux world will just run on this. This is kind of cool. So any binary that runs on Alpine Linux will run on this Linux kernel library.

20:55

And the idea is you just edit the relevant bits of that kernel library so that it enters and exits the enclave, and then an application running on that is now running in the enclave. And anything that wants to call networking or disk I/O goes through the appropriate libraries, which do the appropriate crypto in and out.

21:11

If you're sitting there thinking (there are security people I recognise around the room), yeah, but how do you know the network crypto is any good, how do you know your disk I/O crypto is any good, and how do you know Intel's memory encryption is any good? Well, you don't. But, you know, you might have gone and model-checked some of them. We have another project which is doing this all in OCaml and we have a model-checked stack, so we think that's okay.

21:35

But that will give you network crypto. Okay. You could put that into this world. So that's really what's going on there. There's a bunch of other pieces you have to look at, but this is very standard systems stuff. And I have to go through it pretty fast because it's background.

21:51

And the idea, though, is you're going to have to deal with your memory management, and your system call stubs have to be implemented, because you can't do an actual system call. You can't do a trap in and out of the code, because that trap is itself changing privilege level, and that's what you're trying to get rid of. So you try not to do that. Okay. So the thing we wanted to do with all of this, to cut to the chase, was to do some big data processing.

22:16

So you could choose lots of different data processing and analytics platforms. Actually, the first thing the Imperial folks did was with some guys from Germany. They had a really nice project where they put Docker containers into SGX. That was a clever idea because that put anything containerised potentially into an enclave. And there's a paper about that, I think in OSDI a year and a half ago, which goes by the name of SCONE, for secure container environment for SGX.

22:45

Anyway, but then we thought, well, actually that's too general. Let's go one step less general and let's take a particular data processing platform. And the, you know, interesting one of choice might be Spark. Hands up if you use Spark. Ooh, about three people. Anyone use Hadoop? A couple more people. Any people use MapReduce?

23:10

Okay. The same kind of people. Okay. So if you have a lot of data and you want to parallelise and distribute things in a data centre, because there are lots of processors there and lots of racks of processors, then these are a fairly standard set of tools that let you do parallel distributed computing over a data centre environment.

23:29

They're not the same tools you would use in an HPC, very tightly coupled cluster computing platform, where you'd use some PVM style of system, but they're very, very widely used, and Spark is particularly state of the art in machine learning styles of tasks because it has a fairly nice way of dealing with asynchrony and redundancy that actually scales quite well. And Spark is usually kind of coupled with R, and R is a language package which derives from S,

24:03

which is a very, very commonly used statistics package. So there are a lot of good reasons to take Spark as an example. One of my students, who I didn't really supervise, who's way too smart and did all his own work, currently works for Microsoft in Azure research and things, and he's done this for Hadoop and for SQL Server and for some other things. I think he's got a blockchain running in SGX, which is kind of cool.

24:28

But we chose to do this with Spark because for our friends in analytics and machine learning, that was their kind of principal current tool of use. You know, people out there might say, oh, well, actually, you know, I'm using TensorFlow, because I'm a real neural net person. That's the tool. Well, we haven't done that, but one of the things we're doing with this work is documenting what we have to do so that somebody else could repeat that work.

24:50

You know, just take the runtime for the data processing, for the analytics platform, and redo it. So what's the interesting issue with Spark? Basically it's kind of still doing a MapReduce style of computation, where you've got a bunch of data in each node and each node maps a function and reduces that, and then you share all the results across all the nodes and then move on to the next step.

25:13

So what we want to do is to take that code that maps and reduces functions and put that in an enclave, and then have the data come out of memory, because that's where it's going to be; it's going to be in one of these RDDs that Spark uses. And then, in the enclave, it gets decrypted, has the function mapped over it, you iterate, and then you move on to the next function, and you do that on all these nodes in parallel.

25:35

So what Spark is written in basically means that we have to put a JVM into the enclave. So this gets exciting. So, you know, as I've said, we could map other things into the enclave, but we chose to start with Spark and document it in such a way that other people could use it. One of the things you might be thinking out there, if you do any large-scale machine learning, you might be going, well, what about the accelerator hardware that people use?

26:05

So people in DeepMind and many other places, Microsoft, Facebook, wherever: anyone doing machine learning probably will use a GPU. They might use an FPGA to do acceleration, and this thing is outside of the enclave. Of course, you have to communicate over some channel, maybe memory buses or whatever, maybe some network link to it, or some other model.

26:29

But so there's an issue there. It would be nice if somebody built an enclave GPU, or an enclave, you know, trusted execution environment extension to TPUs, which are Google's essentially tensor matrix multiply units, roughly; a bit more than that. Yeah. Okay. There's a bottom line there as well, which we'll come back to. Okay. So the idea we have is we looked at Spark and said, okay, this is cool, but it's very big.

26:57

You've got Spark, which is kind of a huge amount of code that does lots of cool things. Actually, it kind of maps functions over things and then does some cool iterating of that. And it's not that complicated, but it needs a JVM, which has to be put in the enclave too, and that's very big. So how about we partition this software? So we run only part of it in the enclave, and all of the functions which may not be touching sensitive data stay outside.

27:22

So we could do that by a mixture of static analysis and runtime analysis: static analysis to say what is touching the data, you know, obviously, and then run it and see what touches it. But actually you don't really need to do that. There's a cool thing here, which is that Spark is applying a function, right? And that function is being applied.

27:42

If people write code correctly, it's applied in an enclave to data that's come off disk or come off HDFS or come off a cache or come off an RDD, and it's encrypted. So it hasn't been decrypted yet. So what we touch is not sensitive at that point. So we don't have to worry about any of these, more or less. So we can probably just deal with the very core pieces of SGX Spark. So this is just sort of going through that detail. But again, I don't have time to go through it.

28:12

But this is just saying, you know, what has to live inside the enclave is really decrypting the input data, computing f of whatever we're, you know, iterating over in the input, encrypting the result and so on, and just doing that. Okay. There are two steps in there just to illustrate that. And this is just showing a kind of more general partitioning of the components that go on. So, again, I don't have time to go through the details of this.
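The partitioning described here can be sketched roughly as follows: an untrusted driver that only ever handles ciphertext, and a small trusted part that decrypts, applies the function and re-encrypts. `ToyEnclave`, `untrusted_driver` and the integer-only toy cipher are hypothetical names and simplifications for illustration (there is no integrity check, for instance); this is not the project's code or the Spark API.

```python
import hashlib
import os
from functools import reduce


def _pad(key: bytes, nonce: bytes) -> bytes:
    """Toy 8-byte one-time pad derived from key and nonce. Illustration only."""
    return hashlib.sha256(key + nonce).digest()[:8]


def encrypt_int(key: bytes, value: int) -> bytes:
    nonce = os.urandom(16)
    raw = value.to_bytes(8, "big", signed=True)
    return nonce + bytes(a ^ b for a, b in zip(raw, _pad(key, nonce)))


def decrypt_int(key: bytes, blob: bytes) -> int:
    nonce, body = blob[:16], blob[16:]
    raw = bytes(a ^ b for a, b in zip(body, _pad(key, nonce)))
    return int.from_bytes(raw, "big", signed=True)


class ToyEnclave:
    """The small trusted part: the only place where plaintext exists."""

    def __init__(self, key: bytes):
        self._key = key

    def map_record(self, f, blob: bytes) -> bytes:
        # decrypt input, compute f(x), encrypt the result
        return encrypt_int(self._key, f(decrypt_int(self._key, blob)))

    def reduce_records(self, g, blobs) -> bytes:
        values = [decrypt_int(self._key, b) for b in blobs]
        return encrypt_int(self._key, reduce(g, values))


def untrusted_driver(enclave: ToyEnclave, partition, f, g) -> bytes:
    """The large untrusted part: schedules work but only sees ciphertext."""
    mapped = [enclave.map_record(f, blob) for blob in partition]
    return enclave.reduce_records(g, mapped)


key = os.urandom(32)                                    # held by the data owner
partition = [encrypt_int(key, v) for v in [1, 2, 3, 4]]
result_blob = untrusted_driver(
    ToyEnclave(key), partition, lambda x: x * x, lambda a, b: a + b
)
print(decrypt_int(key, result_blob))                    # 30
```

The design choice being illustrated is that the scheduling and data movement code can stay outside the trusted part because it never needs the key.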

28:40

There's movement between different JVMs, because you could be running multiple instances of Spark, and so we have to worry about that. So that involves not just having encryption of I/O to storage and encryption of I/O to networks, but now we have to encrypt shared memory. So that's another thing we have to manage, which is again another weakness of all of this: this is sort of a house of cards.

29:02

And if you're a security person or a systems person, you could pull out any one of those cards and say, but what if you get that wrong? There could be a vulnerability there, just like there could have been a vulnerability in, you know, the old cloud model with the hypervisor and so on. Yes, there could, but we could fix this and then we move on. Okay. So that's the first part of the talk. So we had that all working as of around just before Christmas, actually.

29:28

And then what happened? I'm going to move on to this topic in a second. But what happened just after Christmas? Hands up. Yeah. I hear the ghostly voices. Spectre. A spectre is haunting, you know, the clouds of Europe and the world. And the first thing we did was check that Spectre breaks SGX, and some folks had just published a more detailed thing,

29:49

but I think we were the first to go, oh, dear. So this may hopefully get fixed at some point, but basically speculative execution, in very complicated microarchitectures like Intel have, allows you to do things ahead of time. What would normally, in a sort of sane world, be another thread and have all of its access control checks done in the right way doesn't in this world. So to use all the resources you have on the processor, just in case this branch of code might be useful later,

30:22

You can run it. And the first thing that this very large number of people, whom I don't have time to quote and go through, found was, oh, you could break the user space / operating system boundary. This is because speculative execution would just go off and start reading OS memory or memory from other processes. But then the speculative execution would end because it was wrong, the branch was wrong, and it would be terminated, in fact be shut down, because it had done the wrong thing.

30:45

But there's a side effect. And a side effect is what Intel never claimed to protect against, which is any data access that pulls things out of memory into the cache, leaves traces in the cache until the cache is evicted.

30:58

Unless you manually evict them from the cache, they're around for a while. So with the cache memory hierarchy, you've got a short amount of time where you might be able to read that, unless you change all your code to evict things from the cache, whereupon your processor will run ten to a thousand times slower, which has kind of just got rid of the cloud having any point whatsoever.

31:15

You're in trouble. And so it turns out the same kind of attack works across SGX because basically, as I mentioned, the memory encryption and decryption happens between RAM and cache memory. So you're still subject to the same possible attacks. Interestingly enough, on, um, TrustZone, it doesn't appear to. And I think what's happened is,

31:37

ARM have a very simple, elegant design where, when speculative execution branches across the TrustZone boundary, it stops and goes, no, you don't get to do that, correctly. But it does work across the OS boundary on the ARM, which is a bit puzzling. So we're kind of okay. You know, there are ways to mitigate this by changing code in lots and lots of places, and you wait for new processors.

31:59

So if you're a data centre, like a typical Facebook data centre with maybe a million cores, you know, buying a million new cores is quite expensive, and it's going to take a while. So this stuff is not really ready for hardcore prime time. But actually, I should say, kind of yes it is, because you could always make sure that your application and Spark run in SGX and only run on cores that have no other things running. And then you go, well, yeah, but then we don't get the speed that we need.

32:29

Yeah. But if you're using 100% of the core's CPU time anyway, that's okay. And then you could argue, well, then you don't get the cost saving of moving from your private data centre to the cloud. But you do, because you get the amortising over various different, you know, operational costs and so on. So you still get a cheaper thing, but you don't get that multiplexing, basically, which is definitely a bit of a negative thing. Okay. How am I doing for time? I can't read that clock. Half an hour in.

33:01

Thanks. Perfect. Well, no, too many slides. But you can read them later and catch up. Okay. So in parallel with that, which was trying to help people like the NHS or the financial services folks use the cloud in a way that we thought would be secure, or more secure. There is no security. There's just, you know, you've mitigated these things and then the arms race moves on.

33:26

In parallel, we've had a completely separate line of work, which comes from the opposite direction, which is distributed analytics.

33:34

So the idea here is, instead of moving all the data to the cloud and doing a computation there, instead of all these hospitals moving all that data into central databases wherever, and then maybe copying that securely into the cloud, running a secure computation so you get more CPU, and then getting the encrypted output back to their doctors, their medics and researchers, we said: leave the data everywhere and distribute the code to people.

33:59

This is the opposite approach and it's very old. There are two patterns in distributed computing: move the data to the processing, or move the processing to the data. It's kind of classic. You know, of course you could do hybrids. You'll be sitting there thinking, yes, yes, computer science is really good at patterns. We could do it one way or the other, or we could do a mixture. But I'm going to talk about this extreme. And the point of this extreme was that you keep the data with the owners.

34:24

And this is really targeting different classes of data, at least initially. We're thinking of your social media data, your health care data on your phone, maybe monitoring your heartbeat, your skin conductivity, your temperature, the number of steps you've taken today. Why do you need to give that to anyone else, ever? Another example, I'll just quickly go through it; the poster child example, I think, comes from a smart meter project by George Danezis when he was at Microsoft.

34:51

He's now a professor of security at UCL. But he did this beautiful project, which was designing smart metering that never gave the detailed reading data from the meter to the electricity or gas or water provider. It would just give them the summary data. Why do they need to know what the current in and out of your house every 2 seconds is?

35:15

That's complete nonsense, right? They have all kinds of current limiters and fuses and cut-outs to stop bad things happening. But they want to know what the reading is each month without having to visit your house.
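A minimal sketch of that idea in Python, assuming a hypothetical home hub that holds the fine-grained readings and releases only a monthly total. The function name, the data layout and the figures are illustrative assumptions, not the design of the project described in the talk.

```python
from datetime import datetime, timedelta

def monthly_totals(readings):
    """Collapse fine-grained (timestamp, kWh) readings into per-month totals."""
    totals = {}
    for ts, kwh in readings:
        key = (ts.year, ts.month)
        totals[key] = totals.get(key, 0.0) + kwh
    return totals

# Hypothetical raw data kept on the home hub: one reading every 2 minutes for 45 days.
start = datetime(2018, 1, 1)
readings = [(start + timedelta(minutes=2 * i), 0.01) for i in range(30 * 24 * 45)]

# Only this small summary would be sent to the provider.
summary = monthly_totals(readings)
for month, total in sorted(summary.items()):
    print(month, round(total, 1))
```

The provider sees two numbers instead of 32,400 readings, while the household keeps the detail needed to check the summary.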

35:27

That's their big cost saving, and they may want to send your meter a price, so that an app on your home hub management system could say, here are some clever things you could do in the house, like not turning on your dishwasher and washing machine until 4:00 in the morning, because then that would be the best price point. Okay. And they might want to do that in a clever way. That is huge. But they still don't need to know what you use every 2 seconds or every 2 minutes.

35:50

That's just irrelevant. They need to know summary data, but you might want to record that data all of the time. They want to record it so that they can maybe, you know, check, to make sure the summary is correct. So the sort of poster child example here is, well, what does the electricity company want to know that's more fine-grained than a one-monthly reading per house, but not as fine-grained as a reading every 2 minutes?

36:14

They might want to know what kind of household you're from so they can work out a profile of pricing to see what your price sensitivity is. And also, when we were just about to run out of gas a couple of days back, you know, what could they set the price to be to alter the really big consumers' price on the busy day, maybe for a certain class of users? So they need to know what class of user you are. So how many classes of users might there be? You know, how many household types are there?

36:41

So how about this: we know from historical data that there could be 16 kinds of domestic houses, maybe. And the 16 kinds are characterised by some distribution over, you know, some samples through the week, over the busy minute, the busy hour, the busy day. And that is sufficient to tell this house from that house. So they need to acquire the parameters of that model. So we can do that in a decentralised way. We can run all kinds of these machine learning algorithms. We already do.

37:11

We run them in the data centre distributed, except that we run them with all the data coming off the local crypto history. We can leave the data there, send the code out to everyone, learn the model parameters at each node and share the model parameters. Now, you might say even that reveals something about a household. Yes, of course it does. And to some extent.

37:30

But you could also, down the bottom there, share that information peer to peer while you build up the model and build the accuracy. So if you've all got a 16-bin histogram and you're learning what the different models are that fit in that, and you get your accurate model after some number of iterations, you're doing machine learning over this thing and you say, oh, now that's good enough. Now we can ship that to the electricity providers for their customer base.

37:55

And at no point did you give detailed data to them. Okay. So that's sort of distributed machine learning, and there are lots of ways you could do that. Again, this is really neat because you avoid the whole problem of GDPR. So I'm going to go to that. But at no point did you give the data, the raw data. Well, you do have an interesting problem.
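The pattern can be sketched as follows, assuming each home summarises its own readings as a 16-bin histogram and homes then average those parameters pairwise with randomly chosen peers. This is an illustrative gossip-averaging sketch with made-up data, not the algorithm or API of the platform described in the talk.

```python
import random

BINS = 16  # the 16-bin histogram mentioned in the talk

def local_histogram(readings, max_kwh=1.0):
    """Summarise one home's raw readings as a normalised histogram."""
    counts = [0] * BINS
    for r in readings:
        idx = min(int(r / max_kwh * BINS), BINS - 1)
        counts[idx] += 1
    return [c / len(readings) for c in counts]

def gossip_round(params, rng):
    """Randomly pair up homes; each pair replaces its parameters by their average."""
    order = list(range(len(params)))
    rng.shuffle(order)
    for a, b in zip(order[::2], order[1::2]):
        avg = [(x + y) / 2 for x, y in zip(params[a], params[b])]
        params[a], params[b] = avg, list(avg)

rng = random.Random(0)
# Hypothetical raw data: 8 homes, 500 readings each. It never leaves the home.
homes = [[rng.random() ** (1 + h % 3) for _ in range(500)] for h in range(8)]
params = [local_histogram(h) for h in homes]
target = [sum(p[i] for p in params) / len(params) for i in range(BINS)]

for _ in range(40):
    gossip_round(params, rng)

# Largest difference between any home's parameters and the population average.
worst_gap = max(abs(p[i] - target[i]) for p in params for i in range(BINS))
```

Only the 16 histogram values are exchanged. As the talk notes, even those parameters reveal something about a household, so this limits exposure rather than eliminating it.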

38:16

Again, okay, folks have some really good stories on this, which is, if you make a decision to change the price for a customer, they might go, why have you changed my price to be that? My neighbour got a different price change. And you have to explain that. Depending on the model complexity,

38:29

it might be quite easy to explain. You might be able to say, well, you know, you have four kids who put on the dishwasher five times a day. And you don't have to know that: you would feed the price in and it would go into the model, with the data you have in your house, and it would pop up that thing and go, oh, we can see the model fitted this, and then you go, okay. So we built this crazy distributed analytics platform and there are lots of pieces to this.

38:54

The last bit I want to try and get to is how you do very wide area distributed machine learning. So with the first piece, running Spark is important because it's very high throughput. If you've got a lot of data in a data centre, you have a bunch of nodes with a very large memory footprint, ignoring SGX limits, very, you know, multi-gigahertz processors, lots of cores, ten gig, even 100 gig Ethernet everywhere.

39:19

Here you've got a bunch of people with smart meters in a home or their smart TV in their home, and you're sharing model parameters between homes. You've got a wide area network, and the uplink out of people's homes is typically, today, ADSL. In 90% of the UK it's ADSL, with around a megabit uplink. It's about 10 million homes on fibre.

39:38

So then the uplinks are a bit faster. But the model parameters, the histogram, some values, they're not really a lot, and you don't have to do it very often, because how often do you run that computation? How high throughput is it? Say you were doing this on people's smartphones and you're trying to learn a model of them as part of a model of lots of people's, you know, health response to a sudden drop in temperature when they're going out running.

40:00

And then you might want to feed back, somehow, a warning to a collection of people in this thing saying, you know, famously, don't go and shovel the snow, because the temperature drop combined with sweat will cause heart attacks. I lived in Canada for a while, and this was a warning you'd get. If you're over a certain age, it's like, you know, get your neighbours' kids to shovel the snow.

40:19

Okay. So we built this platform called Owl, which is a distributed numerical package, basically, to start off with, and it's written in OCaml. And we had a goal in doing this. We have a library operating system which, instead of being written in C or C++, is in OCaml. This is a library operating system we have in Cambridge called Mirage, which is a very cool system in OCaml, which means we don't have a large class of vulnerabilities.

40:45

You might say, why OCaml? OCaml is like a variant of ML. I mean, why didn't we use Haskell, or why didn't we use something else? It's because we're Cambridge. We use OCaml. But, you know, you could redo it in 3 minutes in Haskell. Well, no, 11 minutes. Right. So okay. So we built all these things for this reason: it was for doing this distributed system.

41:09

And so there are a lot of different applications. We've built a mad amount of code, with a lot of cool people contributing to this out there.

41:16

So this is a brief picture of the whole architecture of our sort of distributed and parallel analytics, with a whole framework for applying functions over data in a wide area and various system backends. A lot of people work on this in Cambridge. Just to say, we've even got bits that go in browsers, and we have bits that do memory management for types. And, by the way, Sam Staton in Oxford does some of this stuff with monads and numerical stuff.

41:45

So there are some very cool theory people here whose papers we liberally read and went, yes, good, we can use that. Okay. So we can even run code, we can map code down onto GPUs and all kinds of other pieces. So, you know, we have arrays, which doesn't sound very functional or very nice, but we have ways of doing MapReduce over those. We have neural nets, and we have a way of doing peer-to-peer neural nets, so we can train up a neural net.

42:15

And we have a poster child example of this, which is learning to recognise faces. So we have a neural net running on Raspberry Pis, running this code with this library operating system inside a Docker container. So it's even bulletproof at that level, probably. I don't know how good that is. And then we're training on faces, and we share all these parameters in this neural net about faces between lots and lots of little tiny nodes that each learn about three faces in this house.

42:41

Three different faces in the house. What are the features that make faces? And then you get a better model. And then you go, oh, this is somebody who lives in this house. Or, oh, we recognise that is a face, but we don't know who it is. So, right, that's an application we have. It's an example of why you would want privacy and so on. So we'll skip that code again. You don't need to see the code if you want to use this.

43:02

This is all downloadable as well. So I wanted to skip to this: now you're doing a distributed computation, and there's this problem when you're iterating over data. Our grand vision, right, is there are, say, 35 million homes in the UK. Imagine every home has a Raspberry Pi, or whatever your favourite small computer is, as a home hub.

43:25

You know, we just send it to people in the post and they just plug it in and forget about it, or it's just hidden inside something they bought anyway. Right. And it says, you know, this will look after your personal health data and will never send it to anyone else unless you approve, and it might back it up, encrypted, to your GP's, you know, cloud service, but they won't be able to look at it without your permission. That'll be a sort of model of the world.

43:46

So now you want to learn about things on that data, over 35 million nodes, and you do an iteration and you need to do a sort of next step of the iteration. So if you're doing a classic MapReduce, everyone does their bit of the data and then they do this huge exchange of data. And you need to synchronise everything, don't you? Wouldn't you have like N squared messages going everywhere at this point? This is a problem for training neural nets in a data centre.

44:13

If you parallelise training neural nets over data, you split all your face data over lots of big nodes in a data centre and you run your TensorFlow over it. Then there's this huge exchange, right? You go through the next step of the iteration, the output comes out, and you may be looking at gradients or whichever thing you use for the feedback into the training, but you need to share it all with the other nodes.

44:33

And you've got an N squared message problem. So this is the kind of piece that we had to start thinking about, because now we're not in the luxury of a data centre. N squared won't scale; it certainly won't scale to 35 million. And in a data centre you don't have 35 million, but you might have 100,000 cores, and 100,000 squared is still not a good number of messages for every step of the iteration. So what do you do? You need to throw away some stuff.

44:57

So the classic sort of barrier synchronisation step in a kind of Hadoop is not going to work. So you can come up with other ways of doing this. There's a classic one, if you want to see that: I think the best-read paper is probably Hogwild, where you have parameter servers, so that actually you send stuff to one point. So you have N messages rather than N squared, and they share it out. But there's more recent work where people have done other things you can do, and you could run asynchronously.
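To put rough numbers on that, here is a small sketch that counts messages per iteration under two simplifying assumptions: in the all-to-all exchange every node sends one message to every other node, and with a single parameter server every node does one push and one pull. The function names are illustrative.

```python
def all_to_all_messages(n):
    """Every node sends its update directly to every other node."""
    return n * (n - 1)

def parameter_server_messages(n):
    """Every node pushes one update to the server and pulls one model back."""
    return 2 * n

for n in (100_000, 35_000_000):
    print(n, all_to_all_messages(n), parameter_server_messages(n))
```

For 100,000 cores that is roughly ten billion messages per step against 200,000, which is why the full exchange is a problem even inside a data centre.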

45:23

Why could you not run asynchronously in training? Well, it depends on the learning algorithm, but most algorithms, say gradient descent or stochastic gradient descent, will still converge even if you don't do everything at the same rate. You can be asynchronous. You can even lose data: if you gain more from doing more iterations than you lose in accuracy by losing data, then maybe you speed up the overall computation to get to the accuracy level you want your training to reach.
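A toy illustration of that claim, assuming a one-parameter linear model, noise-free data, gradients computed from parameter values up to a few steps stale, and updates dropped at random. All names and constants here are assumptions chosen for the sketch; it shows the effect on one easy problem and proves nothing in general.

```python
import random

def sgd_with_loss(drop_prob, steps, seed=0, lr=0.05, staleness=3):
    """Fit y = w * x by SGD with stale reads and randomly lost updates."""
    rng = random.Random(seed)
    true_w = 3.0
    w = 0.0
    history = [w]  # past parameter values, so a worker can read a stale one
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)
        y = true_w * x
        # Asynchrony: the worker read the parameter up to `staleness` steps ago.
        stale_w = history[max(0, len(history) - 1 - rng.randint(0, staleness))]
        grad = 2.0 * (stale_w * x - y) * x
        if rng.random() >= drop_prob:  # the update survives the network
            w -= lr * grad
        history.append(w)
    return w

print(sgd_with_loss(drop_prob=0.0, steps=2000))
print(sgd_with_loss(drop_prob=0.3, steps=3000))
```

Losing 30% of the updates costs extra iterations here, but the estimate still ends up close to the true value of 3.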

45:53

So that's kind of the theory behind what we built in this, where we relax all these things and decompose the synchronisation. Basically what we end up with, right, is probabilistic. This is all name-checking the right people, hopefully. And there's one fantastically clever bit in this, which is that it actually is faster and more accurate. This is by Liang Wang, who's a post-doc in Cambridge. He came from Helsinki with a very smart Ph.D. and started out working in mobile networks.

46:25

And he looked at this stuff and went, right, now what you need to do is this: you need to essentially discard results statistically. And you know what? If your sampling algorithm is correct, then the system can be made to converge arbitrarily close to as well as not losing those results. So you look at the accuracy you're getting from different nodes giving you data, the output parameters you'd normally send to the Hogwild parameter server.

46:52

And then you can do really, really well with this. We haven't tried this with 35 million homes yet; that's some way off. We have tried it, you know, in small systems with a thousand nodes and little test beds and so on, and with scaling stuff you can sort of extrapolate; a thousand is a large number, I think, probably. Yeah. So we have this cunning sampling primitive, which is kind of a clever way to hold a function, so just an implementation trick and so on.
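One way to picture a sampling-based barrier, as a hedged sketch only: instead of waiting on every node, a worker inspects the progress of a random sample of peers and applies a staleness bound to that sample. The function `may_advance` and its parameters are hypothetical and are not taken from Owl's code.

```python
import random

def may_advance(my_step, peer_steps, sample_size, staleness, rng):
    """Decide whether a worker may start its next iteration.

    Only a random sample of peers is checked, so no global barrier is needed.
    The worker advances if it is at most `staleness` steps ahead of every
    sampled peer.
    """
    sample = rng.sample(peer_steps, min(sample_size, len(peer_steps)))
    return all(my_step - s <= staleness for s in sample)

rng = random.Random(0)
peers = [10] * 999 + [2]  # one straggler among a thousand nodes

# Checking everyone with zero staleness behaves like a full barrier.
print(may_advance(10, peers, len(peers), 0, rng))
# Checking nobody is fully asynchronous.
print(may_advance(10, peers, 0, 0, rng))
# Checking a small sample will usually miss the single straggler.
print(may_advance(10, peers, 10, 0, rng))
```

The sample size and the staleness bound are the two knobs: a full sample with zero staleness is the classic barrier, an empty sample is fully asynchronous, and values in between trade synchronisation cost against how much stale work is tolerated.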

47:21

So we've sort of discovered something. I thought this was fantastically new. And then I examined a PhD on distributed neural net training at Imperial, from Peggy Kerr. I remember she's now at Microsoft in a health machine learning group. Very, very cool. And she'd come up, from a slightly different angle, with a very similar trick about two years before us. So we're like, yeah, okay, that's good, that's good. Somebody else. Reproducible research is a good thing.

47:49

Okay. So there are tricks you can play in here with the whole trade-off. Again, I don't have time to go through these numbers, but you can change the step functions in training. Scalability is good, robustness is good, convergence is good and so on. And the bottom line is trying this out with some test data sets that people use as standards in training these systems.

48:16

This is kind of scary, actually, how good this is. It's one year's coding by Liang and five other people and some other contributors around the world. And the green is sort of inference time for, you know, training up Inception v3, for example, which is a classic, versus TensorFlow and Caffe2 doing the same thing, and we're in the same ballpark figure. And this is written in OCaml.

48:39

And it's, you know, very small in numbers of lines of code, and it's really high level and readable and blah, blah, blah. That's all kind of a good thing. Okay. So that's kind of the end of my slides. Apologies. I didn't have a long enough train ride this morning to delete, sort of the old joke, half of them. I shouldn't have that many. Please do email me if you want to follow up on any of the pieces here. There are very specific acknowledgements.

49:07

This is other people's slides or other people's work. I'm kind of a person who runs around trying to get the money from the funding agencies. You know, three goes, you get it, usually. And the important groups, like I mentioned: there's the Large-Scale Distributed Systems group run by Peter Pietzuch, a professor at Imperial College in the Department of Computing. If you want to do systems work, it's an absolutely great group. They have loads and loads of good people there and they do this kind of stuff.

49:35

And then in Cambridge we have Anil Madhavapeddy, Rich Mortier, Liang Wang and a host of other people. The project putting Spark and Java into SGX is funded by the Turing, and actually the main interested party at the Turing is the defence people, because they want to be able to do analytics on surveillance data and be able to prove that the wrong people didn't see the data, or the people didn't see the wrong data.

50:04

So they want to have a get-out-of-jail card, because they're now under the law, which is kind of interesting. But it's the same kind of motive as with the health care and the financial data, which is to be squeaky clean. Right. So they have centralised all the data, obviously. The other side here is Databox, an EPSRC-funded project. And the other partners in that I should mention are Hamed Haddadi at Imperial and some folks at Nottingham. And I didn't mention the downsides.

50:30

Okay. So there's a downside of the Spark SGX stuff, which is the Spectre speculative execution attack plus side channels, which there are mitigations for, but they're problems. The downside of the Databox stuff is we really, really haven't got a good solution to the question of how much you can learn by observing the model updates.

50:53

Now, if you've read about machine learning, and particularly in deep learning, there are very clever people who've worked on how much can you infer from a trained classifier and then fix that problem for that one instance? But if you're looking at the thing being trained, you can probably infer most anything.

51:10

And so there are attacks on a decentralised approach: somebody could just basically join the network in a peer-to-peer system, get all the updates and then infer all the data pretty accurately. So we don't have an answer for that. That's like, well, okay, so GCHQ can infer how much electricity you use every 2 minutes. On the other hand, you know, I suppose, on the financial side that's not a good solution,

51:34

but the financial side we think is over in this space anyway. So it's not that threatening. But in the middle is probably healthcare data, where it starts out with a lot of it being centralised in hospital data records. But more and more, we're moving into this evidence-based medicine where you carry devices and they monitor stuff about your behaviour and so on.

51:51

And then that is on the Databox side, and then you might care about people inferring things about your health which may not be public matters. And so, okay, so that's about it for my talk, and I guess it's question time. Okay, let me kick things off. So we were briefly, before lunch, discussing multi-agent systems, and actually the Owl framework in particular had a very multi-agent-systems-y feel about it. Is that right?

52:29

Oh, that's a very good observation, which I'd completely not thought of. That's not how we think of it. But you're right, we should, probably. That's a good point. Yeah. Because basically we've moved to a set of asynchronous nodes which are exchanging messages about what they've learned, parameters for a model of what they've learned, not the data they've acquired. And that could start to look very multi-agent-systems-y; it felt very much along those lines.

52:55

Yeah, but it starts from traditional training, you know, say we're doing linear regression of a, you know, multi-dimensional thing, and then to make it work at large scale we go async, and I guess we end up in the same kind of space.

53:10

We haven't thought about coming from another space, which would be running, you know, a probabilistic programming approach as well. To inject that into this decentralised architecture would be fun. So there's probably some confluence of that stuff architecturally, which would be interesting. Yeah, no, it's really interesting. Thank you. I will take that back home.

53:35

You mentioned distributed analytics as an interesting idea, where the data would stay at the source. So there's a range of techniques being proposed in the research community for privacy preservation. What would be your judgement on how far away they are? Well. So there's a separate, slightly different thing, which is, well, homomorphic encryption would be lovely. It's kind of like cold fusion.

54:09

I mean, that's not fair. It's not like cold fusion, it's like normal fusion. It's a bit nearer than quantum computing, practical quantum computing. But to be fair, it's demonstrable for very simple functions. And that would be really cool, because that would be much better than relying on the unbelievably complex extensions

54:28

Intel do. So you have homomorphically encrypted data, and you run functions over the encrypted data with relatively simple code. Getting it to go fast is the big challenge, but people are actually making progress. So I think it's probably fair to compare it with normal fusion, where there's actually visible progress in plausible directions and for specific functions. I've seen some really cool results there. So it's definitely a good research direction in cryptography and maths.

54:54

If you're in that space, super fast algorithms. Differential privacy might be a technique we throw at the decentralised exchange of things, where we might put a bound around what we exchange, to make sure that only a certain number of peers get the data for the models, and that the data is checked against an epsilon to make sure it's differentially private, so it doesn't reveal things about the raw data the model parameters are coming from.

55:21

We'd have to think about, you know, what the relationship between model parameters and inference is, but other people have done some of that. So that would definitely be a thing. And then we were talking earlier about when you federate data from multiple agencies into a central system. Before you even get it into the central system, in a lot of cases differentially private data might be good enough for a lot of problems where we're trying to do inference about some health care thing.

55:48

You know, I sort of jokingly said, maybe use of ozone in swimming pools causes asthma. So you do a map of where asthma attacks show up in hospitals. You do a map of swimming pools where they use ozone instead of chlorine for, you know, cleaning the pool. And then you do a correlation, and you can do, you know, location-based differential privacy pretty well, normalised to the population distribution.

56:10

And you still have enough cases of these. I mean, you know, I don't believe that's an actual causal link, I'm just giving it to you as a hypothetical example of a question you might ask, and it could be done securely with differential privacy, and you don't need any of this complicated mechanism. The data would still be staying in privately owned, secure databases, and only query results that met this limit would get exchanged.
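As a sketch of what releasing only a differentially private query result can look like, here is the standard Laplace mechanism applied to a counting query. A count changes by at most 1 when one person's record is added or removed, so the noise scale is 1/epsilon. The records and the predicate are made up for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(1)
# Hypothetical hospital records: 20 asthma admissions out of 200.
records = [{"asthma": i % 10 == 0} for i in range(200)]
print(private_count(records, lambda r: r["asthma"], epsilon=1.0, rng=rng))
```

A smaller epsilon adds more noise and gives stronger privacy. Repeated queries consume privacy budget, which a real deployment would need to track.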

56:32

So I think that's a technique you can apply in lots of places, and it is very practical and well described in the literature. Um, yeah. Okay. Next question. There's one over there. Oh, there's a microphone. Yeah. You seem to have talked a lot about the confidentiality side and, in the distributed one, availability. But in terms of integrity, you mentioned integrity of data at the encryption level, but what about injection of data, sort of Byzantine problems?

57:05

Because if you're going to do smart metering and there's money involved, can I get somebody else's bill changed? Can I actually inject bad data into this? That's a great point and, truth in advertising, we don't have a fix for that in our decentralised architecture or in a peer-to-peer one. We're subject to all the attacks that are being demonstrated time and again on those worlds where money's involved, I suppose.

57:32

I suppose there could be some way of carrying signature data through the model inference that sort of signs the model and says this was derived in some way from something, so the meter companies still own the audit trail without getting the detailed result. But I don't know. That's a completely fair criticism; in that space we have to tackle that somehow. You know, we could try to have a conversation about it, or if you've got a solution,

58:01

I'd love to hear one. Brilliant. Thanks. No, it's a completely fair point. Yeah. And it's always a problem of these decentralised systems that they have some plus side and some minus side. That's one of the big ones. Yep. Injecting fake data. Yeah. For fun and profit. Good point. There was a question up there. Right over there. Thanks. A very practical question, I suppose. At the beginning of your lecture you talked about SGX.

58:33

Have you seen that be made available on the likes of AWS or these sorts of large cloud providers? SGX is on Azure, as confidential cloud computing, and I haven't managed yet to talk to the people that do it, but I believe AWS will have it, you know, in the next minute or two. But of course we have this problem with the Spectre attack. I believe the way that Azure deals with that is they have only ported their own tools.

59:06

They want you to use the confidential cloud that they have, which is Hadoop and SQL Server. And so therefore they can handcraft the mitigations for the attacks. And I think they're really good, that group, actually. So I suspect that they've got that right. But that means you're in some sense stuck with the tools they've done, but they're pretty okay tools for some things.

59:28

So I think generally deploying SGX as a sort of extra service for users to just use: firstly, you need some newer processors, and they're generally not in the high-performance range yet. And secondly, the memory limit probably will kill a lot of your customer base unless somebody is very clever.

59:47

And thirdly, now we're just sort of worried about, oh, there's a new version of it that's supposed to be out anyway, which may mitigate the Spectre attack better and gets rid of a large part of the memory limit problem. So I could imagine a really big cloud provider would just be, you know, having lots of conversations with Intel and maybe even saying, oh, we're having a conversation with AMD at the same time, you know. But yeah, it's a good question.

01:00:07

But I think if you want to use this, you can go to Azure. And they have a couple of things that I think are pretty, pretty solid. Other questions? Okay. Well, everybody, let's thank our speaker. Thank you.

Transcript source: Provided by creator in RSS feed: download file
