Strachey Lecture: How Can Algorithms Help to Protect our Privacy - podcast episode cover

Strachey Lecture: How Can Algorithms Help to Protect our Privacy

Nov 13, 202355 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

In this term's Strachey lecture, Professor Monika Henzinger gives an introduction to differential privacy with an emphasis on differential private algorithms that can handle changing input data. Decisions are increasingly automated using rules that were learnt from personal data. Thus, it is important to guarantee that the privacy of the data is protected during the learning process. To formalize the notion of an algorithm that protects the privacy of its data, differential privacy was introduced. It is a rigorous mathematical definition to analyze the privacy properties of an algorithm – or the lack thereof. In this talk I will give an introduction to differential privacy with an emphasis on differential private algorithms that can handle changing input data. Monika Henzinger is a professor of Computer Science at the Institute of Science and Technology Austria (ISTA). She holds a PhD in computer science from Princeton University (New Jersey, USA), and has been the head of research at Google and a professor of computer science at EPFL and the University of Vienna. Monika Henzinger is an ACM and EATCS Fellow and a member of the Austrian Academy of Sciences and the German National Academy of Sciences Leopoldina. She has received several awards, including an honorary doctorate from TU Dortmund University, Two ERC Advanced Grant, the Leopoldina Carus Medal, and the Wittgensteinpreis, the highest science award of Austria. The Strachey Lectures are generously supported by OxFORD Asset Management

Transcript

Thank you. Thank you, Leslie, for this very kind introduction. And thank you very much for inviting me to come here. I'm really excited to be here. And, uh, yeah, to tell you a little bit about differential privacy in general and about my research then specifically in this field. So the talk, as we say, is should be algorithms who protect the privacy.

So algorithms, you know, are everywhere these days because every computer program and also every integrated circuit is actually implementing an algorithm. And so if you use your phone or your laptop or your home appliance or your car or your cleaning robot or certain medical devices, um, you will obviously have algorithms running there, and they're usually very useful, but they can also collect data about you right now, data is also collected in other ways.

For example, governments collect data about their citizens. In the U.S., for example, they do a census every ten years. Or if you participate in research studies, there will also be data collected about you. Um, this data is usually collected for a certain purpose. So for example, governments collect data so they can decide how to distribute funds who are not.

So uh, um, also insurances use it to decide, you know, how much it costs you to costs to insure you banks to make loan decisions or, uh, doctors use, uh, AI algorithms right now in order to decide whether they should do certain medical procedures on you or not on your phone. Um, if you use, for example, Google keyboards, they suggest word sentence completions. And that's usually done, um, based on data that is collected by their users.

So in general, this sounds like it's great. There's certain usefulness in this data. So there's this utility. But on the other side, of course, immediately comes to the concern. What about our privacy? So you have two conflicting goals. On the one side, you can use this data to do something useful in the world. Um, but on the other side, how can you protect the privacy? And this is really the goal of these algorithms that I will be talking about.

So they want you. They want you to be able to use the data, but in such a way that the privacy of the individuals is protected. And try to explain to you how they are trying to achieve that. Um, now, what are the challenges for, uh, privacy? Uh, there's actually one challenge. Let me start with a second one here. Uh, one simple challenge is already that statistics. If your database gives you exact data, might already leak a lot of information.

For example, if we know the average salary of the people in this room, and then one person leaves, and then I give you the average salary, again, you can easily figure out if you know how many people were here. You can easily figure out what was the salary of the person leaving. Right. Some simple math. Um, so in general, by giving exact answers, you might already leak a lot of information that you might not want to leak.

Um, even worse in machine learning. Very often this, uh, algorithms actually memorise data. So, um, for example, there was some paper in 2018 by collinear oil where they showed that the Smart Compose, which is the sentence completion in Gmail, uh, was leaking the Social Security number of one of the users, okay, because they had used training data that had that information in it. And it's happened to learn that by heart. So how can you deal with this?

Well, so the first idea that people had was to say, oh, why don't we just anonymize data? Isn't that good enough? Yeah. So I might release data, but I just don't put the names. And there are all kind of famous examples that showed that this was a bad idea. One of them was, uh, uh, published by Narayan and Sharma Cough in 2008. Um, what they did was they used data that Netflix had published, so Netflix had published. Uh, so here every row was a user and every column was a movie.

And so the information was whether this user liked this movie or not. And they had anonymized the data by removing the names of the users. But what these authors showed was, oh, if you combine this information together with some public information that the Internet Movie Database had, because many of the users that gave their information to Netflix or the then published on the Internet Movie Database saying which movies they watched and whether they liked them or not,

and they gave their names. And so if you combine these two data sources, uh, they were able to actually figure out the names for many of the users. And they showed that basically, on average, four movies uniquely identify a user. Okay, so it's a bad idea to just anonymize the data. Because if you combine different sources, you might be able then to de anonymize it. Okay. So. So the first idea doesn't really work by itself. Of course you have to anonymize, but you have to do more.

And so we what it also shows us this example is that we need techniques that can handle composition of data sources. Yeah. When you analyse what happens if you combine different data sources. So here are the two data sources the Netflix in the Internet Movie Database. And then the second idea, the researchers, which was cannot be generalised to randomised response. Now let me explain to you what is randomised response. And then I will explain to you how they generalised it.

So randomised response is a technique which was invented by Varner, a sociologist, in the 1960s, and this has been used ever since without any big privacy concerns. Okay. It's a good knowledge, um, to have. And so for later in the talk. And so it has it solves the following problem. Assume you have some population and you want to find out what percentage P of the population has property A now property A might be something embarrassing.

You don't wanna admit that you have property A like snoring at night, for example. Uh. Worst things. Yeah, I leave it up to your imagination. So nobody wants to imagine that they have prob ever wants to admit that they have property. A still you want to find out what percentage P of people has this property. And now here is a solution that Warner came up with. So you have your volunteers who participate.

And instead of asking them whether they have the property or not, they would just lie to you. You ask them to run this simple algorithm. So they should first throw a coin and not tell anyone about the outcome of the coin flip. And then if it was hit, then they should answer truthfully. If it was no, it was not hit. Then they should throw the coin again. And now they should just answer according to the coin throwing.

So if it was, heads say yes and otherwise say no. Okay, but they should not tell anyone, uh, how often they threw the coin. They should not tell anyone the outcome of the first coin flip. Okay. That's the private information. And so they should just give this information back now. Yes or no depending on this algorithm. And then what you do is you determine the percentage Q of yes. Answers from this algorithm. And then you output two times q minus at half as estimate for p, the two answer.

And it turns out that this is an unbiased estimator for the true answer. So it will give you a good estimate. Now why does this also pursue privacy? Well, because you have what's called plausible deniability. So everyone has to answer yes with probability at least the fourth. If you have the property, then you will answer yes here with probability half here and then here.

Or as it was probably half, you said no and then again with probability a half. So all together with probability fourth you will also answer yes here. So if you have the property you will have a probability of three fourths of answering. Yes. If you don't have the problem with the property, you will still have probability of one fourth right of entering. Yes. So everyone has a probability of at least the fourth of entering.

Yes. So if you answered yes and then I come to you and say, oh, you have property, I, you can say, no, no, no, I actually don't have it right. It was just the coin flips, not me. It's a coin flips. Right. You can blame the coin flips for you, I answer, and nobody can prove you otherwise. Okay, so it's private. And note also that you need randomisation. Without randomisation, this would not work right that you need to keep private for yourself. Okay, but randomisation is very important. Okay.

So now the question was how can we make this more formal and more general. And that's exactly what differential privacy does. So it's a mathematically rigorous definition. You will see it in a second. Um and it measures the privacy protection given by an algorithm. And the goal is, first of all, to make it very unlikely that if you combine, uh, that user data can be reconstructed. If no matter how many data sources are combined.

It gives you a way to measure how much impact on your privacy the combination of data sources has. And then you can have a threshold and say I'm only allowing so much privacy loss. And if this combination would be a profit, then I'm not allowing this combination. Meaning I'm not giving out my data anymore. Um, and it also gives a prime, a parameter that tells you for each algorithm how much data leakage stays, how much privacy loss stays.

And there are several current deployments. And I'll talk about this, uh, later in the talk. Okay. So now let me give you the informal definition before giving you the formal definition. The informal definition is as follows. So assume you have a database and it has a thousand people in it. And you ran some randomised algorithm on it. Maybe computing the average salary or something okay. And so you run some randomised algorithm on it and it gives you some noisy answer.

Okay. Noisy meaning because there was some randomisation involved. And now assume that you have the same database except the last person has changed. There is no person 1000. Everything else is the same. And now you run the same algorithm on it. And you also get a noisy answer. And now what you want is that you should return the same answer for D and for B prime, with almost the same probability.

So mathematically speaking, the first the algorithm executed on the first database gives you some probability distribution on the output potential outputs it. And the one on the second input gives you another probability distribution. And what you want is that this probability distributions are very close, very similar. So it's some kind of smoothness requirement on the output of the algorithms. Okay. So if the inputs are very similar like here, they only define the last person.

Then the output distributions for your answers should be very similar. Okay. And if this is the only thing you remember from this talk. You got it. Okay. I will have it again on a later slide. Now, why is this? Why is this a good definition? Well, because then, based on the answer of the mechanism, it is very unlikely that you can distinguish between d or d prime, that you can figure out that the input was d, or that the input was d prime.

You will not be able to distinguish because the output distributions are so close. It would be very unlikely that you can distinguish I should say. Okay. That's the idea behind it. Now, more formally speaking, um, so we say two inputs are neighbouring. If they define at most one entry and other be more formal in the next slides because it depends on the application. What does it mean? One entry. Okay. And so epsilon is some number larger than zero.

And now we say a randomised algorithm. Sometimes they call it mechanism A with a set p of possible outputs. Is epsilon differential private. So p is all possible outputs.

Is epsilon differential private if for all possible subsets S of P and all possible neighbouring inputs D and D prime, the probability that the mechanism started on t outputs S is at most e to the epsilon times the probability that a mechanism with uh input d prime outputs an element of s. That's why I gave you the informal definition beforehand, because this is somewhat of a mess. Yeah. But let me try to interpret this. So Epsilon you should think of as a small number close to zero.

So e to the epsilon is just a little bit larger than one. Okay. So what you want is the probability. And instead of saying element of s you can just think for simplicity here. Um, that's is just one element of P. And that this is equal to S. It makes it easier. So you can think the probability that the mechanism outputs an element of S is almost the same. If you the input was d, what if the input was d prime.

That's what this definition says. And the randomness here, I should emphasise, is only over the randomness of the here, for the probability is only over the random choices of the algorithm. Okay, the input is worst case. We are saying for all possibility and d prime okay, there's no input distribution. This is only on the randomness of the algorithm. And of course, because we say for all possible neighbouring D and D prime, you can also by symmetry exactly get the inverse statement right.

So you get that the probability then you started on D prime is a most e to the epsilon times higher than the probability when you started on D. So really what you get is an upper and a lower bound for the probability. When you started on D it's upper bounded by at most e to the epsilon times the probability for d prime and its lower bounded times e to the minus epsilon. And so if epsilon is small close to zero then this is roughly one. And it gives you a good privacy guarantee.

Now, uh, one thing, however, you should keep in mind that in practice, Epsilon is not a small nor close to zero. Uh, you think of epsilon is more like 7 or 8 when used by companies. So E to the epsilon is already pretty big. But even worse, in the US census they used epsilon equals 19. So you might argue well you know this has probability over individuals. And you know that 300 million Americans, the accountant blah blah blah.

But then you only count a subset of them anyway. And so but still, you know, e to the 19 seems like it's a big number. Okay. But in theory epsilon is small, close to zero. Okay. And so the crucial thing is, uh, this definition of neighbouring. Right. So we have this constraint on the probability distributions for neighbouring data sets.

Okay. So here again this is the one thing you should remember from this talk is differential privacy means there's a strong requirement on the output distributions for neighbouring inputs. The sort of a smoothness requirement. They need to be very close. And this is completely different from anything I've seen in algorithms before. Okay. This was not studied before. And that's why this is also such an interesting research area. If you had not thought about such kind of algorithms before.

Okay. Now why is this useful to have this definition? So now I can informally. So assume that the use of data AI is used in database D, but not in D prime. Then. Now you could write down. Well, what is the harm for a user ie. If the output was a s and you can multiply this with the probability that the output was s, and now you can sum over all possible s. So here you get the exact expected harm that you have. If I say data was used in the database.

And here you can do the same thing for D prime D primaries where ice data was not used. And now because you know this bound here, you know that this expect the term here um is a multiplied with e to the epsilon is an upper bound to the harm that you get if the data was used okay or was differently. Any harm for user I that happens with the inclusion of data. I is that most effect that e to the epsilon higher then the harm the expected harm without ice data.

Okay. So your. This definition is useful because you can found the expected harm that you have by being included. And again epsilon close to zero is good. Epsilon 19 might not be quite as good. Okay. And each of the epsilon is called a risk multiplier. Okay. So now, what is a good definition of neighbouring? Uh, now, especially in the early days, people in differential privacy mostly looked at databases, which they thought of as tables. Okay, so they saw the tables. So you have some table.

Every row is a user and every column tells you something about the user. And then the standard definition of neighbouring goes to databases are neighbouring. If they define at most one row okay. So one user is different. But, um, then they said, isn't this maybe too generous? And they restricted it more. And they say two databases are neighbouring. If they differ only in the inter entrance for one row, so one entry in one row is different.

What even more restrictive one entry in one row differs by at most one. Okay. And so as you notice, these definitions of neighbouring become more and more specific. And thus the requirements become less and less strict. Right. Because you're only required to have very similar distributions if you're neighbouring, if you're neighbouring here in this one, in this way you also neighbouring here and you also neighbouring here.

While there are cases where your neighbouring here and not neighbouring here. So here you have more requirements because you have more neighbouring relationships than down here. Okay. And now, uh, people have studied this kind of different kind of, uh, requirements because depending on the application, that might be more suitable. And you may be able to get better balance down here. Now balancing both. Well, we'll get to that in a second.

In the error that you get. Okay. So there are epsilon differential private algorithm for all of this. But the error on second formalise error. The error is better if the if you have fewer constraints can you handle a smaller. You will get a smaller error. You might also look at graphs. Think of social networks. Uh, each graph, you know, each user is a node. If there's a friendship relationship, you have a link edge.

And now you could look at vertex neighbourhood of graphs. There are two graphs are neighbouring if they define at most one node. Or you can say h neighbouring where the nodes are actually known. And you want to protect the privacy of the edges. And so you say, uh, two graphs are neighbouring if they define one edge. And well, you could just say that neighbouring if you have weighted graphs, the graphs are actually known but the weights are not known.

And so only the weight of one edge can differ. Okay. Um. So say it. You know, this way, while I was talking, the notion of neighbouring is very closely related to what you are protecting, right? So if you're only protecting the weights of the edges, um, then you're not protecting anything else. You're not protecting the nodes or the edges in the graph, meaning they might just as well be publicly known.

If you're each neighbouring, then the nodes might just as well be publicly known because you're not protecting them. You're only protecting the H relationship. Who is in relation with whom. And if your vertex neighbouring this is the strongest again, then you're protecting even the nodes in the graph. Okay, so what do you protect? You mean what you want to protect immediately carries over to what? Your definition of neighbouring.

And depending on your definition of neighbouring, the more was the more strict or the less strict to this it carry carryover to how much era you need. And since there's so many possible applications and combinations of what what this right definition of neighbouring, there's lots of work to be done. Now let's go back and analyse randomised response. I claim randomised response is differentially private. The question is just for which value of epsilon. Okay.

Now um but before doing that we need to have a definition of neighbouring. Right. Because I told you, um, you the most important thing in the definition of differential privacy is you need to exactly specify what is neighbouring. So here for, uh, randomised response, um, you want to protect the bit of each individual user. Okay, so then you say two inputs are neighbouring. If they differ in the bit of at most one user.

Okay. So. Think of the initial inputs here as being a string d, then two inputs D and d prime are neighbouring. If one bit this is different. It's flipped. Okay. That's the neighbouring definition that you would use here. Now, why is differential privacy such a great definition?

Well, it fulfils two main properties. And the first property is the post-processing theorem which says assume you have a mechanism that is epsilon differential private and it maps the data to some set R. And then you have some function that applies to R and compute some r prime. Then if you do the combination where you first run em and then run F on the output of em, this would be again epsilon differential private. Okay, so informally speaking, you had some data that was sensitive.

You ran it through some mechanism. There was epsilon differential private. And now the data is clean, meaning you can do any post-processing on this output of the mechanism, not on the original data. Yeah. This does not allow F to touch the original data. F is not allowed to see the original data. The only is allowed to see the output of the mechanism, but if it only looks at the output of the mechanism, it will be fine.

The whole thing. The combination of first running the mechanism and then this function is again epsilon differential private. So for randomised response this means as follows. You can really sorry this was not a good idea. This animation here. Um so you can really break it into two pieces. The first part which is what every participant had to do. Yeah this is the first part. This is really the mechanism. And you get this answers yes or no.

This is the first part. This would be the mechanism. And we will show that this here is epsilon differential private. And then taking all this answers, figuring out the percentage Q of yes answers and outputting two columns Q minus a half is just a function F is just post-processing. Okay of this differentially private output that we got here from the mechanism. Okay. So. So now in order to analyse randomised response I only need to analyse this first part.

I don't need to analyse the second part because the second part does not look at the initial data of the real users. Again has no contact to the user. Okay, so doesn't look at the initial data, only looks at the output of this mechanism. Okay. So this is one good theorem. And now you can actually do this analysis. I will not do it. I will just say you can actually do this.

You know always look at the ratio of you know if you are in there or not in there and what's the probability that you say something, blah, blah, blah. What you will find out is that this ratio is at most over three, which is E to the L and three. Okay. So for randomised response this is your homework to do this yourself. But for randomised response I'm a professor right. Whenever it gets hard it's the homework. So for randomised response epsilon is LN3 okay.

And I have to say there are different variants of randomised response where you would not use a fair coin. You would use a biased coin in your coin flips. And if you do this then you can make it also epsilon differentially private for any epsilon larger than zero. Just this simple variant that I showed to you was LN3. Okay, so this is the first useful thing. The second useful property of differential privacy is composition.

Okay. So assume you have a first mechanism that's epsilon one differentially private. In the second one that's epsilon two differentially private. And now you want to have a mechanism that you can achieve. You want to have a mechanism that just outputs both outputs M1 and m2 the output of M1 and the output of M2. Then the theorem says that this is epsilon one plus epsilon two differentially private. And that's not too hard to show based on the definition of differential privacy.

And now to do a post-processing. So once you output these things then you can now throw any function g on these two outputs. Basically post-processing is before. And that now tells you that also some function g arbitrary function g based on the output of these two things is also epsilon one plus epsilon two differentially private. So this allows you to analyse the harm or the influence on your privacy done by having multiple mechanisms,

looking at the same data and outputting something about it. Okay. So if we look for example back to this movie data. So let's say D is a true data. So for all the users, whatever they looked at and whether they liked it or not. And then some of the data might be Netflix. Okay, so Netflix would be the first mechanism. And it is it discloses something. Now, as we said before, Netflix only anonymized the data and you can show that the station does not fulfil epsilon differential privacy.

Epsilon differential privacy, meaning the epsilon would be infinite. Yeah. Now the second one is the Public Internet Movie Database. And the users might not write about all the movies they have seen. So this might be, again, just some incomplete set of the truth T and that's just disclosed. There's also mechanism to just write down whatever, I guess. Again, not at all private. Okay. So you have a second mechanism with epsilon being infinite.

Okay. So now not surprising that the combination of the two is again no privacy. But if this first one would have had some privacy epsilon one in the second one would have had privacy epsilon two. Then the combination of the two having both of them out there would be now a privacy loss of epsilon one plus epsilon two. Okay. So this is the second very nice property of differential privacy. Um, now I there's one thing that people then usually ask me at this point in time.

So I'm, I created a slide for it, which is there is two different notions of differentially private mechanisms. There's the one that you just saw, uh, which is called local differentially private algorithm. Uh, it's like a randomised response. So here the user has a two data whether they have property A or not. And they output some noise variant of it because they flip this coin. So they output a noisy variant of it and they give to the server the noisy variant and all.

What the server does is post-processing. Okay. And as we know, the server cannot do any harm because the data was already private. Um. That's how I've always thought the notion of central differential private privacy, that's more of what you would see in medical databases where, uh, the users actually give the raw data to the server, to the database.

Okay. So the database actually has the true answers in it, but then it let some differentially private mechanism run over it that then outputs a differentially private output. Okay. And so here the difference is the server has actually got all data, which is here where the server did not have the true data, only the uh, and uh the private data. Okay. That's a difference between the two. In this talk I will talk about the central model.

Just because it's easy apart from randomised response, just because it's easy and most more work in the central model. But of course now there's also a lot of work done in the local model because, um, that's very often what, what happens when, you know, uh, internet companies are collecting data about their users. It's there usually it's in the local model that the users algorithm, the user's machine or iPhone is sending this differentially private data to the server.

Now this is all great, but are there disadvantages? And yes, there are two main disadvantages. The first one is differential. Private algorithms are usually slower. Like before you would have had just to say yes I have a or not. And now you have to flip this two coins. And you know I just did this with volunteers and some talk and they had to think about, you know, when to do what. And so, you know, it's not so easy anymore. It's a bit more work.

And in general differentially private algorithms are slower than non differentially private algorithms. And the question is just by how much. And the second even larger problem is the loss on accuracy. Okay. So for example which we also call error for example for randomised response. So what we are interested in is P would have been the two answer. Yeah. And now what we the way we measure the error is as follows.

We say um what is the parameter alpha such that with probability at least two thirds q falls into this range between p plus alpha and p minus alpha. Okay. So with probability, at least two thirds Q must fall into this range. And then we say if I say accuracy. There's also a variant where, you know, this is some arbitrary parameter beta. But for this talk I just plugged in two thirds. Okay.

So the accuracy is the alpha such that with probability at least two thirds, the answer that you get will fall into this range. And of course you want the alpha to be as small as possible because you want to be as close as possible. Yeah. And so whenever you write a paper about differential privacy, it's you should of course prove that your mechanism is epsilon differentially private. Well whatever notion you of privacy use. But then please always improve accuracy.

But it's very important that you also show a bound on accuracy. I can give you an algorithm for any problem that's always epsilon differentially private. I always say zero. You know, it's definitely private. It doesn't leak any information, but it has miserable accuracy. So just writing a paper where you say, oh yeah, have a epsilon differential, private algorithm is not impressing me at all. You better show, you know, that it has always a small accuracy, even in theory.

Well, with experiments. Yeah. So it's a very important part. And I'm emphasising this because I'm seeing papers where this is not done right. And I'm kind of wondering why people would not do that. Okay. So now for randomised response. What do we get. Well you can show actually using some kind of bounds that although maybe at most uh, some constant times square root of another constant divided by n okay.

Which makes sense. So the more volunteers you have to participate, the smaller the alphabet get okay, the better the accuracy will get. If I have only three of you, you know I might be very noisy. If I have, uh, 10 million, it's much better. But so what have we seen? Um, so we have seen so far randomised response is LN3 differentially private and it's o of square root of one over n accurate. And in general the definition of accuracy is as follows.

We say a mechanism is if accurate if for all inputs d the probability again over the randomness of the algorithm that the two output minus the noisy output is at most all for that probability is at least two thirds. So there is a good chance that the answer the error is at most alpha. And so the goal in differentially private algorithms research is always you have some computational problem. Yeah. Nothing as simple as computing an average. Know something more complicated.

If you wanted to find a max cut in a graph or you know your favourite problem and, um, you want an algorithm that is differentially private, it should have small additive error. So small accuracy and the running time should be almost the running time. And also the space usage should be almost as good as the best Non-private algorithm. So these are the three things you're trying to optimise. And it's usually in that order.

Which means people. People in papers usually stop after the first two, but sometimes already after the first. Okay, now there has been a lot of research. Let me stop complaining. There has been a lot of research in differential privacy on bunch of problems. I only talk about the simplest, which is a binary count in this talk here. Um, but there's also, you know, lots of other graph properties clustering.

Machine learning is of course a very hot topic, differentially private machine learning and also mechanism design. So this is in auctions for example. Okay. Before getting into more definitions, let me give you some deployment examples so that you are motivated to keep on listening to me. Apart from having to sit here anyway, so um, there have been a few large scale deployments actually, and the first one was by Google, the wrap system, which was in Google Chrome.

Uh, these are the authors here. And, um, they built this system in Google Chrome. And what it did was it detected the most frequently chosen homepages and also the most frequently chosen, uh, running process names in Google Chrome. And that's the information it sent back in a differentially private manner such that nobody could come to you afterwards and say that you always visit this homepage. So in a differentially private manner. Um, then more a little bit more recently.

Uh, it's in Gboard. That's the Google, uh, keyboard on Android. So if you're typing on Android, um. In Google Chrome. What do you mean? You might use the cheap word. And they are. They did also learning for word prediction. So this is when you type in high and then it says the next word. Yeah or whatever. Thank. And then comes you. Yeah that's what they do in here. This example high. How off. Then they suggest you. And this is also done um based on differential privacy.

And um this is the URL of the paper. Um, yeah. They evaluated the whole system. And they did more things and they evaluated it given evaluation of the whole system. Then. As I mentioned before, it's been used by the US 2020 U.S. census. There was a lot of battle about whether they should use differential privacy or not in the census. For the 2010 census, there were some privacy leaks, some awkward privacy leaks.

And that's why they said, you know, the people who were running the census said, we need to use some protection, so let's use differential privacy. And the statisticians always said no. And the, you know, algorithms, people said yes and argument back and forth and back and forth. And so the decision was then made. Yeah a compromise. Yeah. Privacy is used. But Epsilon is 19. That was the time.

That's now the statisticians are still very unhappy. Just recently I read an article that for some native people. Right. If you have a small group, uh, then the noise that you add makes a lot of difference. And so some native people are now complaining that, you know, based on that data, they compare the 2010 data with the 2020 data. They cannot, uh, you know, say how much their population has really grown or where there are needs and blah, blah, blah.

And the census data is usually used to determine funding for native people. So it's of course important for them to be able to say, you know, we on so much. And so there are there are complaints about this for small populations, especially, uh, the statisticians are also complaining because some people, they complained, live in Manhattan in the Hudson River. And that should not have either. But, uh, then the, you know, algorithm, people say that's just noise, right?

It's not many. But anyway, so there are some complaints and there are some discussion. And it would be interesting to see what happens in the 2030, uh, census. See. And then there was also some, uh, deployment in Apple. And this is also one of the strange papers because it has no authors except for like Apple Privacy Research group. Uh, very private, I guess. Okay. Until they showed that, uh, uh, for example, how you can use it to determine frequently used emojis.

So, for example, for English speaking, if you have an English keyboard, uh, then it seems like the smiley face here where you have, you know, tears where you smile a lot. It's very popular, but the laugh is very, you know, only the second most popular, much lower level for the French, less my less. But I have much more love, for example. And so they figure that out with differential privacy.

Also, they looked at, you know, uh, memory usage, which webpages train their battery a lot in Safari and things like that, and in which health topics people are interested in because, you know, they also sell you the Apple Watch and they want to know, are you someone you know who might be interested in health? So they also, uh, collect which health topics might be interesting. And finally, there was also work at Microsoft published by these people.

And, uh, this is in Windows 10. Uh, they use this, uh, for estimating application usage statistics. And they also do the very. So this is a picture of their paper. Very careful evaluation. Okay, so you can see all become many big companies are using differential privacy already when they collect data about their users. And so now let me switch back and uh, um, tell you a little bit more about the theory behind it.

So I said, in a static setting, the most simple problem that we are going to look at is binary count. So here what you have, you see, you have a, uh, string, a binary string D of length n. So think of n users and you want to just figure out the sum so the number of ones in the string. This is basically the same as before for randomised response. Right. That was finding the population. The percentage is the same as you know finding how many have it.

And then divide by the size by n. Okay, but I will give you now a different mechanism that's more efficient. Meaning having lower accuracy. Um, so better accuracy. Um, so esp for the prime isn't neighbouring if it differs from D by one bit. And so now it turns out there is this very nice distribution called Laplace distribution. It existed before, but it looks like it's perfect. It was specifically designed for differential privacy which of was not.

But it has only become really popular uh, through differential privacy. So it's this function here. So, uh, this distribution. So the Laplacian distribution has a parameter here I call it one where epsilon. And then you get this function. So at some value x you get this density function here. And so what you have is you have then uh the variance of sigma. And so if the sigma is very large like here in the light blue, what you get is you get a distribution that's very spread out.

Yeah. So the probability that you pick minus ten is still, you know, not constant or that you pick plus ten. Um, the smaller the sigma the variance, the more picky the distribution is. Yeah. Uh, I should say it's symmetric around zero a very important symmetric around zero. And the more peaky it is. So it's a red one here for example. It's very picky okay. So if you sample from it you're very likely probability almost a half that you pick zero. Okay. So now what is a Laplace mechanism?

The Laplace mechanism is very simple. Uh, you take the two, answer the sum of the ones, and you add a random variable that you chose according to this Laplace distribution. Okay. And it turns out you can show that this is epsilon differentially private. And that is Ellen three over Epsilon accurate. So. So why is this? What is the intuition? Well, you see, if you want good privacy.

Yeah. So you have small epsilon. Then one over epsilon will be large, which means the variance of your distribution will be large. So you're more like the light blue here. Okay, so you're not too unlikely to actually add a number like ten to your numbers. But you remember, the input differed only by one, right? So you had two input strings that differed by one bit. And now you add a noise of, let's say ten to the one to the first string, the one that had this one bit not set.

Let's say, let's say the first bit. They define the first bit. The first string has zero. In the first bit the other string has one in the first bit. And now the probability that for the first one you pick the ten. And so you would output. Then whatever the two ends are plus ten. And the probability that you take the first. The second one. And you pick the nine. Yeah.

Which in this case you would then output in both cases the same thing, namely the two ends out of the first one plus ten because the second string had the first bit set. Yeah. So the probability that they are almost the same are very close right. Just a difference here between picking a ten or a nine. Okay. And so you can show that by taking this parameter one way epsilon for the Laplace definition you get exactly epsilon differentially private privacy.

It seems like it's perfectly made for that. And for the accuracy if well, you need to do a little bit of work, but not too bad. Some properties of the Laplacian distribution. Okay. But this is very nice. It's independent of n okay. So accuracy is independent of the size of the input. Okay. That's an icing here. So that's good. But what if data is dynamic. And this is now finally where my book comes in. Because I like to always look at the dynamic setting where the input changes.

Okay. And now the people cannot agree in the field what they should call the dynamic setting. They don't call it dynamic, that's for sure. They call it either differential privacy under continual observation or under continual release. Well, on streaming data. Well, sometimes on data streams. Okay. They are very creative, but it's all the same. It's where you get the data, one after the other.

Okay. So for the general model, what you have, you have a stream of input data from the universe, you of some length let's say t. And now the privacy definition needs to be adapted. Um, and I'll talk about this in in a second. But what you need to do is every time you get some input, you need to give some output again. Okay. So before in the static setting you had one input and you gave one output. So you had just one chance of leaking information right.

By now you have t chances of leaking information because you need to give two outputs after each input. And after the next bit of input arrives, you need to give some output again. So let's say for binary counting you need to keep on outputting the sum. So far the sum so far. And so you have t times the chance to leak information. So intuitively already the accuracy should go up. You need to add more noise. Okay, so one thing the open problem actually is continual binary counting.

And I'll tell you what to open about it. Um, so where you have just a stream of input data zero one of length t, and after the t split is given for any t between one and capital T, you need to output the sum of all the bits so far. And, um, now, in differential privacy and the continual observation, there was a two definitions of privacy. I'll just show you one. Um, and I'll just mention that there's a second one.

And this is the event level differential privacy. And that's very similar to what we've seen before. So we say Sigma and Sigma Prime are neighbouring. If they differ in one input value in which means in one time step, in one time step you can give something out, get something else. And then in the definition of differential privacy now you need to say over all time steps. So before we said over all neighbouring databases. And so now we need to say over all the time steps.

Uh, before we also had over all possible outputs, the outputs on our two dimensional vector. And then for any neighbouring inputs you have this definition again. And is. It's as it's late in the talk and I don't want to bore you too much. I skip this definition of user level privacy, but there is a more, uh, you know, advanced or more stringent the harder version, which is called user level privacy. So everything that I'm going to say now is about event level privacy.

And so now you care again about the accuracy. So what is the proper definition of accuracy for the continuous setting. So now you are outputting a two dimensional vector right. Because you are outputting T numbers. And so the error definition that is most popular is that you take the maximum over all time steps. So if you think of this as being a vector. So this is f of sigma is a two answers a two dimensional vector with the two answers. And um, alpha is the one that the algorithm gives you.

Okay. And we will see this in a second. Then we will take the max, the infinite in one, which is the maximum of all the entries in this two vectors. Okay. So we say the mechanism is also accurate. If for all input streams of length t, the error of the two answer versus the output of the mechanism. Here, for every plug in the output of the mechanism, the probability that most is the difference is at most, although with at least probability two thirds.

But. And now what's known about continual binary counting. Well, you could do it in a very naive way. The naive way would be to trust every time. At every time step you just add Laplacian noise. This is like a static algorithm. Run the static algorithm every time. But now you're outputting tee times. So and if you do composition right. Then the differential privacy of these tee outputs is the sum of the epsilon values.

Okay. Since you are outputting t of them, you have to make sure that each of them is epsilon over t differentially private. Okay. So you have to require each at each output you have epsilon over t differentially private which means you need to use Laplacian noise t over epsilon. Okay. And then you can just use the composition that I showed you before, and you get the sum of all these key outputs every time epsilon uh of this t epsilon numbers every time epsilon over t.

So you get that your epsilon differential product. So that's good. But what is your accuracy. If you have Laplacian noise t over epsilon your accuracy is t over epsilon which is really not good. It's the number of times steps over epsilon. Okay. And the question is can you do better. And the answer is yes, but not as good as you would like to. So what? What can you do? You can do an accuracy of loks squared t over epsilon.

That's good. Much better than t. But there's also a lower bound of log t. So this was shown by to work Noah Petersen Ross Blum and this by these authors and Chauncey and so on. They actually showed slightly worse bounds. But you can, you know, basically get these bound said, okay, so there is actually a gap. Yeah, between the lower bound.

So you cannot show anything better than log T as this lower bound, no matter which mechanism, there cannot be a mechanism better than log t. Um, but there's also this upper bound of log square T. Now, this lower bound of locked is actually really a disaster, right? Because if t goes to infinity, think of you use your phone every day, you know, and probably hundreds of times per day.

Yeah. So your t they if every time they sending some information back whether you know you looked at health related data or not for example. Yeah. Then your T is very quickly very big. And this says here that then the accuracy has to go down by a factor of log t, where the companies who are collecting data about you don't like this, right. They don't want to have such a bad accuracy.

So what they do is supposedly that's what they say is every time period, let's say every month, they give every user a certain privacy budget. And then when they whenever they use your data in some computation, they subtract something from your privacy budget. And if your privacy budget uses reaches zero, they don't use your data anymore for whatever training or whatever algorithms they use.

They don't use your data anymore. Until the next month, when they reset your privacy budget to be the initial value again. Which means basically that they just delaying. Right. The privacy loss right there. Just no, uh, they're just delaying it a little bit because for a while they're not using your data.

But that's how they deal with it. But on the other side, you know, if they don't want to have an accuracy error of to you, that's really the only thing they can do because otherwise, you know, it's too bad. Now it turns out there's a slightly relaxed version of differential privacy called epsilon delta differential privacy, which I don't want to define. And there you can do it even more complicated. There you can also actually log t to the 1.5 and was shown by Tung XI and so on.

Um, and so here we did some work together with Baga and Apache. Uh, then what we did is we actually fixed figured out the constant. They had a really bad constant. We did an algorithm, a different algorithm where we show the exact constant. And, um, the other advantage of our approach is that our accuracy loss. So our noise is very smooth. While for the other algorithm, the noise was very, um, smooth, as you can see.

So for each point in t this put a dot okay. And so you see that it was very smooth. And the reason being that the amount of noise, the number of Laplacian noises that you would add up depended on the time step. And it might even be just one Laplacian noise, or it might be locked in Laplacian noises. And so you got very non smooth noise in the setting. And in practice it's a disaster. So you want to count how many Covid cases there are right.

And you don't want something where you say I don't know where they went up or not. Maybe might have been the noise that made it go up. Or maybe the data really goes up. So you want the noise that goes up in a smooth way and not such a jumpy way. Now let me just because I promised very, very high level sketch how this algorithm works. Okay. It's actually simple. Uh, even though you might not believe it, it's very our algorithm is very simple because, you see, what do you want to output?

You want to output the sum of ones. Now, I can write this down as a linear algebra problem where you have a matrix A that is lower diagonal with all ones down here, and also rows up here. And then the output that you have is simply a times sigma where sigma your input vector. Yeah, that's exactly the output that you want. And even better if you are at timestep t. So this is the output for all. If you do this product you get all the answers for all time steps.

If I'm just an time step lower T, I can just take the t by t principal submatrix. So the small version of a yeah is up t, but I just take the first t rows and columns and take just the first t entries of my input into this product. And the t coordinate of this product is, uh, the answer I want. So this is the correct answer. And so now how do we do make this differentially private.

What we do is we show that there is a composition of A into two matrices L and R. And they are both lower triangular as well okay. That's the crucial thing. They need to both be lower triangular to work in this continual setting. And so they also have this nice property. If you restrict them to the top t by t u k the l's up t and R's up t and that product is still R is up T.

So everything is beautifully well-behaved. And so now what our mechanism does is a time step t. It takes r t and multiplies this with the input sigma t. And then it adds a noise vector to it. Okay. So it adds noise vector to it here. And then it does post-processing by multiplying with l t. So our mechanism is this very simple thing here. Okay. Where you can prove that, you know, this is an unbiased estimator and you can show that this is differentially private.

So what we get is we get a maximum epsilon delta differentially private. Um, we get this epsilon delta differentially private mechanism that has this accuracy. So then the whole proof is linear algebra. Okay. Good. So this is, uh, this is what we did here. Now, there has been work for in the continuous setting in, uh, these five cases. So we have data streams, which are sequences of numbers.

We have databases which are sequences of rows that have been analysed, data structures where you have sequences of operations, for example, heaps, um, graph algorithms where you have again sequences of operations, insertions and deletions of edges. Uh, clustering algorithms where you have insertions and deletions of points and you want to maintain some clustering of them.

Uh, all this has been studied, but of course, it's still in the beginning, meaning we don't necessarily have the best, uh, answers for them. So what are future directions? So one is this look at user level differential privacy, which I just mentioned before. Uh, the other one is local differential privacy that I defined. And there has been very little work of local differential privacy under continual observation.

And then also look at the running time and actually implement this algorithms and see how fast they are and how they actually perform. Yeah. So actually do algorithms engineering. And then the specific problems that I'm still interested in come up with better algorithms for clustering or graph properties and also, uh, close the gap of this upper bound, which is located to the 1.5 for log square, T for continuous binary counting, and this log T for the lower bound close the gap.

These are the two specific problems I want to look at. And then I want to just thank my co-authors.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android