¶ Trailer & Intro
We all know that icky feeling when your privacy is violated. People feel like their phone is spying on them, and all these things are happening, right? A lot of times when we talk about privacy, there are a lot of technologies that could expose us to this privacy risk. We hear a lot in the news about data breaches, data leaks, people's privacy,
Cambridge Analytica. Katharine Jarmul is the Principal Data Scientist at Thoughtworks and the author of Practical Data Privacy. If you don't know what data you have, if you don't know where it lives, then you kind of end up in a place where privacy is not possible. If you don't know what you're going to use the data for, you're probably just amassing large cloud computing fees for no reason. Why can't we hold a higher bar for organizations to create secure and private by design
services? Maybe you can help explain what privacy by design is? Privacy by design is actually quite an old concept. There are seven design principles. You see a lot of software thinking: allowing user choice, allowing transparency, but also by default building things that respect the user. And a lot of people associate privacy with PII, personally identifiable information. Do you think that privacy relates just to PII or is it more than that? Privacy is a lot about having
the...
¶ Career Turning Points
Hey guys, welcome back to another new episode of the Tech Lead Journal podcast. Today we are going to cover a topic that is becoming trendy in most parts of the world: data privacy. I have Katharine Jarmul here. She is the author of a book titled Practical Data Privacy: Enhancing Privacy and Security in Data. So Katharine has a lot of background and expertise in machine learning and data science.
But today we'll try to cover the topic in more generalist terms, and maybe we can discuss the importance of data privacy and some potential risks that you need to be aware of whenever you build your projects or products. So, Katharine, welcome to the show.
Thanks so much, Henry. I'm excited to be here.
¶ Data Privacy Landscape
Right, Katharine, I'd love to ask my guests to share some turning points in their career that we all can learn from.
I think one of the turning points in my career, one of the early ones, is that I started my job in technology as a data journalist, or what we would today call a data journalist. Kind of like, how do we use data and interactives to tell stories and to support investigative journalism? I was at the Washington Post
doing that. There was certainly a large turning point where I left kind of the field of media and journalism and I went into startups. And at the time I went into startups that were focused on media companies as a customer base. And that led me into what I would today call large scale natural language processing or very early models compared to
what we now call AI models. But still a lot of exposure to how do we do language processing at scale, and a lot of exposure to thinking through things like parallel computation. Those were the days of Hadoop and all these other types of problems that came with how do we process large-scale document stores and use them for some sort of other service. And then around 2014 is when I moved from Los Angeles, California, where I was based, to Berlin, Germany, where I have lived ever since.
And this was also the time when NLP, so the space of language processing, was moving from simpler model designs, or what we would today maybe call simpler model designs, into deep learning, which basically fuels a lot of the architectures that we talk about today when we talk about AI. And I basically took a year off of working and just studied deep learning at that time to kind of update. I already knew a lot of the math behind it because thankfully I always loved math.
But I wanted to really upskill myself: how do we think of the problem from a statistical point of view? At the end of the day, deep learning is also statistical learning. But how do we move from that space to maybe more of a notion of applying concepts of calculus and linear algebra to a learning model? And that was a good idea because, yeah, deep learning then kind of ate most of the world of machine learning.
Then the final career change was maybe about three or four years after that of doing deep learning. I started thinking about, or became enmeshed in, the problem of how do we think about what we would today call trustworthy and responsible AI development? Back then, most of the time we used the concept of ethical machine learning, and in looking into that problem, I became very interested in data privacy and data security of machine learning systems and models themselves.
And that kind of fueled what eventually led to writing the book and obviously my current work, which is as a specialist in privacy and security of machine learning models.
Well, thank you for sharing your story. I find the journey that you had quite interesting, right? And I think it's very unique that you took a career break simply to study deep learning and all that, right? So most people maybe took a break to do something else, you know, beyond work, or some did go study.
But I think it's still pretty rare to go deep into a certain area just to study and do more things to become an expert in it.
This episode is brought to you by Swimm.io, and I'm excited to have its CTO and co-founder Omer Rosenbaum with me today to tell you more about Swimm.
Hi Henry, very nice to meet you and thank you for having me.
So tell us a little bit more, what is Swimm.io?
At Swimm, we want to help companies understand their code bases. We combine static code analysis with generative AI to create comprehensive documents that help you navigate the code base. As an engineer myself, I wouldn't want engineers to spend so much time understanding existing code. I would want them to spend time creating and building new stuff. When you have code that has accumulated over decades, and especially in legacy languages that not many people are adept at nowadays, then the problem is even bigger. Swimm.io specializes in helping mainframe developers understand their code base.
Why mainframes?
We actually didn't start there. COBOL had been considered obsolete by some people for a few years, and I discovered that it's not really obsolete, not at all. There are more than 800 billion lines of COBOL code in production, and they drive lots of the business in the world. And we got more and more requests from customers to help them understand legacy code bases that were written decades ago and accumulated over a very long period of time.
So from your customers so far, what are some of the success stories that you can share?
So we worked with an analyst who shared with us that it took them a year to document a single mainframe application, and using Swimm they were able to document a similar application in a matter of hours. So saving that amount of time enables them to focus on other tasks.
Thanks, Omer, for sharing with us about Swimm today. To learn more about Swimm, check out their website at swimm.io.
¶ PII (Personally Identifiable Information)
So today's topic is data privacy. I think in a lot of parts of the world, this topic has become mainstream, right? We hear a lot in the news about data breaches, you know, data leaks, people's privacy, you know, the Cambridge Analytica kind of thing, also as part of the privacy thing. So maybe you can give us an overview first: what is the current landscape of data privacy and what should people know about it?
Yeah. I mean, a lot of times I try to orient people first with the idea of personal or, we can think of it as, sociological privacy, because sometimes people hear data privacy and they just think, oh, that's for lawyers. Like, those people are responsible. They studied privacy law. Those people are really, really important in the field of
privacy. I'm not trying to diminish the importance of law and regulation in the field of privacy, but I think we can all relate to privacy because each one of us probably has some sort of personal or what we might call individual understanding of our own privacy. And often that's informed by whatever cultural influences we
grew up around, whatever society. Like, where did privacy play or where did trust play in whatever societal bonds we kind of grew up with and were socialized with, or maybe even where we live now. So, for example, Germany, where I live now, has a very different relationship with privacy than where I come from, California. And so we can also be influenced by kind of shifts or changes along our life that allow us to
see the problem differently. But essentially privacy is a lot about having the autonomy and the control to decide who I want to show up as and what I want to share with whom under what circumstances or contexts. And I think we all know that there's been times probably in our lives when information that we didn't want to share with a particular person or with groups of persons got out some which way, sometimes through technology, sometimes through a person, maybe both.
And we all know that icky feeling when your privacy is violated. And that's how it can also teach us how much it is about trust, how much it is about context, how much it is about us having a choice. And I think that you can see kind of the social constructs in a lot of the regulation that you then read. Because then of course, we have regulatory or what I would call like judicial legal understanding of privacy. And then we have a technical understanding of privacy.
And kind of the best part is when all of those work together, when the technology is reflecting not only what we legally need to do, but also what we socially and culturally understand, and knowing that that can shift depending on all sorts of different things down to the individual level. I like the three things that you mentioned, right? First, it may be relates to the context and society, right?
So privacy in one particular area might be different from the norm of the culture in some other parts of the world. I also like the trust part, because most of the time when we use software, right, the concept of trust is a little bit abstract. It's vague, right? I know that I submit my details, but that doesn't necessarily mean trust is being created or established when I use the software. And the last one is about choice, right?
I do have a choice to actually maybe take out my data, delete my data or whatever that is, right? So I think those are the key terms that I just picked up from what you explained just now.
¶ Data Privacy Risk in Current Technologies
And a lot of people actually associate privacy with PII, right? Personally identifiable information. Do you think that privacy relates just to PII or is it more than that?
Yeah, I mean, PII is a great place to start. I don't wanna PII-bash here, but I think PII is a very small view of data that we might think of as what I like to call sensitive data. Within the field of sensitive data, of course we have PII. For people who don't know that term:
That's things like e-mail addresses, birth dates, names: things that we would say might be unique to an individual and certainly in combination are unique to an individual. And then we might have what I tend to call person-related data. Not everybody calls it that. Some people want to call it personal data. Some people call it, yeah, data related to persons. Anyways, there's all sorts of terms for it, especially when
you get into the legal realm. But I think of person-related data as things like what I click on; what I buy can be person-related; what I enter into search terms can be person-related. Because when we think of all these things in combination with each other, probably when we combine them, those can also point to uniquely identifiable
characteristics of a person. And you already mentioned Cambridge Analytica. Some of the original research was focused on can we use things like Facebook likes or very short Facebook surveys or those combinations to decide how this person might vote, whether they're easily influenced in voting, and these types of characteristics which we probably would think of as sensitive.
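As a minimal sketch of the point that fields which look harmless alone can become unique in combination, assuming a small made-up table (the records, the field names, and the `unique_fraction` helper are all illustrative, not from the conversation):

```python
from collections import Counter

# Synthetic, made-up records. No single field is a name or e-mail address.
records = [
    {"zip": "10115", "birth_year": 1985, "last_purchase": "running shoes"},
    {"zip": "10115", "birth_year": 1985, "last_purchase": "coffee maker"},
    {"zip": "10115", "birth_year": 1990, "last_purchase": "running shoes"},
    {"zip": "10117", "birth_year": 1985, "last_purchase": "running shoes"},
    {"zip": "10117", "birth_year": 1990, "last_purchase": "coffee maker"},
    {"zip": "10117", "birth_year": 1990, "last_purchase": "coffee maker"},
]


def unique_fraction(rows, fields):
    """Share of rows whose combination of `fields` appears exactly once."""
    keys = [tuple(row[f] for f in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Adding fields to the combination makes more rows one of a kind.
for fields in (
    ["zip"],
    ["zip", "birth_year"],
    ["zip", "birth_year", "last_purchase"],
):
    print(fields, round(unique_fraction(records, fields), 2))
```

In this toy table nobody is unique by postal code alone, but most rows become unique once birth year and a purchase are added, which is why person-related data can deserve the same care as classic PII.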
And then we also have other sensitive data that isn't really person-related, that we might call proprietary or confidential data. And a lot of times we don't think of that when we think of privacy, because obviously it doesn't really have to do with privacy. And yet the types of protections that we use for privacy can also be applied to things like corporate secrets, other proprietary confidential information that we also don't want to share outside of a particular context.
And so these all kind of can fall under some idea of data that we might need to protect in a different way than other data.
Nice. Thanks for the classification of the different kinds of data, right? It could be the sensitive, highly sensitive ones, it could be just person-related, it could be confidential data. So I think that totally makes sense, right? So not necessarily just person-related, it could be entity-related.
¶ Data Utility vs Privacy
And I think a lot of times when we talk about privacy these days, there are a lot of technologies that could expose us to this privacy risk, right? I mean, just to mention a few things, like social media, definitely. And then you have web technologies like cookies, you know, when you browse certain things, right? It tends to follow you from one page, one website to the other. You have connected devices. You know, we talk about, you know, Alexa or, you know,
the Google Home thing, right? And lately also AI, right? So people just, you know, ask AI about certain stuff. Sometimes it could take your data away. Looking at those technologies, what would be your advice for people when they use them, right? Because it's such a ubiquitous thing now. What should people care about in terms of their privacy?
Yeah, that's a hard question because obviously
everybody might be different. And again, like we said, like there's also different levels of trust that people might have with different services, right. So you might really, really like Copilot or whatever it is that you use everyday and you might decide, you know what the usefulness of this is enough for
whatever trade off that I have. At the end of the day, a lot of my work is focused around informing organizations what they should be doing to better protect the people's privacy that use their services and products.
And I think we run into this problem, and I think this happens a lot also in information security or cybersecurity, where we kind of decide that it is a user problem, and that it's your job to figure out what you think is the most secure e-mail service and what is the best thing. And like, oh, you're bad if you use WhatsApp because WhatsApp's not secure, which is not necessarily true, right?
But this kind of culture, I call it kind of a blaming-the-victim type of culture, where, if I want to talk to a person and they only use one particular messaging application, why is it my job to try to inform people what is the most secure or not most secure? Why can't we hold a higher bar for organizations to create secure and private by design services so that people can choose whatever one they like? If they like the colors, I don't care.
They like the interface, they like whatever product they like. If they want to buy a package deal from a certain cloud provider, fine. If we're looking at privacy more holistically, it's less about "you must use this software." Instead, we need to hold software companies and product companies to a higher standard so that people can choose whatever it is they like, and I'm very pro-choice on that one.
Obviously there's lots of good advice out there on how to inform yourself, doing things like reading through the boring privacy policies and figuring out how to make it a fun game for yourself. There's plenty of good advice on how to make those choices. But I think it almost focuses the blame on the individual
who has to use things. And at the end of the day, I want people to, yeah, be able to connect and use technology in cool ways and not have this very privileged burden of reading through every single privacy notice to figure out which best aligns. And, oh, by the way, then it gets updated six months later. They got to read through and
compare all of them over again. Mozilla has a great resource, by the way, for people who might be looking for what criteria they might use, called Privacy Not Included. It reviews things like, I think they did one on autos, like automobiles. I think they did one on connected devices, home assistants like you mentioned,
and a few other things. And then at least you can look at what criteria they used and maybe start to build your own criteria to inform your choices, should you have the time and energy to do so. And there's no shame if you don't, because at the end of the day, we've got to hold the orgs accountable.
Right. So very interesting the way you explain that, right? So it used to be, you know, a user's problem, right? So you are the one who is responsible for taking care of the data that
you share. And I think we can read all these privacy policies and all that, but in, I don't know if most, but many software products, we don't have a choice. We simply have to check "agree", and, you know, whatever language they provide, we just agree, right? Otherwise we can't use anything from the software at all.
¶ Privacy by Design
So I think that's a very good segue now to move towards the organizations or the teams who build the software and the products, right? So the first thing is, I came from a belief, right, in the past, especially in the era of big data, that we have to collect as much data as possible, you know, from the users so that we can derive insights, we can analyze things. And now this seems to be clashing, you know, with data
privacy. So these days, what would you advise people in terms of balance, right, between collecting as much data as possible versus, you know, treating privacy more seriously?
I think, first off, I want to say, I'm not trying to get anybody to lose their job. So if you think any of the advice I give would make you lose your job, don't do it, right? So that's my mini disclaimer.
But as you mentioned, I think there's starting to be a more popular shift, even among consumers, so to speak, the average person, toward thinking more about privacy. Namely because of this trend, as you mentioned, of the big data era of just collecting, you know, at will and doing all sorts of things that at the end of the day make people feel creeped out, you know, make people feel like their phone is spying on them and all these
things are happening, right? And in some cases that might be true, and in other cases it might be a fact that we have so much data that we can actually infer sensitive things about people without their knowledge and also without our intention, right? When we think about recommendation algorithm development, search response algorithm development, so search algorithms and content algorithms, when we think about a lot of this, there can be latent, what we call latent variables.
And what I mean by that is, you know, because you follow and you like these five things, we kind of have learned your gender or your age or your family status or these other things. We didn't explicitly try to learn them, but we kind of learned them by amassing, as you say, so much data and then putting an algorithm on top and not controlling for anything
like privacy. And then you get super creepy ads and you're like convinced your phone is spying on you and then you must hide it or you know what I mean? Like, I don't want people to be afraid of, I don't think any of us want people to be afraid of technology. I also don't personally want to be afraid to have a mobile phone. It's a nice thing to have.
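The latent-variable point above (learning someone's age or gender from what they like, without ever asking) can be sketched roughly as follows. This is a toy illustration with synthetic data; the user IDs, page names, and the `infer_attribute` helper are hypothetical, and real recommendation systems learn such signals far less explicitly.

```python
from collections import Counter

# Synthetic data: page likes for five users.
likes = {
    "u1": {"retirement_tips", "gardening"},
    "u2": {"gardening", "cruise_deals"},
    "u3": {"skate_videos", "exam_memes"},
    "u4": {"exam_memes", "dorm_recipes"},
    "u5": {"gardening", "retirement_tips", "cruise_deals"},
}

# Only some users ever told us their age group. u5 never did.
disclosed = {"u1": "60+", "u2": "60+", "u3": "18-24", "u4": "18-24"}


def infer_attribute(user, likes, disclosed):
    """Guess an undisclosed attribute from overlap with users who disclosed it."""
    votes = Counter()
    for other, label in disclosed.items():
        if other == user:
            continue
        # Each shared like counts as one vote for the other user's label.
        votes[label] += len(likes[user] & likes[other])
    if not votes or max(votes.values()) == 0:
        return None  # no overlap, so nothing to infer from
    return votes.most_common(1)[0][0]


print(infer_attribute("u5", likes, disclosed))
```

Nothing in the input says how old `u5` is, yet the overlap in likes is enough to produce a guess, which is the kind of unintended inference that feels like "my phone is spying on me" from the user's side.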
And so when I think about like this shift that's happening, maybe one of the things that you can start talking about in an organization that may or may not already have a culture around thinking about privacy is how do we reason about the trust that
we're creating? And I think this is great for tech leads, as you aptly point out, and product people, because at the end of the day, when you're like a technical lead or a principal or a senior person in the type of architecture decisions, product decisions, data decisions that get made and you're a top product person, those people can have a real conversation about, do we want to talk with our customers about
this problem of privacy? Do we want to talk with our customers about what's the mental model that they think about how our service works? What parts of this is up for debate? What parts of this is not? Because there are like indeed businesses that by default, probably privacy is not going to play a huge role, right? But you can at least start that conversation.
And what we often talk about in the field of technical privacy is balancing between the most privacy we can offer, which would be collecting no data, and utility, right? This is what we think of as the privacy-utility balance or trade-off. And utility there we might think of as this old idea of: we're going to collect every single mouse movement or every single character you type in, in the hopes of getting
some sort of insight. Which I would also argue as a data person, if you don't know what you're going to use the data for, you're probably just amassing large cloud computing fees for no reason. And so it's better to have smaller experiments with less data and then to grow it over time. Again, if you will lose your job for doing that, I'm sorry. And maybe you can find a new job one day. Right. So I think you mentioned something very important, right?
And many, many privacy experts also say the same thing, right? If you don't know what you collect that data for, then don't do it, right? Don't store it, don't do it, don't ask for it, right? So I think that's also very, very important, right, for everyone who builds software these days, right? Especially the technologists, right? And also when you build AI models, right? I mean, in the past we also simply said that machine learning needs as many variables as
possible, right? So that it can train the model better. But again, if you don't know how you want to use the data, maybe it's better not to collect it.
¶ Data Governance
And this comes back to this thing called privacy by design. I think it's mentioned a lot of times, right? In security, there's secure by design. In privacy now there's privacy by design. So maybe you can help explain what privacy by design is and how we should use it.
Yeah, Privacy by Design is actually quite an old concept, which is pretty cool. I think it was developed first in the early 2000s by a researcher in Canada, Cavoukian, I don't know if I pronounced the name right.
Anyways, she developed these principles as part of research I think she was working on, as somebody that understood both some of the privacy regulatory aspects and also the software aspects. And when you read the original privacy by design, I think there's seven design principles, you see a lot of software thinking. This is like really pre-data, pre-algorithms as a normal part of software design.
And a lot of it talks about the same types of themes we're talking about now, allowing user choice, allowing transparency, but also by default building things that respect the user, that respects the fact that privacy does have all these, you know, is informed by context. And so if I put something in a form, it doesn't necessarily mean I want that to be used to train an algorithm, right? That we understand like how the user is imagining what it is they're doing.
And that we're building technology that hopefully mirrors that or, if not, makes it more obvious what is happening. And then security principles, so basic, not only application security but also infrastructure security principles, so that we don't end up exposing something that we've collected in trust and that we are even using responsibly, but then open the system up for potential attacks or other ways of exfiltrating that data and using it for another purpose.
And this is where I think we can take privacy by design. We can use that to architect and build our systems. And then we can also evolve that thinking for new things like building large-scale data systems, algorithmic systems. And then we have a huge problem now around third-party data systems, you know, which I don't think are by any means
private by design. And so when we think of third party, it can either be services we're using, where we are then the intermediary between our users, who maybe we have a direct relationship with and direct trust with,
and the third-party system we want to use, which is fine as long as we're really clear and as long as we're actually reviewing the third-party vendors that we use for things like privacy. Or the other way around, where we're actually a third-party vendor and we have zero access to the users. And this is a lot of us that work in the B2B space.
We don't actually, I mean, usually the companies that we're interacting with, they might have customers or they might even have customers who have customers, right? And so how many layers until we get to a human who can make a choice about whether they want their data used that way or not.
And do we have a good review process to make sure that that kind of chain of trust is verified and that it's mentally sound both the way that we're dealing with these trust relationships and also how we're eventually communicating all the way down to a person who can say, you know what, I'd actually rather not if I could opt out. I'd rather use this service, but without the interface to ChatGPT, for example.
And I think we kind of see this in some of the new products that come out, that there's a little bit more transparency of, hey, we built it this way, here are the services we're using. Do you want to use this particular feature or not? At least I see that a lot in the way that Apple is choosing to design some pieces of Apple Intelligence and so forth: it is trying to have that conversation in the open. Does it always go well?
Maybe not, but at least trying to start that conversation with users in the open about how they can better understand and control the way that the data flows through the systems. And I would call that today's updated version of thinking through privacy by design.
So you mentioned, you know, third parties and data intermediaries. These days, we use a lot of SaaS, right? I mean software product teams,
right? We use a lot of SaaS, you know, maybe a managed database or serverless database and things like that. So that kind of implies the data can be stored in a lot of other places, you know, beyond our premises, right?
¶ Retention Schedule
So I think this also speaks very, very true to the fact that sometimes if your organization grows really large, it's very difficult to actually know what data you have collected, where it resides, the retention, you know, who gets access to it, right? So tell us about the importance of this, you know, data governance, because in many conversations I was in about data privacy, I think the first step is always to know the inventory of your data, right? Govern that particular
inventory. So maybe tell us the importance of data governance.
Yeah. And I think data governance is so critical even outside of privacy. But you said it perfectly.
I mean, if you don't know what data you have, if you don't know where it lives, if you don't know what pathways it goes through, like when we talk about data engineering, if you don't know where it's being processed, how it's being processed, by whom along the way, then you kind of end up in a place where privacy is not possible to some degree, right?
Hopefully you have, and again, I'll re-mention this, a strong practice of vendor review and vendor assessment, because I think thinking through risk always means: can we create a reasonable way that we as an organization want to think about what third parties we work with, what intermediaries we work with, what SaaS, as you point out, we work with?
And there's going to be no magical right choice, but there needs to be a commitment to some sort of choice, and perhaps some thinking through of why it is that you choose one vendor over another that relates to principles like privacy and data security. But then beyond that, if we don't understand the data flows that are going through our systems, if we don't have data properly tagged, like some people might use tagging, other people use categorization, other
people use all sorts of things. But if we don't have a reasonable way of organizing the, quote-unquote, data catalogs that we have and data stores that we have, then we also don't have any reasonable way to think through retention schedules.
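To make the tagging and cataloging idea concrete, here is a minimal sketch of a catalog entry and a check for missing governance information. The `CatalogEntry` fields, the dataset names, and the `governance_gaps` helper are assumptions for illustration only; real data catalog tools have their own schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """One dataset in a hypothetical, hand-rolled data catalog."""
    name: str
    location: str                         # where the data lives
    purpose: Optional[str] = None         # why it was collected
    sensitivity: Optional[str] = None     # e.g. "pii", "person-related", "confidential"
    retention_days: Optional[int] = None  # how long we said we would keep it
    third_parties: List[str] = field(default_factory=list)  # vendors that process it


def governance_gaps(entry: CatalogEntry) -> List[str]:
    """Return the governance fields this entry has not filled in yet."""
    required = ("purpose", "sensitivity", "retention_days")
    return [name for name in required if getattr(entry, name) is None]


catalog = [
    CatalogEntry(
        name="purchases",
        location="warehouse.sales.purchases",
        purpose="order fulfilment",
        sensitivity="person-related",
        retention_days=730,
        third_parties=["payment-provider"],
    ),
    # Collected "just in case": nobody recorded why, or for how long.
    CatalogEntry(name="clickstream_raw", location="bucket/raw/clicks"),
]

for entry in catalog:
    print(entry.name, governance_gaps(entry))
```

An entry with gaps is exactly the situation described above: without a recorded purpose, sensitivity, and retention period, there is nothing for a retention schedule or a vendor review to work from.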
¶ Data Privacy Practices & Techniques
What is a retention schedule? It means that we say that we collect data for a certain purpose for a given timeline. Maybe we collect purchase data for every customer, right? They go to our site, they buy something. Or maybe we hold other people's purchase data. Doesn't matter. The purchase data exists, right? And maybe we say we retain purchase data for as long as you're a customer or for up to two years after you close your account or something like this. Well, how do we even enforce
that if we haven't bothered to, like, connect that data? Then somebody has to write some really nasty SQL query. It may or may not work if somebody did bad data entry. And this all relates to better data governance, which relates to things like data cataloging, but also things like thinking through data quality metrics and how do we ensure that we're actually getting value out of the data we collect. And what I will say is if you improve trust relationships with your customers, you will get
better data quality. Like that's by default proven time and time again. And so if you're just spamming, collecting whatever you can, especially through some sort of intermediary, you're probably getting pretty low quality data. If the signal that you're trying to measure is something like is this customer interested in ABC and you have a good relationship with a customer, then you could just directly ask them, right?
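Going back to the retention schedule example (purchase data kept while you are a customer, or up to two years after account closure), enforcement can be sketched as below. This assumes purchases are already linked to accounts through a `customer_id` key; the record layout, the `expired_purchases` helper, and the 730-day approximation of "two years" are illustrative assumptions.

```python
from datetime import date, timedelta

# "Up to two years after you close your account", approximated as 730 days.
RETENTION_AFTER_CLOSURE = timedelta(days=2 * 365)

# Synthetic accounts: closed_on is None while the customer is still active.
accounts = {
    "c1": {"closed_on": None},
    "c2": {"closed_on": date(2021, 1, 15)},
    "c3": {"closed_on": date(2024, 6, 1)},
}

purchases = [
    {"id": "p1", "customer_id": "c1"},
    {"id": "p2", "customer_id": "c2"},
    {"id": "p3", "customer_id": "c2"},
    {"id": "p4", "customer_id": "c3"},
]


def expired_purchases(purchases, accounts, today):
    """Return purchases whose retention period has run out as of `today`."""
    expired = []
    for purchase in purchases:
        closed_on = accounts[purchase["customer_id"]]["closed_on"]
        if closed_on is not None and today - closed_on > RETENTION_AFTER_CLOSURE:
            expired.append(purchase)
    return expired


to_delete = expired_purchases(purchases, accounts, today=date(2025, 1, 1))
print([p["id"] for p in to_delete])
```

The check itself is short; what makes it possible is that every purchase can be traced to an account with a reliable closure date, which is the governance work the conversation is pointing at.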
And then you have both high data quality and high trust, you've created a good privacy and security relationship with your customer, and you know that the data that you're storing is actually accurate. So I think in the end, these all end up supporting each other. And I just want to say it is not a trivial undertaking to do data governance at scale, and most organizations are not doing it very well. They're just kind of starting right now, I think, for a lot of
organizations. So it's OK to decide, OK, we want a goal for our next few years of having better data governance, through which we can also have better privacy and data quality.
Yeah. So I could imagine companies that have been around for years, right, having production data serving lots of customers. Definitely it will be a challenge, right? Especially, you know, these days also people work remotely.
They work, you know, using the Internet, you know, something gets downloaded from the system onto a device, sent over WhatsApp or something like that. Or you have data pipelines, you know, data warehouses where data moves from one side to the other. I think it's really, really challenging. I would just empathize.
I also have this kind of problem as well: to catalog, to know the lineage, right, to govern whether it's sensitive or not sensitive, whether it can be shared with many other people or not. It's definitely a challenge, and I think it will take some time. If we build awareness, I'm sure maybe we can improve the data governance that we have.
¶ Privacy Enhancing Technologies
So maybe let's go into the techniques. How can we actually improve our privacy practice, right? I think the first that is always mentioned is, you know, pseudonymization or anonymization or data masking, right? So tell us, maybe for people who are not familiar with these terms, what are those techniques, right? And how can we apply them practically?
Yeah. I mean, I want to add a mention here that maybe also helps bridge data governance to actually implementing what we might call
mitigations or controls. And that is, these are not decisions to be made in a vacuum of like what controls get applied where or how do we decide what we want to do with what data? How do we even decide what sensitivity a data has? Hopefully your organization has the ability to create a data governance board or practice or even a data privacy practice where there could be multiple disciplines of people in the room.
Because I think you have to have people that are business informed that know what are the business goals of why we're trying to do this thing. Data informed, software and tech, tech informed, and then of course, regulatory informed and privacy informed. So you can also hire privacy professionals who are not legal experts, but instead focus on the topic of like privacy by design.
So if you have these people together, maybe also somebody from Infosec, you can start to say, you know what, we think the biggest challenges for our data governance or our data privacy or our data security, or all three, are these top three, or some new systems that we want to develop. So we're going to put that on kind of our collective road map for the year and we're going to prioritize those.
And I think then you start thinking, once you can understand what's the risk space that you have, so what are the biggest problems that you have with data privacy, then you can start thinking about, oh, do we need to pseudonymize? Do we need to anonymize? Do we need to run some other
type of setup? And then we get into the fun, fun bits, which I would say all fall under kind of these types of privacy controls, or start to go into what we would call privacy enhancing technologies, which are different technologies that we can use to help meet these goals of privacy at our organization. And most of us have seen pseudonymization at some point in time. It can take many
forms. It can be simple masking, like maybe you've used one of the popular libraries, Microsoft Presidio, to try to identify things like PII and mask them. Masking can take several forms. You've probably already used some hashing mechanism or something like this, maybe a one-way hash or maybe something two-way, like a reversible mapping. These can be ways to pseudonymize information. You've probably also already used redaction.
So just simply removing certain types of sensitive data from, let's say, a reporting tool or a BI dashboard to say, OK, now this is ready for organization-wide consumption or this is ready for our marketing report or whatever it is.
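As a rough illustration of the three small mitigations just mentioned (one-way hashing, masking, and redaction), here is a minimal Python sketch using only the standard library. The key, field names, and helper functions are hypothetical and chosen for illustration; this is not how Presidio or any particular tool works.

```python
import hashlib
import hmac

# Hypothetical secret key for illustration; in practice it would come from
# a secrets manager, never from source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """One-way keyed hash: the same input always maps to the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Simple masking: keep the first character and the domain, hide the rest."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"
    return f"{local[0]}***@{domain}"


def redact(record: dict, sensitive_fields: set) -> dict:
    """Redaction: drop sensitive fields entirely before a report is shared."""
    return {k: v for k, v in record.items() if k not in sensitive_fields}


# Made-up record, prepared for an organization-wide dashboard.
record = {"email": "alice@example.com", "plan": "pro", "ssn": "000-00-0000"}
report_row = redact(record, {"ssn"})
report_row["email_masked"] = mask_email(report_row.pop("email"))
report_row["user_token"] = pseudonymize(record["email"])
```

The keyed hash lets rows for the same person still be joined by `user_token` without exposing the email itself. Note that pseudonymized data is still linkable to a person by anyone holding the key, which is why it counts as a smaller mitigation than anonymization.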
So like the data is scheduled for release publicly or at least semi-publicly within an org, and any of these things are small privacy mitigations that you do. And then those can lead up to more serious privacy mitigations like anonymization or even thinking through things like local first data processing, distributed data or encrypted computation. So it gets very space age the further you want to go. And those only fit certain types of problems, right?
So first you have to shape the problem. As every developer knows, even if they don't say it, if you don't know what problem you're trying to solve, then just putting a technology on it is unfortunately not going to solve the problem. Yeah, that's very, very good advice, right? So I like that in the beginning you mentioned that it's not something for one particular team or, you know, person to decide, right? So it's like a multi-departmental kind of effort,
right? And there are many departments that should be involved, right? You mentioned a few that are very important, right? It could be the privacy, you know, department, the technology, definitely the product, right? And maybe the legal side and InfoSec. Definitely data and security are kind of interrelated with each other, and there could be many other
parties, right? And especially in the organization, I think it's very important to have this so-called decision making process to actually identify whether something is sensitive or not sensitive.
¶ Fostering Data Privacy Practice & Culture
So I think during the explanation just now, you mentioned this term that is quite trendy these days, privacy enhancing technologies. I think in tech we all love, you know, to know what technologies are out there so that we can kind of apply them and try them. So what are some of the privacy enhancing technologies that are, you know, quite modern these days, that people are using
to solve this privacy thing? I think there's been some really cool developments, I would say, over the past 10 years. And I think the ones I'm most excited about are also the ones that show up in the book that I know you're now a bit familiar with, which is technologies like differential privacy, which is a way that we can reason technically about privacy.
I like to call it for people the idea of measurable privacy or rigorous privacy, because we're using a scientific way of thinking about or trying to shape the problem and then trying to decide that. And that works well for things like if you have to release data publicly, if you want to think through something like anonymization.
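To make the idea of measurable privacy a little more concrete, here is a minimal sketch of one classic differential privacy building block, the Laplace mechanism applied to a counting query. It is an illustration under assumptions, not one of the open source libraries mentioned in this conversation: the function names and the example data are made up, and a real deployment should rely on a vetted library rather than hand-rolled noise.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Noisy count of the values matching predicate.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of users; the released statistic is an aggregate.
ages = [23, 35, 41, 29, 52, 38, 61, 27]
rng = random.Random(0)
noisy_over_30 = dp_count(ages, lambda age: age > 30, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and a stronger privacy guarantee, which is the measurable trade-off. The query answers a question about the group, never about one named customer.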
And it can work in many different types of use cases, but it can never work if you're trying to say customer A wants B. It has to work in some form of aggregation of an idea or a person. But of course, when we think of anonymization, we cannot ever say that we release something anonymized that could be tracked back to one individual, because then it's not anonymized, right,
By definition. So I think differential privacy is really cool technology, lots of different approaches, lots of different algorithms to think through, and also an increasing number of really cool open source libraries that allow you to think through differential privacy from a development context. Then probably the next most famous one of recent years has been Federated learning or
Federated analytics. Sometimes I like to call it more distributed learning, and it also relates to what I would call the field of local first software. I don't know if you've already had local first software experts on your show, but Martin Kleppmann, who's famous for the book... Nice. Martin Kleppmann's book is very famous. Sorry. Designing Data-Intensive Applications. Something like that. Yes, yes, yes.
Thank you. And he and a group of folks that have kind of started this movement, I think it probably started before them, but have popularized this movement of local first data, which is more thinking through kind of exactly what we were talking about before, which is if we can now we have devices, we have edge devices, we also have edge compute like mobile devices. And certainly within whatever AWS cluster you run, you have plenty of compute.
So can we push processing as far down to the edge as we need? And therefore can we push the data as far down to the edge as we need? So this might mean what we call cross-silo learning or cross-silo analytics, which says you're a multinational company or you're in some sort of B2B situation and you allow every company to at least house their data within their own, you know, premises, so to speak.
So within their own boundaries of their cloud system or whatever it is that they're using. And they don't actually exchange data, they exchange some sort of analysis or output. So this could be machine learning processing. This could just be simple analytics, but we kind of try to push it there. Or it could go all the way down to: I'm building a mobile application, I'm going to keep all of the data local and I'm only going to send certain artifacts
that I'm going to push back up to some sort of centralized compute or some sort of redundancy compute, right? And depending on how that works, you could even offer things like end-to-end encryption and other types of things. Why would you go to this bother? Well, you go to this bother because A, you're greatly reducing your exposure risk, right? You actually don't have centralized data. You just have some centralized
insights. So how attractive is it to try to hack your system? Much less so, the less data that you have. And B, if you can get the same insights, you're also saving a huge amount on cloud compute. Because again, if with every ping, every two seconds I have the app open, you're just checking whether I'm still at my house. Like, cool.
I mean, not so cool. I don't think so. But you know, you're wasting a whole bunch of network, compute and storage to just find out that, guess what, between the hours of, you know, 9:00 PM and whatever time in the morning, I'm at my house, right? When with aggregate analytics, you probably could have already figured out that that is, you know, 80% of your user base or
something like this. I don't really know why you need to know that, but I'm just giving you an example, a rough example, of how differently we could engineer and architect our systems if we thought through again this question of what data do you
actually need. And can I do that in some sort of privacy-respecting manner, so that most of the data stays far away from some sort of centralized setup and certainly far away from some sort of data sharing setup where companies are just exchanging sensitive data at scale?
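The at-home example above can be sketched in a few lines of Python: each device reduces its own raw events to a tiny summary, and only that summary is sent to the central service. This is a simplified illustration of the local first and federated analytics idea, with hypothetical names (`LocalSummary`, `summarize_on_device`, `aggregate`) and made-up data, not a real federated framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LocalSummary:
    """The only artifact a device sends: two counters, no raw locations."""
    n_nights: int
    nights_at_home: int


def summarize_on_device(night_locations: List[str]) -> LocalSummary:
    """Runs on the device; the raw location history never leaves it."""
    at_home = sum(1 for loc in night_locations if loc == "home")
    return LocalSummary(n_nights=len(night_locations), nights_at_home=at_home)


def aggregate(summaries: List[LocalSummary]) -> float:
    """Runs centrally; sees only per-device counters, not events."""
    total = sum(s.n_nights for s in summaries)
    home = sum(s.nights_at_home for s in summaries)
    return home / total if total else 0.0


# Made-up per-device histories standing in for what stays on each phone.
devices = [["home", "home", "work"], ["home"], []]
share_at_home = aggregate([summarize_on_device(d) for d in devices])
```

The central service gets the aggregate insight without a location ping every two seconds. A per-device summary can still reveal something about one person, so in practice this pattern is often combined with the other techniques discussed here, such as differential privacy or encrypted computation.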
And then that leads to the third type of privacy enhancing technology I'm personally really excited about, which is encrypted computation. People might have heard about that under terms like homomorphic encryption or secure multi-party computation. And this essentially enables a whole bunch of use cases based on cryptography, which allows you to compute insights on data without decrypting it. Of course, then eventually somebody will decrypt it.
So the insight gets decrypted and used at some point in time, but the actual processing of the data can be done in an encrypted state. And that provides a quality that we like to call secrecy, which is different than privacy. But secrecy we can use to say, you know what, in our cloud compute we only ever computed on it encrypted, and then we moved it back to your device, where you personally decrypted it.
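One way to get a feel for computing on data that nobody sees in the clear is additive secret sharing, a basic building block of secure multi-party computation. The sketch below is a toy illustration under assumptions: it is not homomorphic encryption, the modulus and function names are illustrative choices, and real protocols add much more (networking, protection against misbehaving parties, and so on).

```python
import secrets

# All arithmetic is done modulo a large prime (an illustrative choice).
MODULUS = 2**61 - 1


def share(value: int, n_parties: int) -> list:
    """Split value into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares) -> int:
    """Only the combination of all shares reveals the value."""
    return sum(shares) % MODULUS


# Two parties each hold a private number (made-up values).
a_shares = share(120, 2)
b_shares = share(45, 2)

# Each party adds the shares it holds locally; neither sees the other's input.
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

# Only the final result is opened.
total = reconstruct(sum_shares)
```

Any single share looks like a random number, so intermediate values stay secret; only the final sum is revealed when the shares are combined.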
And so by the design that we have offered, we actually never saw the data in an unencrypted state except once it was on device or in a data sharing landscape. We actually only ever process data with another company by keeping that data encrypted so that neither company could learn anything other than the final released results. So any intermediary results, anything that we think might leak an individual level privacy, we can kind of cover up
with a secrecy blanket. And then at the end, when we decrypt and we pull off that secrecy blanket, we've already ensured that the end result has enough privacy that we can remove the secrecy. I hope that was understandable. You tell me. Well, I think definitely sounds really cool, all these advanced technologies, I may not be exposed to many of those technologies, but it sounds really promising, right? Especially I think the local first concept thing will be very important, right?
Especially if you don't want to get exposed to a lot of sensitive data at your organization. Encryption, I think, is still kind of the go-to strategy for securing the data, right? It could be encryption at rest, it could be in transit, right? It could be just like what you mentioned just now, homomorphic encryption and
things like that. These days cloud providers also come up with, I don't know, like secure chips, you know, where you can actually do more secure computing on the virtual machine. So it's definitely a space where, for those people who are interested, right, you can follow the technology trends. So you mentioned all this,
¶ The Legal Aspects of Data Privacy
right? Definitely one aspect for engineers or product teams is that it makes it really, really more complicated to actually, you know, come up with the solution, right? How can we actually, you know, explain to the stakeholders that we need to improve our solution by implementing all these complex things that might increase the effort and also the amount of resources that we need, you know, with all these technologies, with the amount of development effort and things like that?
Yeah, yeah, yeah, yeah. One thing that I think that your podcast has done well so far is talk about lean principles, talk about creating buy-in over time, developing a practice and eventually developing platforms, right? Because I think that's kind of the evolution that you have to work with. If you're starting with even just basic data governance, you're not going to deploy homomorphic encryption tomorrow, like that's not going to happen
for you. So what I think you have to start with is like you have to start with some of the building blocks. But try, I think from a kind of a lean product thinking, try to think through what types of these more advanced technologies actually fit the problems that
we have. You know, maybe it is exploring local first development or something and then pulling in very small use cases, very small product features or maybe even new product launches and saying, you know what, we're going to cordon off an extra 20% of the planned development time to think through an experiment. Could we use this new technology to help expedite future products so that they are launched in a private by design or secure by default type of mentality?
And what you're doing there is a, you're giving people who are interested in the topic enough space so that they can actually start exploring, right? And as you know from any new development technology, new type of thinking around development, if you give people a little bit of space and they already have some interest or motivation, they can actually go pretty far. And you're also developing this culture of learning and experimentation, which we all need.
And then B, you're also like boxing it enough that it doesn't become a blocker so that there's some sort of backup technology that you've always used that's there and that you're kind of giving, you know, you're growing the experience and knowledge. And I think as soon as you start focusing on the process rather than the outcomes, then you're going to see the rewards happen over time. And you asked like, how do we explain to the business stakeholders?
Well, maybe you don't even have to explain it yourself. It will become evident by allowing time in the process for that. You can say something as simple as: we need to stay up to date. We need to modernize the way that we think about privacy. And I think this is going to pay off in the long run by allowing us to build platforms that allow us to do privacy at scale, which
is the end goal, right? But along that way, we're going to add some extra time and cushion so that we can build in these new exciting technologies that I promise you, you will be able to talk about in some press release one day, right? And show that we're forward thinking and show that we're kind of advancing this space. But to start out and allow that.
And I think the advocates that will come out of allowing some of your developers or your data people or other technologists at your org to learn more deeply about this stuff is going to pay off in and of itself. Those people will end up being the advocates that educate the other people in the org about how this stuff works. And they will find, I promise you, they will find clever ways of explaining it if you give them the time and space to be able to learn.
Yeah. So I think that's, you know, very important advice, right? So start small, right? Focus on the process, not necessarily just the outcome. And I think everyone can build awareness within the team, within the organization, right, about privacy, and treat privacy more seriously. Obviously, one easy way to
¶ AI and Data Privacy
explain to stakeholders is about regulations, right? I'm sure in Europe it's easier to actually bring up privacy because there's the GDPR, which is very strict, right? Some other parts of the world are also starting to adopt these kinds of stringent rules, personal data protection laws and things like that. So let's go to that legal aspect. So what trends do you see
these days? What I know is that many countries are now starting to come up with this kind of law, this kind of GDPR-like regulation. So is this something that we are going to see going forward, like all countries will have their own kind of regulations? And I think it will become very complicated if, let's say, you build a product that works in multiple jurisdictions, right? So maybe tell us a little bit of your insights on this. Yeah.
I mean, I think it's already pretty complicated. It's probably gonna get more rather than less. I mean, that's why I think a lot of times when I talk about this, especially with high level stakeholders, like when we're talking with C level of some sort of multinational or company that wants to become multinational, we have to think about it as like future proofing the data strategy, right?
Because if we just kind of develop a data or AI strategy within what we know today, we're not going to be prepared for the future. And even what we know today is that we have a very fragmented legal jurisdiction setup around almost everything data, but certainly data privacy. But that also applies to data security. It also applies to data governance protections that different organizations are looking at.
And one of the big trends that unfortunately I don't see going away anytime soon is the data sovereignty laws as well. And these data sovereignty laws are kind of around the jurisdictional control of the data. So for example, that data cannot leave a certain nation state or a group of nation states or that data cannot travel across these
untrusted nation states. And if you already work in the government or you work in government adjacent, you probably already know this pain and you're like, you're not saying anything new, Catherine. Well, now the pain is coming to you, whether or not you're government adjacent, whether or not you deal with kind of highly secure systems or critical infrastructure. I think it's kind of ballooning
out to other areas. And again, at the end of the day, a lot of this is a policy problem internally at an organization, to decide, OK, what's our stance towards regulation? Where do we exist? Like what's important to us? And then obviously the legal experts, not the technology experts, get to decide, OK, here's our risk posture. Here are the policies that we want to see related to what we know about the data, how the data is stored, where it's stored, what it's used for.
And then we kind of get to step in as technologists at kind of that policy and principles level, I would say, and say this is possible, this is not possible. This is a good idea. This is not such a good idea, because... And then we can talk about things like architectures, clouds that we use, infrastructure questions, alongside cool stuff like privacy technologies.
And I would say there's a non-trivial number of legal professionals and privacy professionals that want to also learn more about technology or might even already know a lot about technology. And if you build those bridges, you can have a really useful conversation around what
does this actually look like. And yeah, I think that can be probably the most healthy approach, rather than just avoiding the topic because it's scary and you don't want to talk with the lawyers because they might say no. And then five years down the line they're like, why did we architect it this way? And there's no answer other than we didn't have the conversation early enough.
Yeah, I think you mentioned something really, really important, right: to involve the legal team, the legal aspect, when you build the products, right? When you architect the solution, when you architect your data storage and things like that. I think we are used to discussing, whenever we choose a cloud provider, right, whether the data center is in a particular country. Now I think that's not enough, right?
Because you also have to look at the data transfer. If, let's say, you use a SaaS provider that resides in another country, another jurisdiction, maybe the data sovereignty law doesn't allow you to do that, right? And maybe you can get into trouble, right, when transferring data over. So I think a lot of countries also implement this differently. I think it's very hard to keep up.
And that's why the legal team is probably still the best party to help advise the team on how to implement a better solution for this privacy problem.
¶ 3 Tech Lead Wisdom
So Catherine, we have spoken a lot about privacy technologies and all that. Is there something else that you think the listeners here should know about data privacy? I think lately I get a lot of questions around AI and data privacy. So I just wanted to note that there's a lot of new developments and thinking through, A, how do we build machine learning or AI systems that can utilize private data without problems?
But B, how do we, if we don't build the AI models ourselves, which most organizations still don't at this point in time, how do we actually build kind of a protective blanket around these interfaces to, again, these third party AI systems or AI APIs? And I just want to suggest out there, obviously, I think my book is a great resource. I also have a newsletter on the topic.
But to just kind of start to inform yourself, because what I'm seeing in the circles that I'm in is that this is a growing trend and a growing problem. And I think that it will come. It will come for everyone. It will come for people that are not machine learning experts. Soon enough.
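One very small piece of such a protective blanket is scrubbing obvious identifiers from text before it is placed into a prompt for a third party AI API. The sketch below is a hedged illustration: the regular expressions only catch simple email and phone patterns, `scrub` and `build_prompt` are hypothetical helper names, and no real API is called. Dedicated PII detection tools are considerably more robust than this.

```python
import re

# Deliberately simple patterns; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace obvious emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def build_prompt(question: str, documents: list) -> str:
    """Assemble a prompt from scrubbed documents and a scrubbed question."""
    context = "\n".join(scrub(doc) for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {scrub(question)}"


# Made-up document standing in for a sensitive internal record.
docs = ["Contact bob@example.com or +1 415 555 0100 about the renewal."]
prompt = build_prompt("Who should we contact about the renewal?", docs)
```

Only `prompt` would be handed to the external service, so the raw email address and phone number never leave the organization in this sketch.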
If it's not already a problem that you're facing, where you're sitting there with some other team and you want to use a third party API and all of a sudden this questionnaire comes in from the legal department on how is it going to look? Or how do you actually build that with privacy or with
security, right? If you're using proprietary documents, let's say you want to build like a RAG system and you want to have access to a bunch of sensitive documents that the company has, like how do we make sure that we build that
responsibly? So I just want to point to the fact that this is a growing field of practice and you don't have to learn it to the greatest depth, but just kind of start to, when it pleases you, inform yourself lightly. Like, hmm, what might that look like if we built this RAG system with privacy or security? How could that look different? Do we have the capabilities to
do that now? If not, what capabilities would we need to grow? And just, you know, you're tech leads, right, just start to allow that topic to percolate in your head a bit, because I think it's of growing importance in how we think about integrating machine learning into normal workflows. Yeah, I could imagine many teams now are kind of scrambling to implement some sort of AI into their systems, right? And especially when you do that, sometimes we don't think further
ahead, right? Especially if you're dealing with, I don't know, like customer support, you know, customer confidential data. You think just by implementing AI it's going to make life easier, but you are not sure whether it will expose certain things. Also, when you use some third party systems these days, you know, whenever these companies adopt AI, sometimes the default choice is to actually enable AI training, right, using your data.
So be aware of that. And also, I think those things need to be more transparent and explicit so that we can build trust with the third party systems that we use. So Catherine, I think it's been a great conversation. I learned a lot about data privacy, although it kind of scares me a little bit, thinking about how to build a more private by design system. Unfortunately, we have reached the end of our conversation.
So one thing I would like to ask you to end the conversation is what I call the three technical leadership wisdom. You can think of them just like advice. So maybe you can share your version with the listeners here. Yeah, I mean, I hope maybe I've sparked some people who might be interested in learning more about data privacy and data security topics. But the number one piece of advice is, even if you're not inspired to learn,
I ask that you empower people on your team or in your organization to learn. And that's because, as you can probably tell from my advice today, the field is actually quite deep and very intensive, and it's not necessarily a field that a lot of people specialize in because they studied it, right?
So you're going to want to open doors for people to grow and to learn, and to foster kind of that ability as a specialized culture of learning. And for you all as tech leads, if it's coming from you, it's going to be much more powerful than a junior who just joined the team and who's like, oh, wow, cool, I learned about differential privacy in college, because sometimes they do that now, like, I want to try out this thing.
If you can give that person space, if you can kind of shield them and give them some ability to learn, they're going to level up the entire org eventually, right? So certainly level up your team, but eventually this kind of spreads and creates these cultural changes that you need. So foster that learning, you know, support your people in learning. Create those spaces for learning. Try to give some back pressure on delivery deadlines to allow for some of that learning
cushion. Another thing that I want to share is I don't think all risk conversations have to be scary. I think it's totally natural. And like I, I empathize so much with it being scary because it is a big job and it is like a serious job to talk about the risk of people's data and that trust relationship. And yet at the same time, if we can normalize conversations around risk, if we can kind of allow for some of that space to have like scary topics, be part
of our road mapping, be part of our planning, where we can actually kind of just regularly do privacy risk reviews, security risk reviews, auditing reviews. If we can kind of normalize that as a normal thing, like we do with testing for bugs or figuring out if something is deployed properly, then we also make it more tangible. And I think that it becomes less scary by default and we make it less uncertain. And I think most of the scariness comes from this uncertainty problem.
So try to focus on that, right? And then finally, both of these relate to creating a team culture, and hopefully an organization culture, of psychological safety. And I think by knowing that we're not going to get everything 100% right all the time, we can foster learning, we can foster risk
discussions. And we can certainly build much, much more private and secure systems by allowing us to have real conversations, by allowing us to escalate problems when they exist, when we see them, and by empowering people to kind of have reporting up and down about what we think about problems like privacy and security. So up the hierarchical chain of the org and down.
And by fostering this kind of safety conversation around these types of problems, I think we end up creating by default better software, systems and products in terms of privacy and security. Wow, very lovely. And you know, it's all interrelated with each other, right? I especially love the second one. You know, not all risk needs to be scary, right? Sometimes it's just because of the uncertainty, like we don't know about that particular risk.
But I think if you dive deep, if you learn, right, and you allow people to experiment and try to solve the problem, with psychological safety there, I think maybe we can all improve together. So thanks for that beautiful wisdom. So, Catherine, for people who would love to talk more about privacy, is there a place where they can find you online, or maybe resources where they can learn more about privacy? Yeah, I mean, I have a newsletter, it's called Probably Private.
So if you're more of an email newsletter person, you can find me there. I'm also kind of starting to produce some YouTube videos. They're mainly around machine learning systems and privacy in machine learning systems. So you can find me there. My book is mainly written for data people, but I have also heard some feedback from software people and architects that parts of my book can be useful. And then there's such an amazing wealth of resources online
around privacy. I would even like even just start with your cloud provider, like your main cloud provider or a few of your main service providers if you have major services and just start looking around of like what settings do they have around privacy? What do they make available to you? What knobs can you turn?
Even just starting there, or even having somebody on your team, like the cloud architect or cloud engineer on your team, focus on that could end up paying dividends later, of just starting to know, hey, it'd be really easy for us to turn on, you know, this or turn off this. And that would provide better privacy. So I think those are some easy ways to get started. And make friends with privacy people at your org.
Be nice to them. Just reach out, set up a one-to-one 15-minute coffee, because they could be huge resources if you develop that. Or maybe you already have the relationship, but if you can develop that relationship, that can be a quick person that you can just double check your thinking with or think out loud with. If they have a legal background, they're never going to say yes or no, which is fine.
That's what they're trained to do, good job for them, but they're going to give you advice, they're going to give you steering ideas, they're going to give you guiding questions that are going to help you. So be friendly with them. Well, thanks for the plug. I'm sure if there are legal people listening to this, they will feel happy. So I think, yeah, make friends with the legal team, the compliance team,
the security team, right? Those people are not there to make the job harder for you; they actually help you and the organization to improve. So, Catherine, I love this conversation. Thank you so much for your time. I hope the listeners here learned a lot about data privacy today. So thanks again. Thanks so much, Henry.
