¶ Auditing Opaque Recommendation Systems
🎵 Music
Data Skeptic.
Exploring the methods, use cases, and consequences of recommendation.
🎵 Music
Welcome to Data Skeptic: Recommender Systems. You know, there's a lot of discussion these days about social media. There are some important laws being passed in other countries, and in places even here in the US. Now, you can come at this problem philosophically, have a bright idea, and argue for it and debate, but the most interesting angle to me would be an empirical one. How do we measure the harm caused by social media?
What is the null hypothesis, and how do we prove it false, if indeed it is false? Well, there's a lot of great scholarship on that, including a paper we're going to talk about today called AutoLike: Auditing Social Media Recommendations through User Interactions. Social media is a black box. It's not open source, and it's definitely not open data. So how can you study it from the outside? Well, AutoLike is one great example of how to do that. We'll get into that as the focus of our interview today.
My name is Hieu Le. I am an academic researcher working at the intersection of privacy, automated systems, and applied machine learning. I study the data collection practices of platforms such as smart TVs and virtual reality devices, and also how to study these practices through improving privacy-enhancing technologies such as ad blockers.
More recently, I have been interested in developing frameworks to examine recommendation systems that behave as For You pages, such as those in TikTok and YouTube Shorts. The majority of this work was done during my dissertation work at the University of California, Irvine, under the ProperData NSF Frontier project. I am currently bringing my expertise across these areas as a senior technologist at the Federal Trade Commission's Office of Technology.
Just as a general disclaimer, I am participating in my personal capacity, not as a representative of the FTC and the views expressed are my own and do not reflect those of the Commission.
And are you open to discussing a little bit about your role at the Commission? Obviously you speak on your own terms. Maybe you want to leave it alone, or just give people some understanding of the types of job opportunities that are out there?
So as a senior technologist at the FTC, I get to apply my knowledge of technology, security, and privacy across the spectrum of regulatory actions that the FTC takes. For example, I can work as part of case teams.
I can help write memos so that the agency can better understand certain emerging technologies. I can surface important issues that the agency or the commissioners may need to understand, and help them focus on those issues. And also, I think importantly, I get to be at the forefront of regulatory actions and really see how academic work, research, and findings can be applied to regulatory actions.
And would you say recommender systems are a focus of your work or just one piece of a larger puzzle?
I would say it's one piece of a larger puzzle. It's definitely something that I'm very interested in, and it's very timely. And also, in a personal capacity, I use social media a lot, especially TikTok. So I think it's really fun to both use it and also do research on it.
So we're definitely gonna zoom in on the recommender system part when we discuss the paper, AutoLike: Auditing Social Media Recommendations through User Interactions. But before we jump right in, can you say something about, I guess, the complement of your interests? What other areas do you spend time thinking about?
More recently, I am trying to investigate and learn how academic research, particularly on security and privacy, can inform policymakers or regulators, and vice versa. So I think it's important to understand: is it the methods or tools that are useful, or is it particular findings? For example, do we always need to find explicit harms to consumers, or to minors and children, for it to be helpful? That kind of thing. Or perhaps it's data sets.
¶ AutoLike's Reinforcement Learning Framework
Well, let's jump into AutoLike a little bit. For listeners who haven't taken a look at the paper yet, can you talk about the project and the framework, and what you were going for with this project?
Yeah, so the AutoLike project focuses primarily on social media platforms that have recommendation systems that behave like For You pages. This means that the user opens the app and sees one piece of content, like a short-form video, at a time, and the user can interact with the For You page, meaning they can like, bookmark, or share.
And then they can swipe up to see the next recommended content. These platforms, like TikTok, YouTube Shorts, and Instagram Reels, are very, very popular, especially with kids and younger adults. And as a user myself, I think it's very engaging as well. However, these recommendation systems are very opaque. You don't really know how they work, and you also don't know what kind of content is being shown based on your interactions with it.
We took a lot of inspiration from news reports, for example from whistleblowers or from leaks, and from prior research and user studies as well, showing that these For You pages or recommendation systems can be harmful to users. These harms are also outlined in the platforms' community guidelines. For example, they list the kinds of problematic content that should not be on the platform, like content about mental health issues or dangerous challenges.
So the problem is, how do you audit these recommendation systems in an automated way? And efficiently, for example taking the view of a regulator who does not have a lot of resources: how can I audit the system quickly?
Well, today I guess you could sign up for an account, use the system, and have a very anecdotal experience. Maybe you could get a few friends to come over and see what their feeds look like, and then you have a small, non-statistically-significant sample with some selection bias to it. What does this look like from a proper scientific framework? How do we do a good sample and take a measurement?
Yeah, so where AutoLike is different from what you're mentioning is that it doesn't try to capture how real users would interact with the recommendation system. At a high level, the user of AutoLike can say, I want to drive the recommendation system from scratch to a particular topic of interest. For example, I only want it to serve me content on pets. So what do I need to do? Do I need to like? Do I need to bookmark to get there quickly?
And then we can see how quickly we can get there, how much content will be served, and what kind. You can then also evaluate or characterize the content that you get served. So from that perspective, it doesn't really need to directly reflect how a user would use it.
It's almost as though you're really performing a search or a what-if scenario. Is it possible, or what does it take, to get, let's say, puppy content? I bet that's an easy one to achieve, but what if I wanted obscure seventeenth-century French novels? They're probably in there somewhere. There's a lot of content, but what does it take for me to find those? Is that effectively what the framework is doing?
Yeah, exactly. It's agnostic of the topic of interest. So in general, the user of AutoLike can pick any topic and then go from there to figure out how you would get there. A simpler example would be: I want to see videos of cats. Along the way, in the beginning, the platform may show me popular videos of pets or dogs.
And that's kind of close to cats. So then you would intuitively like that kind of content and keep going until maybe it switches to cats, or something like that. You know what I mean? So there are multiple steps of content that you can go through until you get to the particular topic that you want.
So that one's sort of obvious to me: if I wanted kitten content, I think I could achieve that pretty quickly. If I wanted, let's say, some content about self-harm or something like that, I would hope there really isn't much content like that on the platform, but it's also presumably harder to get to than kittens being cute.
What is the strategy of the framework? How do you get there from a brand-new initialized state? Maybe you could talk about that too, because I know there's some interesting automation going on here. What does it take to go out and search for the sensitive content you're wondering might be out there?
Yeah, so we formulate this as a reinforcement learning problem. Let me quickly explain that at a high level. From the beginning, the user of AutoLike will pick a topic of interest, right? For example, self-harm, if they wanted that kind of topic. And the beginning state would be a fresh For You page. So basically we would make a profile. You can actually go to the settings and say, I want my For You page to be completely refreshed, so that it knows nothing about me.
Reinforcement learning has a smart agent that wants to maximize cumulative rewards over time. It does that through trial and error by interacting with the environment. In this case, the environment is the TikTok For You page. The challenge is that the agent knows nothing about the environment, TikTok, right?
So it has to learn by trial and error. That means it's going to try liking, bookmarking, sharing, or even skipping. And from those actions, it will get back a reward. The higher the reward, the better the action: it means that the content being served next is closer to the topic of interest that we have chosen.
The agent will keep learning over and over until it eventually learns the optimal action to take per state to get to what you want. That's the overall strategy. Now, relating that back to TikTok: we represent the state of a TikTok by a score from zero to one that says how close that TikTok is to the topic. For example, if the TikTok is about cats, is it very far from self-harm? If it is, then it would get a low score.
It's kind of like topic classification: is cats related to self-harm? And a low score would also be reflected as a low reward. So that means, okay, maybe we want to skip this particular cat content because it's not what we want.
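A minimal sketch of the formulation just described, assuming a classifier has already produced a relevance score between zero and one for each video. The names `Action`, `Step`, `reward_from_score`, and `cumulative_reward` are illustrative and are not taken from AutoLike; treating the reward as exactly equal to the score is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The interactions mentioned in the interview."""
    LIKE = "like"
    BOOKMARK = "bookmark"
    SHARE = "share"
    SKIP = "skip"


@dataclass(frozen=True)
class Step:
    """One swipe on the feed: what the agent did and what it saw next."""
    action: Action
    next_score: float  # relevance of the next video to the target topic


def reward_from_score(score: float) -> float:
    """Assumption: the reward is the relevance score, clamped to [0, 1]."""
    return max(0.0, min(1.0, score))


def cumulative_reward(steps: list[Step]) -> float:
    """Total reward over an episode, which the agent tries to maximize."""
    return sum(reward_from_score(step.next_score) for step in steps)
```

An episode such as `[Step(Action.SKIP, 0.1), Step(Action.LIKE, 0.7)]` then has a cumulative reward that grows as the feed moves closer to the chosen topic.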
And can you talk a little bit about how you label the content? I'm sure if it was pre-labeled as self-harm, it would be banned by the platform. So it's not obvious what something is when it comes to you. How do you end up getting a label?
Implementation-wise, how does the system know what kind of content it is? Basically, what we do is extract the TikTok URL by using the app's share feature. For example, when you share to a particular person, you can copy the link. We get that link and then we send it to an external server that opens up the TikTok in the browser. It can then download that video.
Then we can extract certain information about that TikTok. For example, we can get the description and the creator of that TikTok. We can feed it through OpenAI's Whisper model, which takes the audio of the TikTok and turns it into text. So we concatenate the description and the audio-to-text output,
and we feed it through any kind of AI model that can do topic classification. So it's like, is this related to cats or self-harm? And then we say, give me a score between zero and one.
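The last two steps of that pipeline (concatenate the description with the transcript, then ask for a score between zero and one) can be sketched as follows. This is only an illustration: the keyword-overlap scorer is a stand-in so the example runs without any model, and it is not the classifier the interview refers to. The function names and the sample strings are made up.

```python
import re


def build_text(description: str, transcript: str) -> str:
    """Concatenate the video description and the audio transcript."""
    return f"{description.strip()} {transcript.strip()}".strip()


def topic_score(text: str, topic_keywords: set[str]) -> float:
    """Stand-in scorer: fraction of topic keywords that appear in the text.

    A real pipeline would call a topic classification model here; keyword
    overlap is used only to keep the sketch self-contained.
    """
    if not topic_keywords:
        return 0.0
    words = set(re.findall(r"[a-z']+", text.lower()))
    hits = sum(1 for keyword in topic_keywords if keyword in words)
    return hits / len(topic_keywords)


# Hypothetical description and transcript for one video.
text = build_text("My cat loves this toy #cats", "look at the kitten jump")
score = topic_score(text, {"cat", "cats", "kitten", "purr"})
```

Whatever model replaces `topic_score`, the contract stays the same: text in, a number between zero and one out, which then becomes the state and reward signal for the agent.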
And then the agents themselves: I'm curious if you can elaborate on how intelligent they are. They can start from zero knowledge and just sort of randomly like or skip or take these actions, but it seems like it would be surprising, or maybe inefficient, to random walk from the initialized state into one of these undesirable states. Or maybe you're gonna tell me it's actually rather easy. I don't know. What are the agents doing?
Yeah, so in the beginning, the RL agent is not gonna know anything, so it's gonna have to do some exploration. It doesn't know which actions are good. But as it learns across several hundreds to thousands of TikTok videos, it's going to be like, okay, I'm at a particular piece of content that, for example, may be about mental health. It's positive, it's about awareness, but it's still about mental health,
and it's kind of related to self-harm. Then it would get a higher reward and say, okay, now I know that liking this kind of content is going to be good. It will get me closer to that end goal. So basically, how does it know? It takes the reward and averages it over time. And it would know: since I got a high reward last time, I'm gonna maybe keep doing that,
since I will continue to get the higher reward. Or I may explore another random action that is unknown but may give me better rewards than what I already know. So there's this trade-off between always doing the action that you know, versus exploring other actions that you do not know but that may have better rewards.
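The reward averaging and the explore-versus-exploit trade-off can be illustrated with an epsilon-greedy agent. This is a textbook technique used here as a stand-in; the interview does not say which algorithm AutoLike uses. The sketch also simplifies by keeping one running average per action and ignoring state, whereas the interview describes learning the best action per state. The class name and the `epsilon` default are assumptions.

```python
import random


class EpsilonGreedyAgent:
    """Keeps a running average reward per action and mostly picks the best."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon  # probability of trying a random action
        self.rng = random.Random(seed)
        self.counts = {action: 0 for action in self.actions}
        self.values = {action: 0.0 for action in self.actions}

    def choose(self):
        """Explore with probability epsilon, otherwise exploit."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda action: self.values[action])

    def update(self, action, reward):
        """Fold the new reward into the running average for this action."""
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n
```

With `epsilon=0` the agent always repeats the action with the highest average so far; raising `epsilon` makes it try unknown actions more often, at the cost of some lower-reward steps.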
¶ Empirical Findings and Auditing Limitations
Well it sounds like the approach is robust enough to really apply to any topic you wanted, but for the purposes of your study, what sort of topics were you focused on?
Our paper actually looks at both benign topics, like sports, pets, and weather, and also other topics like mental health, including more negative-sentiment mental health content. What we first do is a control experiment across one hundred videos. A control experiment means we're going to simply skip all one hundred videos and see what kind of content we get. Then, to compare, we run AutoLike and say, I want you to drive the recommendation system to serve me cat content.
And it's gonna like every time it gets closer and closer, across the same one hundred videos. Then you can compare it to the control experiment and see that, across one hundred videos, liking the videos will start having the recommendation system serve you, I think, around two times more pets and sports content versus just skipping all content.
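The comparison against the skip-everything control can be expressed as a ratio of on-topic fractions. The sketch below is an assumption about how such a comparison might be computed, not the paper's actual analysis: the 0.5 threshold, the function names, and the ten-video score lists are all made up for illustration.

```python
def on_topic_fraction(scores, threshold=0.5):
    """Fraction of served videos whose topic score meets the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)


def uplift(treatment_scores, control_scores, threshold=0.5):
    """How many times more on-topic content the treatment run was served.

    Returns None when the control run saw no on-topic content at all,
    because the ratio is undefined in that case.
    """
    control = on_topic_fraction(control_scores, threshold)
    if control == 0:
        return None
    return on_topic_fraction(treatment_scores, threshold) / control


# Made-up topic scores for ten videos per run.
control_run = [0.1, 0.6, 0.2, 0.1, 0.7, 0.3, 0.2, 0.1, 0.0, 0.4]
autolike_run = [0.2, 0.6, 0.7, 0.3, 0.8, 0.4, 0.9, 0.1, 0.2, 0.3]
ratio = uplift(autolike_run, control_run)
```

A ratio near one would mean the interactions made no difference; the interview's "around two times more" corresponds to a ratio near two.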
Well could we talk a little bit about some of the empirical results? Were the agents able to achieve their goals?
I would say yes. In the experiments that I talked about, we are able to drive the recommendation system to personalize, basically, or to start serving a lot more of the content that you want, across pets, sports, and weather. We also look at another dimension, which is sentiment. So basically happy or sad, because any topic has these sentiments to it. For example, mental health content can be positive, as in awareness kind of TikTok content, versus more negative, which we do see in terms of, maybe, cutting or self-harm. So that's the negative side of it. And we found that AutoLike can drive the recommendation systems across both dimensions as well.
So for example, when we chose sad mental health, we were able to do that compared to the control experiment. Another interesting result that we found is that if you choose a sad topic, for example sad cats, and you look at all the content that is being served, because there is other content besides sad cats, the overall sentiment starts to get sad as well. So the agents actually started getting TikToks that were sadder in nature, and not just sad cats. I thought that was really interesting.
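One way to check for that kind of drift is to average sentiment over everything served, and separately over only the off-topic videos. The sketch assumes each video already has a topic score in [0, 1] and a sentiment value in [-1, 1] (negative meaning sad); that scale, the 0.5 threshold, and the function name are assumptions for illustration.

```python
from statistics import mean


def mean_sentiment(videos, on_topic=None, threshold=0.5):
    """Average sentiment over (topic_score, sentiment) pairs.

    on_topic=None averages every video; True or False keeps only the
    videos that are, or are not, at or above the topic threshold.
    Returns None if no video matches the filter.
    """
    selected = [
        sentiment
        for score, sentiment in videos
        if on_topic is None or (score >= threshold) == on_topic
    ]
    return mean(selected) if selected else None
```

If the off-topic average in a "sad cats" run is lower than the off-topic average in the control run, the feed as a whole has become sadder, not just the cat videos.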
Can you zoom in on the ways in which you used reinforcement learning in the project?
So first, to understand AutoLike, I need to explain at a high level what reinforcement learning means. Reinforcement learning is a type of machine learning where a smart agent is tasked with maximizing its cumulative rewards over time through trial and error. To do so, the agent interacts with an environment.
After every action, the agent receives feedback in the form of a reward. It takes this reward and learns from it to take better actions in subsequent steps to earn higher rewards. For example, take a city map where we need to drive from point A to point B in a 2D space. At each intersection, we need to decide to turn right, turn left, or go forward, right? After each decision, we get a reward, where high rewards reflect that we got closer to our destination.
The challenge here is, first, that the agent knows nothing about the environment. So it knows nothing about the map, and it has to learn it from scratch. In the beginning it may take certain turns that go farther away from the destination. And second, the agent will have limited tries to interact with the environment.
It must choose an action wisely at each step. For example, let's say there's a limit to how many turns the car can make to get to the destination. Now the problem gets a lot harder. There's extensive literature on this, and it has been applied across many different problems, such as training an RL agent to play a video game, or figuring out which advertisement produces the most clicks and buys from users.
In my own work, I've applied RL to generate filter rules automatically to block ads and tracking on the web. I've also applied reinforcement learning with Roya Ensafi's Censored Planet lab at the University of Michigan to efficiently learn which internet domains are being censored or blocked across geographic regions. So here I'm also applying this kind of reinforcement learning framework, in AutoLike, to audit recommendation systems. Okay, I hope that was a good overview.
So now, how is reinforcement learning used for this particular work? Basically, a user first selects a topic of interest. For example, the user wants to drive the recommendation system to serve them cat content. We start from scratch, with a very fresh For You page that knows nothing about the user. The agent is going to interact with
the platform, like TikTok, and it's going to try the different actions available to it, like liking, bookmarking, sharing, or watching. In the beginning, it knows nothing about how these actions will drive the recommendation system toward serving cat content. So it's going to keep watching videos one step at a time. After each step, it looks at the kind of content that's being
recommended. If it's close to cats, for example maybe it's pets or dogs, then it will receive a high reward, and that means it will know for next time that it should like that type of video.
Well, if I'm not mistaken, AutoLike, although we've described so far what it does, and I've described it like a proof of concept that indeed it can find problematic content, is designed for auditors. Who do you see as the users of the system, and what are they using it for?
Yeah, that's a good question. We envision users of AutoLike being, for example, regulators, such as those from the FTC, who want to capture evidence that a platform is recommending certain types of content in a way that is very easy to get to. For example, if I only need to view fifty videos, or interact fifty times, and I'm already getting a bunch of problematic content,
then that may be an issue. It can also be for the platform designers, right? They can be like, oh, in terms of catching the kind of content that we don't want, I think we're doing a good job. But okay, let's apply AutoLike or a similar methodology on top, because it treats the platform as a black box, right? It's just using the platform in an automated fashion. And maybe it does find certain content.
Or maybe it's very difficult, and it takes thousands of videos before it can find certain problematic content.
Well, as you mentioned, the platform is a black box. They don't give you their source code or tell you exactly how their system works. You're maybe guessing based on some papers they've written, if those things are even still relevant, and all that kind of stuff. So you're never sure what you're interacting with. I guess in one individual trial, maybe that particular user was put into some A/B test and had a weird experience.
But I guess in aggregate you start to see something. Do you have a sense of how consistent the user experiences are after running many trials?
I think that's really difficult to answer. Our work right now is definitely a work in progress. And you're right that we do need to run it hundreds to thousands more times. There are other factors to consider, such as the starting profile. Maybe you make a child account or an adult account. There's which location you're in, and what time you run the experiment. Or maybe you can preset certain interests for that profile and then see how it reacts. All of these are factors where we would have to run more experiments to see how the system reacts. What's nice is that it is automated, so that will help.
And the systems will change over time anyway, so we can't always do manual processing. For example, maybe we run AutoLike every six months or every three months, that kind of thing. That can help us see how platforms change over time. Now, you actually brought up a very good point in terms of the challenges of this type of research, which is that when we train the RL agent in real time, meaning we're actually using TikTok,
it takes a lot of time, because you actually have to watch the video, for example, and there's also overhead where you have to extract the video and then classify that content. Multiply that over hundreds to thousands of videos, and then multiply that by hundreds to thousands of experiments. That takes a long time. So there are ways to reduce that.
For example, if we can simulate the environment, so not use the real TikTok but have a similar environment that behaves like TikTok, then it could return that reward quicker. One way to do that is to get a lot of data from users who have used TikTok, and then use that user data to model a simulated TikTok, basically. That way you can do it super fast. So that is definitely the next step for AutoLike.
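To make the idea of a simulated environment concrete, here is a deliberately crude toy feed. It is not built from user data, as the interview proposes; it simply assumes that liking an on-topic video raises the chance of seeing more of that topic and that liking an off-topic video lowers it. Every name and number (`SimulatedFeed`, `base_rate`, `boost`, the probability bounds) is invented for illustration.

```python
import random


class SimulatedFeed:
    """Toy For You page: likes on on-topic videos make the topic more likely."""

    def __init__(self, base_rate=0.1, boost=0.05, seed=None):
        self.p_on_topic = base_rate  # chance the next video is on topic
        self.boost = boost
        self.rng = random.Random(seed)
        self.current_on_topic = self._draw()

    def _draw(self):
        return self.rng.random() < self.p_on_topic

    def step(self, action):
        """Apply 'like' or 'skip' to the current video, then serve the next.

        Returns the reward: 1.0 if the next video is on topic, else 0.0.
        """
        if action == "like":
            delta = self.boost if self.current_on_topic else -self.boost
            self.p_on_topic = min(0.95, max(0.01, self.p_on_topic + delta))
        self.current_on_topic = self._draw()
        return 1.0 if self.current_on_topic else 0.0


def always_skip(current_on_topic):
    """Control policy: skip every video."""
    return "skip"


def like_on_topic(current_on_topic):
    """Like only the videos that are already on topic."""
    return "like" if current_on_topic else "skip"


def run(policy, steps=500, seed=0):
    """Total reward a policy collects on a fresh simulated feed."""
    feed = SimulatedFeed(seed=seed)
    return sum(feed.step(policy(feed.current_on_topic)) for _ in range(steps))
```

Because nothing has to be watched, downloaded, or classified, thousands of episodes like `run(like_on_topic)` finish almost instantly, which is the speed-up the simulated approach is after. How faithfully any simulator matches the real platform is a separate question.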
So could you talk a little bit about someone who wants to use AutoLike? I don't know if you offer AutoLike as a service or if they can clone a repo and spin it up. How does a person get started?
That's a good question. I have received requests about using it, but unfortunately I have not open sourced it at this time because it's still a work in progress. But the plan is definitely to open source it, as with the other research that I've done. It would be available on GitHub, perhaps along with data sets of all the TikTok sequences and interactions that we've collected.
And can you talk a little bit about the engineering feat? I think a lot of heavy-duty work and coding probably went into this. It's not a simple thing to set up. What did it take to make all these measurements with AutoLike?
So to interact automatically with TikTok, for example, we use an Android phone. Android has a feature called UI Automator, which allows us to extract elements from the screen that we see and then perform certain actions on them. For example, extract the screen that we see, look for the like button, and then click it. Or look for the share button and then click it.
That's how we simulate those user interactions. Then, like I said, we extract the TikTok URL and send it to our external server, and there we use Selenium, a Python framework that can control a browser, to go to that TikTok, download it, and then extract all the information needed to classify the TikTok: whether it's on topic or not, or whether it's sad or happy.
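The "extract the screen, look for the like button, click it" step can be sketched without a device. The example assumes the screen has been captured as a UI Automator XML hierarchy dump, in which each `node` carries a `content-desc` and a `bounds` attribute of the form `[x1,y1][x2,y2]`. The sample dump, the `"Like"` label, and the function names are made up; how AutoLike actually locates buttons is not described in the interview. The tap is expressed as an `adb shell input tap` command that is built but not executed.

```python
import re
import xml.etree.ElementTree as ET

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

# Hypothetical, heavily trimmed screen dump.
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node class="android.widget.FrameLayout" content-desc="" bounds="[0,0][1080,2340]">
    <node class="android.widget.ImageView" content-desc="Like" bounds="[960,1200][1060,1300]" />
    <node class="android.widget.ImageView" content-desc="Share" bounds="[960,1600][1060,1700]" />
  </node>
</hierarchy>"""


def find_tap_point(dump_xml: str, content_desc: str):
    """Return the (x, y) center of the first node with this content-desc."""
    root = ET.fromstring(dump_xml)
    for node in root.iter("node"):
        if node.get("content-desc") == content_desc:
            match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
            if match:
                x1, y1, x2, y2 = map(int, match.groups())
                return (x1 + x2) // 2, (y1 + y2) // 2
    return None


def tap_command(point):
    """Build (but do not run) the adb command that would tap the point."""
    x, y = point
    return ["adb", "shell", "input", "tap", str(x), str(y)]


like_point = find_tap_point(SAMPLE_DUMP, "Like")
```

Because the lookup depends only on the screen hierarchy and not on anything TikTok-specific, the same approach would carry over to other apps with different labels.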
Very neat. And could you maybe give some broad thoughts on, I guess, the throughput, or scaling this up? If you had a massive auditing effort you wanted to undertake, measuring lots of topics and categories, what would it take to get that going?
¶ Future of Platform Accountability and Research
Yeah. Doing it in real time is very difficult to scale. It means you would need maybe tens to hundreds of Android devices, and a server that can handle all the processing: all the downloading of the videos, and also doing zero-shot classification of all those videos to get the scores for the type of content that we're looking at. This is why we are looking at the next step, which is how you
train the RL agent without actually going to TikTok. That way we can scale it much higher. And then once you train it, you can just apply it to TikTok and have it do what it does.
So if an auditor takes on a project using AutoLike, they have a goal in mind. We've described a lot of, I guess, the raw data they could get. You could probably share details about the policy the agent developed and the types of content it saw, and compare those to the baseline always-swipe-away strategy you described.
But how do you envision someone formalizing that into a deliverable? I would imagine a policy person wants to either go in front of a committee or write a report or something like that. What are the key takeaways they'll get from the effort to inform whatever decision they're looking to make?
Yeah, so the output of AutoLike is basically just gonna be a sequence of interactions and TikTok videos, right? This is only taking TikTok as an example. From there, people can look at how quickly it got to a certain type of content. They can say, oh wow, it's very easy to get to negative or problematic content. Or it's just as easy to get to cat videos as self-harm videos, something like that.
You can also look at the types of content that we were able to see within one run of AutoLike, for example. You can characterize that and say, wow, this platform says that it should not have this kind of problematic content, but we were able to find tens or hundreds of pieces of it.
It spans these topics. Maybe there are very obvious hashtags on these videos. Or maybe they can say it's very difficult to find because there are no hashtags, or there are special hashtags.
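The "how quickly did it get there" and "how much did we find" questions can be turned into a small summary over one run's sequence of topic scores. This is a sketch under assumptions: the 0.5 threshold, the function names, and the summary fields are illustrative and are not AutoLike's actual output format.

```python
def steps_to_first_hit(scores, threshold=0.5):
    """1-based position of the first on-topic video, or None if never seen."""
    for position, score in enumerate(scores, start=1):
        if score >= threshold:
            return position
    return None


def summarize_run(scores, threshold=0.5):
    """Headline numbers an auditor might report for a single run."""
    hits = sum(1 for score in scores if score >= threshold)
    return {
        "videos_seen": len(scores),
        "on_topic": hits,
        "first_hit_at": steps_to_first_hit(scores, threshold),
    }
```

Comparing `first_hit_at` across topics is one way to phrase a statement like "it is as easy to reach topic A as topic B" in numbers.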
There's probably no sense in which a platform could say we have zero problematic content, or zero percent. They're really just looking to minimize it. So maybe the world looks at it differently, and I'm not close enough to the problem to understand. Let's say we want to give a report card to every social media platform for how well they've made it difficult to access problematic content. Are we in a place where that's even a well-posed question?
I think you're asking, okay, yeah, how easy is it to audit these platforms? Is that what you're saying?
Yeah, let's say we decide we want to give a report card to everybody. Maybe they all get A's, maybe they're all failing. Do we even know what we're measuring?
I think that's a good question. I mean, there are so many factors that you can look at. For example, AutoLike really just looks at the particular interactions and the content served as a result. So that's what the output is, right? But there are other factors that people can look at. For example, how does it impact the actual users? How do users feel when they use these platforms? Is it addicting? Is it positive? Are they having fun?
How is the content affecting them? And more recently, people are looking at the kinds of addictive features that are being deployed on these platforms as well. So I think it really just depends on what the report card wants to measure. I mean, maybe it could say, oh, this is very addicting, or this is very bad for kids. Or, I'm not sure. Maybe these things are okay as well, with limited time usage.
For example, TikTok used to have a feed that's just about STEM. Or can you imagine an educational-only For You page? Is that still good for the user if they're able to use it nonstop? Is that still addicting or harmful? I'm not sure. So yeah, as a society, we're still figuring out these potential harms and potential benefits.
Do you think it's gonna be psychologists or computer scientists who figure out how we measure and hopefully mitigate these biases?
I mean, definitely as computer scientists, we're able to develop frameworks and tools to get data that can be helpful in characterizing these problems, but we don't really have the expertise to understand how users feel.
I know you've now been pretty close to TikTok, and maybe you've looked at other platforms as well, in trying to get data out of that black box, and AutoLike is a project that accomplishes that pretty well. What if there was a world in which these platforms were a little bit more cooperative, especially with researchers? If you had a wish list, or something that you think they would be willing to do that would be beneficial, do you have any ideas on what that might be?
Yeah, that's a great question. Definitely, it would help if the platforms were able to provide special environments or APIs that researchers can hook into. For example, if I give the ID of a particular video, and they can just give me back the type of content they already think it is, that would be very helpful, right?
Or maybe even providing data sets on the platform's usage would also be helpful, where the data sets are anonymized for research use.
And what about the future of the project? Where do you see AutoLike going?
The future of AutoLike, first, is that we're trying to scale the evaluation of the tool, which means we need to find more usage data so that we can model the platform, and run more controlled experiments without actually going to the platform, to TikTok, which would take a long time. And then also,
we can definitely apply this to all the other platforms that have For You pages. It's kind of the same thing. Our implementation is mostly agnostic of the app, because it uses UI Automator, which is simply interacting with an app. It doesn't necessarily need to be TikTok.
AutoLike is also agnostic of the dimensions that I spoke about. For example, we looked at topics of interest and sentiment, happy or sad, but really people can look at any other dimensions that are important to them as well. For example, maybe people care about whether the content is misleading or not, or any other factors, really.
We talked a little bit about what's next for AutoLike. What's next for you?
Yeah, thank you for asking. In May of this year, I will be presenting my work, CenRL: A Framework for Performing Intelligent Censorship Measurements, with Ram Sundara Raman and Roya Ensafi, at the IEEE Symposium on Security and Privacy. In July of this year, I'm also co-chairing the policy-relevant privacy workshop, which is co-located with the Privacy Enhancing Technologies Symposium.
This workshop brings together researchers from computer science, law, and public policy, along with practitioners, to explore how research can better inform privacy regulation and enforcement, and vice versa.
And is there anywhere listeners can follow you online or find out about that symposium as well?
Yeah, I have a website, levanhieu.com, where I usually post all of my work and also all the upcoming events that I'll be attending. So please check it out.
Very good. We'll have some links in the show notes for listeners to follow up. Well thank you so much for taking the time to come on and share your work.
Thank you so much. It was a pleasure.
🎵 Music
