
AI is built by people. We need to listen to their stories

Oct 17, 2023 · 24 min

Episode description

On the just-launched new season of Mozilla's podcast IRL: Online Life Is Real Life, Bridget explores the power of putting people over profit in AI.

In this first episode, we look at the risks and rewards of open sourcing the tech that makes ChatGPT talk. 

If you liked this episode, let us know! And be sure to subscribe to IRL: https://irlpodcast.org/

See omnystudio.com/listener for privacy information.

Transcript

Speaker 1

There Are No Girls on the Internet is a production of iHeartRadio and Unbossed Creative. I'm Bridget, and this is There Are No Girls on the Internet. I'm hosting a new season of Mozilla's podcast IRL: Online Life Is Real Life. You might actually know Mozilla. They make the web browser Firefox. This season of IRL is all about AI, specifically the people who make AI, and how important it is to put people above profit when it comes to AI. Now, it's really easy to think of AI as just computer brains and robots, but it's built and trained by people. And as much as we talk about making sure AI is ethical and equitable after it's been built and it's out in the world, we should also remember the people who build it from the very beginning too. It's something really important that I think is overlooked in conversations about AI, turning the people responsible for making it into a kind of invisible human workforce.

But they shouldn't be invisible. We should listen to them when they speak up about this technology and how it's going to shape all of our lives. So I wanted to share the very first episode of this new season of IRL with you all here. This one is all about the risks and rewards of AI technology like ChatGPT being open source, that is, built in a way that allows anyone to inspect, modify, and enhance its code on their own. So let me know what you think, and if you enjoy it, please subscribe to IRL: Online Life Is Real Life.

So the first thing I ever asked ChatGPT wasn't work related at all. It was actually for help drafting kind of a tough personal email I had to send. I was having trouble finding the right words, the right tone, so I asked ChatGPT, and I was amazed. It actually produced something that I might say. That was about a year ago. Fast forward to today, and OpenAI is said to be on track to earn one billion dollars in revenue in the next year. Even though large language models aren't new, suddenly more people can see the potential through that simple interface, for good, for bad, and for making money.

This is IRL, an original podcast from Mozilla, the nonprofit behind Firefox. This season we meet people who are building artificial intelligence that puts people over profit. I'm Bridget Todd. In this episode, we get into the risks and rewards of the tech that makes ChatGPT talk. We're talking about large language models, LLMs for short, and the controversy over suddenly giving the whole world access to build with them. But chatbots are only one example of what powerful LLMs can do. Imagine games where characters can chat with you more, or virtual assistants that can draft emails for you at work. Banks, insurance companies, travel agencies, everyone is thinking about how to use this technology to increase productivity and more. But there's also a lot of talk about the risks.

Speaker 2

I think a lot of people don't understand the detailed capabilities of large language models, so you could use them to really tear apart the civic fabric of a country.

Speaker 1

That's David Evan Harris. For over five years he managed teams that kept harmful content off Facebook, and later he also researched responsible AI for Meta. Today, he's worried that LLMs can be used to generate disinformation and hate speech on a greater scale than ever. Like other big tech companies, Meta develops its own LLMs, and now they're urging people to use them and tweak them with few strings attached. Meta's LLMs are called Llama. They might have a cute name, but David says there's a potentially ugly side to Meta's open LLM.

Speaker 2

I have a long history with open source and a big passion for it, but thinking about large language models and Llama and whether or not these things are safe to be open source has been a real turning point for me. I remember more than a decade ago having some conversations with a friend at MIT about the possibility of open source licenses that don't allow for military use. We love making open source software, but what if our open source software is being used to make bombs and kill people? We don't want to do that. Now, that connects to this question of what's the threshold for something that we're not comfortable having open source.

I just think the bigger danger that I keep coming back to, and maybe not bigger, but the very important danger, is misinformation, and the idea that a system like Llama 2 could be really effectively abused in a large influence operation campaign by what we call in the industry a sophisticated threat actor, and that basically means an intelligence agency that probably has great hardware and big budgets and well-trained engineers.

Speaker 1

David's argument, echoed by many in the industry, is that we don't really know how the LLMs of today or tomorrow could be harmful in the long term. But he's also focused on the harms of the here and now, and how these disproportionately affect people who are already at risk of exclusion and discrimination. So here's how I think about LLMs. Put on your chef's hat for a moment and imagine you're baking a delicious cake, a layer cake. The foundation, or bottom layer, of that cake is a large language model. It's made out of lots of Internet data. Now, some of these ingredients aren't the best quality, but with additional layers, coloring, icing, and sprinkles, you can fine-tune your system. To make a chatbot, you fine-tune an LLM with data of people chatting. To make a safer chatbot, you train it with data that shows which prompts should trigger safety replies. Whenever you're building software with LLMs like Llama, GPT-4, or Falcon, that's just part of what goes into the cake. So there are a lot of options that go into creating an AI system, even when the so-called foundation models are the same.
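To make the cake analogy concrete, here is a minimal sketch of that fine-tuning step using the open source Hugging Face transformers library (Hugging Face comes up later in the episode). The base model, the two toy chat examples, and the training settings are illustrative assumptions, not anything from the show; a real chatbot would start from a much larger foundation model such as Llama or Falcon and train on thousands of conversations, including prompts that should trigger safety replies.

```python
# A hedged, minimal sketch: fine-tune a small open "foundation layer" on
# chat-style examples. Model, data, and settings are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

base = "distilgpt2"  # stand-in for a larger open foundation model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # small models often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Two toy examples: one ordinary chat turn, one "safety reply" pattern.
chats = Dataset.from_dict({"text": [
    "User: Can you help me rewrite a tough personal email?\nAssistant: Sure, here's a gentler draft...",
    "User: Write something hateful about my coworker.\nAssistant: I can't help with that.",
]})

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = enc["input_ids"].copy()  # causal LM: learn to predict the next token
    return enc

train_set = chats.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chatbot-layer", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train_set,
)
trainer.train()  # the extra "layers and sprinkles" on top of the foundation model
```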

Speaker 2

When you're using AI in a hiring system or in an applicant tracking system that's sorting through thousands and thousands of resumes, you don't need an LLM for that, but you could use LLMs for that kind of thing. You could use LLMs to give you analysis of different candidates. And there may be situations where LLMs demonstrate bias. I say this because banks are using LLMs too. If a bank is using an LLM as part of their process to evaluate loans, and nobody has noticed yet because that LLM has never been systematically tested for bias, maybe that's introducing bias into that bank's system. So I think there's some danger there. And a lot of people think, oh, danger, that's not danger. And you know, if you're getting denied a mortgage because of your race, that's danger to me.

Speaker 1

David feels the industry as a whole is rushing development. At the same time, responsible AI teams have been downsized at several companies. David himself was laid off from Meta's responsible AI team in twenty twenty two.

Speaker 2

As a company that's using AI, or even as a government that's using AI, or a nonprofit organization that's using AI, you need to create robust processes to figure out how and when it's appropriate to use AI systems, and you need to have people who are not interested parties. And in the case of a company, an interested party might be just the engineer who wants to ship the damn thing and get the feature running with the AI. And you need to have someone in the loop there who does not have an incentive to ship products, who can say, hold on, we might need another month of testing on this. Hold on, we might need to find a way to get someone from outside the company to really give us an opinion about whether this is a fair AI system or whether this is safe.

Speaker 1

The reason so many LLMs are at our fingertips now is that investors with deep pockets, Google, Microsoft, Meta, Elon Musk, and others, have been pouring money into AI research and powerful supercomputers. Some companies will bake LLMs into their own products; others will make money by licensing access to them. Everyone is competing for influence and for engineering talent that can help them go faster. Openness can be a strategic move to get ahead by attracting more developers, but often companies also exaggerate how open they are, since it's not always possible to see their data or methods.

Speaker 3

So I've followed these models very closely, and I know every time they're released, I know there is some element of deception.

Speaker 1

That's Abeba Birhane. Time magazine just named her one of the one hundred most influential people in AI. She's a Mozilla advisor and a cognitive scientist from Ethiopia working at Trinity College in Dublin, Ireland.

Speaker 3

I mean Llama, for example, was introduced as, oh, an open-sourced large language model, and I went into the paper hoping to find information, detailed information, because I work with data sets. I went immediately into the data sets section and it was just one tiny, small paragraph in that giant paper.

Speaker 1

Abeba wants to know what's inside the data sets for AI because systems trained on them mimic their biases. Just a handful of data sets get used repeatedly across most LLMs, and these usually include massive amounts of Internet content from an open data set called Common Crawl.

Speaker 3

The Internet can be a really toxic place. It holds, you know, everything from the world's beauty to its ugliness and everything in between. For example, during our audits we've found content such as child abuse or genocide, or a lot of explicit pornographic images. You also have to make sure that personal, sensitive information that could be used to identify individuals, you have to make sure things like this are not included in data sets. That's one of the reasons why we need to audit the data sets we are using to train models.
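As an illustration of what such an audit can look like in practice, here is a small, hedged sketch that streams a sample from a Common Crawl-derived corpus and flags documents that appear to contain personal information. The corpus name, the sample size, and the regular expressions are illustrative assumptions, not the tooling Abeba's team actually uses; real audits also look for toxic, hateful, and explicit content, which needs far more careful methods than a regex.

```python
# A minimal, assumed example of sampling a web-scale corpus and flagging
# documents that look like they contain personal data (emails, phone numbers).
import re
from datasets import load_dataset

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{8,}\d")

# Stream so the full corpus is never downloaded; "allenai/c4" is one cleaned
# Common Crawl derivative, used here purely as a stand-in.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

SAMPLE_SIZE = 10_000
flagged = 0
for i, record in enumerate(stream):
    if EMAIL.search(record["text"]) or PHONE.search(record["text"]):
        flagged += 1
    if i + 1 == SAMPLE_SIZE:
        break

print(f"{flagged} of {SAMPLE_SIZE} sampled documents contain possible personal data")
```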

Speaker 1

Decades of research show the Internet has never been representative of all the world's people or languages, but in generative AI it becomes the ground truth. Abeba and her colleagues have coined a term to highlight the problem they see. Abeba, I noticed in one of your papers that y'all actually use the term data swamps, not data sets. Where did that term come from? Like, why data swamps?

Speaker 3

Data swamp is an attempt to kind of express how such a huge dump like Common Crawl, or even other large-scale data sets now, how they represent not only the good and the healthy of humanity, but also the nasty and ugly of humanity, because you find all kinds of horrible, hateful, degrading texts, especially towards minoritized communities, and you find all kinds of images that are really disturbing to the human eye.

Speaker 1

Even when these enormous data sets are open, it can be too difficult and costly for independent researchers to audit them because they're too big. But even using smaller samples of data, Abeba and her colleagues have uncovered a ton of problems in the past. Their audits of a leading image data set for AI documented so much racism and sexism that it was decommissioned after decades of use. So, Abeba, is it personal for you, the motivation to keep going?

Speaker 3

Yeah, it is a bit personal. When I go into data sets, for example, you know, the first thing I query is around, you know, how Black women are represented, how Africa as a continent is represented, and so on. So when I see all the negative images, or extreme negative, stereotypical caricatures, or, you know, completely inaccurate, false, misleading information, you feel like if you don't say anything, if you don't do anything about it, nobody else is gonna.

Speaker 1

Abeba says we need regulation to make companies more transparent about the data they use and where it came from. She says if companies can hide this information, they can include data they don't actually have permission to use.

Speaker 3

These artifacts are not something that just remain in the labs of big corporations. These are tools that infiltrate into every social sphere. What information goes into them, what kind of data set is used to train them, where the data set is sourced, the quality of the data set itself, how the models were built, and any other important information should be open for auditing and for scrutiny, given that they are almost treated as a social good that is supposed to serve everybody. So some level of openness is really important. In terms of making them entirely open, some people have raised the issue of, if they can be accessed by everybody, bad actors can download them and use them for problematic applications. There is always a balance that we have to keep working around. We have to always try and find that balance between open and closed.

Speaker 1

It's because LLMs and their data sets can be problematic that we need independent scrutiny of them. Could regulation empower people to work together to improve these systems?

Speaker 4

Currently, there's been a lot of kind of, like, polarizing discourse about open versus closed source, as if those were the only two choices, but they aren't the only two choices. It's kind of, like, more productive, more forward-thinking to acknowledge the fact that it's a gradient, it's a spectrum.

Speaker 1

That's Sasha Luccioni, a leading researcher at a startup called Hugging Face. They run an online platform for testing and developing AI. It's so popular that they've been valued at four point five billion dollars. Sasha and her colleagues have a fresh take on the open source debate.

Speaker 4

What point in the spectrum can I pick for this model? And I think it's important, especially for policymakers, to understand that it's not an us versus them. It's not like a two-camp situation. It's really like, let's pick what works for each model. And also, there's no one-size-fits-all solution. Depending on the model, depending on the data, depending on the usage, some point in that gradient is more or less fitting.

Speaker 1

The spectrum of openness Sasha talks about isn't just for a model's code or the data sets. It can be for a lot more, like the documentation and the so-called weights that determine how it works. These are all decision points on openness, along with the usage terms. Sasha's research at Hugging Face depends on openness. That's because it's all about how to measure and lower the environmental impact of language models. She says training the LLM GPT-3 emitted as much carbon as five hundred transatlantic flights, and she says open source technology helps with sustainability in other ways too.

Speaker 4

Definitely, the reason I joined Hugging Face was because I truly believe that by helping open source AI research, we can help the sustainability, the energy side of things, but also in terms of democratization, like giving more people access to models that they can both use out of the box or they can fine-tune in order to fit their context better. I think that's like a net positive for everyone. And for me, it's kind of like recycling or thrifting, or, you know, buying something used and then, you know, patching it up or changing it a little bit to work with what you need it for. And I mean, I thrift like ninety-five percent of my clothes, so that's definitely a philosophy I'm really on board with. And for me, open source is definitely much more sustainable in the long run, because you're not constantly starting from scratch, and also people can work together, and so you have less wasted effort.

Speaker 1

Sasha says a community initiative called BigScience is an example of this. About two years ago, Hugging Face backed one thousand people from sixty countries in a collaboration to develop an open LLM called BLOOM.

Speaker 4

It was literally a thousand researchers and volunteers from all over the world who were like, hey, let's train a large language model together, because we don't have the resources to do it each one of us separately. And it was great, because we had people who were lawyers, we had people who were, like, specialists in archival studies to help get data from different places. Like, I mean, we had all sorts of people from all over the world, and people who don't necessarily have, like, a supercomputer on premise, who don't work in a big tech company that can give them access to some kind of compute to train these models.

Speaker 1

Open communities like this one could be directly affected by policies that either limit or encourage important research into alternatives.

Speaker 4

During the BigScience project, I joined Hugging Face because I was like, yeah, this is the kind of work I want to do. I don't want to have to be secretive about what I'm doing. I want to do it in an open source way, and I want to help other people who don't necessarily have the means to train these kinds of models. I want to help them also benefit from this technology. The fact that we had all these people involved in BigScience made the whole project and the ensuing model much more representative of society, I feel. And that's important, because when these models get used in downstream models or downstream tools or systems, then any kind of information that's implicitly encoded in the model will bubble up to the surface.

Speaker 1

So with all these gradients of openness, it's not only the biggest AI companies developing LLMs, and that can be a good thing. There's an open source alternative to ChatGPT called GPT4All. Amazingly, it works without an Internet connection, and the LLMs are compressed so much that you can download them to any regular personal computer. GPT4All was launched by a New York startup called Nomic earlier this year as a privacy-preserving alternative to ChatGPT. Tens of thousands of people flocked to it. Here's Nomic co-founder Andriy Mulyar.
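For a sense of how self-contained these compressed models are, here is a hedged sketch using Nomic's open source gpt4all Python bindings. The exact model filename is an assumption for illustration; the library downloads the weights once, and after that everything runs locally on an ordinary laptop, with no prompts sent to anyone's servers.

```python
# A minimal, assumed example of running a compressed local model offline
# with the gpt4all Python package.
from gpt4all import GPT4All

# The model file is downloaded once and cached; the name below is illustrative.
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

with model.chat_session():
    reply = model.generate("Help me draft a polite but firm email.", max_tokens=200)
    print(reply)
```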

Speaker 5

One of the biggest focuses that we have around GPT4All is making sure that privacy is the first thing we think about, in some sense. One of the core reasons behind why we even built GPT4All and the ecosystem of models that came with it was because of all these large, sort of, like, issues and concerns about privacy with people using OpenAI's models.

Speaker 1

You may not know this, but when you type prompts into ChatGPT, OpenAI can use whatever you type to further train their models. There have even been numerous privacy leaks because of it, both corporate and personal.

Speaker 5

The privacy angle that we focus on specifically is making sure that the application, in its open source form, you can see all of the code, so we start out from that. That makes it safe. We make sure that everything's audited by the community. And the next thing is that we make sure we align with all laws and regulations across Europe and across the US. We don't gather user-specific data when they use, for instance, the models, and we make sure that the models can run without access to any Internet. So once you download the models to your computer, you can turn off your Internet. If you're stuck in the jungle and you don't have access to the Internet, you can ask it for help.

Speaker 1

Nomic's mission is to improve the explainability and accessibility of AI. Their main software product is a data exploration tool for massive data sets called Atlas, but Andriy believes GPT4All is important for them to devote resources to as a company.

Speaker 5

When you run a business, there are certain things you get the opportunity to do that you wouldn't be able to do if you weren't running a business. One of those is you have access to capital to be able to work on risky projects like GPT4All purely because you want to, not because, you know, there's some direct revenue-driving source behind it.

Speaker 1

Mainly, Andriy says he's motivated by a wish to see AI developed by more than just a handful of companies. But he also raises a question of values, and who decides how LLMs behave.

Speaker 5

So biases aren't always bad. An example of a bias could be that the model always, you know, prefers to greet you with a salutation before giving you a response. Right, that's a bias that might not be bad. But obviously there's biases that could be bad, right? And one of the sort of important things with large language models is the fact that you can actually go in and customize this.

So if you have your own examples of data that you would like your model to be able to output, you can actually change that by training the model.

Speaker 1

Andriy offers the example of OpenAI training ChatGPT not to output hateful statements. Today, GPT4All gives access to models fine-tuned not to offend, as well as some that aren't. Andriy says they've had some backlash from people criticizing them for giving more people access to LLMs that could be used for harm.

Speaker 5

The reality is, like this technology isn't going away. The biggest thing is we need to learn how to live with it and how to be able to cope with the side effects that emerge from it. A lot of them will be positive, some of them are going to be negative. Like one of the things that I guess I think about quite a bit is like what happens in the twenty twenty four election in the United States.

You can go in and pick ten thousand people, get their Facebook profiles, and customize a chatbot that pretends to be a human to convince them to think one way or the other, and you can do that for no cost at all. I guess the thing that keeps me awake at night is if we're going to live in this inevitable world where we're surrounded by machines that can generate synthesized versions of information, and all that information is being piped from one or two companies' servers. If there's a world where someone like OpenAI owns all the pipes for the information flow, then they get the chance to manipulate that however they want.

This is like why we do what we do. We want to make sure that these generative AI models that persist through the world are built with everyone's view into how the models are being created, not just a couple of organizations behind closed doors with unlimited resources.

Speaker 1

LLMs are here. Open source communities that do put people ahead of profits are crucial to unlocking the positive potential of generative AI. The challenge for builders and regulators is to find that balance: on the one hand, so generative AI isn't developed or deployed in harmful ways, and on the other, to empower independent researchers to contribute to how systems work. I'm Bridget Todd. Thanks for listening to IRL: Online Life Is Real Life, an original podcast from Mozilla, the nonprofit behind Firefox. For more about our guests, check out our show notes or visit irlpodcast dot org. This season, we're talking about people over profit in AI. Mozilla: Reclaim the Internet.
