The Ethics of Large Language Models with Amber McKenzie

00:01

How'd you like to listen to dot net Rocks with no ads? Easy. Become a patron. For just five dollars a month you get access to a private RSS feed where all the shows have no ads. Twenty dollars a month will get you that and a special dot net Rocks patron mug. Sign up now at Patreon dot dot net rocks dot com. Hey, Carl and Richard here. As you may have heard, NDC is back offering their incredible in person conferences around the world, and we'd like to tell you about them. NDC Copenhagen is

00:34

happening August twenty seventh through the thirty first. Go to NDC Copenhagen dot com for more information. NDC Porto is happening October sixteenth through the twentieth. Go to NDC Porto dot com to register and check out the full lineup of conferences at NDC Conferences dot com. Hey there, this is Jeff Fritz, the purple blazer guy from Microsoft, letting you in on a little secret about my friend Carl Franklin. You know, the guy who started dot net Rocks, the

01:03

first podcast about dot net in two thousand and two. The guy who's been teaching Blazor on YouTube since twenty twenty. Yeah, that Carl Franklin. Well, Carl's joined up with the folks from Code in a Castle to teach a week long hands on Blazor class at, are you ready to get this? At a castle slash villa in Tuscany. It's sort of a luxury vacation with Blazor learning

01:30

built in. Carl's calling it the Blazor Master Class. You'll learn Blazor from the ground up, finishing the week with the ability to build and deploy Blazor applications. Since the training happens for only four hours in the morning over six days, you can bring your significant other, your partner with you, and you should, right? This part of Italy is absolutely beautiful. There's so much to see and do, and Larry and Marco from Code in a Castle are organizing daily activities

02:02

both at the castle and in the area. The castle is in the Maremma, a less touristed region of Tuscany, offering both classic Tuscan hill country as well as easy access to the Etruscan Riviera, with sublime local food, wine and olive oil around every corner. Breakfast is included every day. There will be two communal dinners at the castle bookending the experience, and most other meals and all activities are included. And did I mention you'll learn Blazor in person

02:34

from Carl Franklin. Listen, space is limited and for very good reason. This is quality training in a beautiful setting. Go to code in a castle dot com slash Blazor twenty twenty three, that's B L A Z O R two zero two three, to take advantage of this amazing opportunity to join Carl in Tuscany for an unforgettable week of la dolce vita while advancing your programming skills in this important new technology. Welcome back to dot net Rocks. I'm Carl Franklin, and this is

03:21

Richard Campbell. What's going on over on your side of the continent, my friend? Packing boxes and hauling them up to the coast. Like, after we finish the show today, I'm literally loading the truck and taking a run up. We're trying not to end up with a storage locker, seeing how we're moving. We have a very big house that we're moving out of, with a lot of stuff, into a smaller house that's already furnished. Yeah, so you know, it's just a juggling of a lot of stuff. But you're going

03:50

to have a party. Oh, there's a few parties. There's been a few parties. Well, let me know, I might just make the trek. It's a long way out, but yeah, we'll figure something out. Certainly. When we're settled in up there, we'll do something fun. But it's good to be married to a woman who tends to argue with the spreadsheet anyway, because it didn't take her very long to figure out how much stuff the garage holds and say, okay, that's how much stuff we're allowed

04:12

to keep, yeah, and then condensed accordingly. Love it. Yeah, it's organized. Well, that's good to hear, Richard. Let's get started with Better Know a Framework. Awesome. All right, man, what do you got? Well, this being show eighteen fifty seven, you can go to eighteen fifty seven dot pwop dot me, and that's a link to an article at Digital Trends, which I think is kind of important. Maybe it's important. We'll find out, but anyway, it's called Here's why people think GPT four

04:48

might be getting dumber over time. As impressive as GPT four was at launch, some onlookers have observed that it has lost some of its accuracy and power. First of all, was it ever accurate? I don't know, but sometimes. I mean, the underlying Harvard study is pretty great. Yeah, and they are modeling it well. I thought the bigger message from this thing was they're doing updates and not telling anyone. Yes, right, like this

05:18

is all guesswork from everybody else. Is that they're constantly refreshing this data set, it seems, and they're not really talking about where their sources are or anything like that. No, we don't, we don't really know. Yeah. Well, anyway, and also I've noticed: don't ask it to do math. That's not its thing. It doesn't. It will lie and

05:42

be very adamant that it is correct. Yeah, it's, uh, yeah, in the end, when you realize it's basically just text driven and you're hoping that the text on the Internet was accurate, which is a hilarious statement all by itself. It is absolutely hilarious. Well, yeah, I'm sure Amber's got some opinions about this we're talking about, but we'll bring her on in a minute. First of all, who is talking to us today, Richard Campbell?

06:05

I grabbed a comment off of show sixteen oh five, so that's going back a bit, to December of twenty eighteen, and not really about large language models, because who was talking about large language models in twenty eighteen?

06:15

But this was a show we did with Jared Rhodes, who's an MVP, and we were talking about IoT and edge computing at the time, and mostly what we were talking about was this continuous data collection mechanism, like just our ability to collect a lot of data about people, especially with so much compute in so many places, with so much ability to gather and stream that information back. And we ended up in an ethical conversation around data collection. And

06:40

so Mike had this comment on the show shortly after it was published. He says, this is a great ethical discussion about collecting data. I always revert back to Kranzberg's first law of tech, which is: tech is neither good nor bad, nor is it neutral. Which, you know, I actually went and looked it up. It doesn't have to be right. It's like, again, we are personifying or anthropomorphizing technology. It's the people using the tool that

07:05

is the issue. So it just is. Yeah, well, yeah, the thing is when we attribute it to the technology rather than the people who operate it. Right, right, we're giving agency where agency doesn't exist. The mentioned store's data gathering is troubling. So this is one of the devices we were talking about. You obviously opt in by seeing the monitor and continuing to enter, but is that enough? Shouldn't the customer know to what extent

07:27

their behavior will be analyzed and for what purpose? What of the patrons who, beyond their own control, only have access to that store? Are you now alienating their business and taking advantage of the situation to gather data? So this is the store monitoring system where they're figuring out what people's buying behavior is just by having these sensors, right? Yeah. Programming ethics have always fascinated me.

07:47

Recent protests (of course, this is twenty eighteen, if you can call back to then), recent protests by Google employees over developing censoring products for the Chinese government and military machine learning for the US government highlight the backlash companies face when adopting technologies for different applications. But it always boils down to this: you will always be able to find a developer to do what you want to have done, right? There will always be someone with the ethics that

08:11

align with yours or who's desperate enough to do that work for you. This isn't unique to programming. However, programming has a unique advantage of being cheap to scale, so one developer is going to have a profound impact on humanity versus one unethical police officer or one unethical investor. Had Bernie Madoff known how to make an ASP dot net site, yeah, he could have offered his services anonymously to many more people. So it's always a great discussion on what

08:35

to do responsibly. How do companies talk about this? My experience is most don't, but we will eventually need to. So it just seemed to be on point with what we were talking about today, and certainly what I love is dragging out a conversation from five years ago, and hey, pretty relevant, kind of on the point there. Hey, Mike, thanks so much

08:56

for your comment, and a copy of Music to Code By is on its way to you. And if you'd like a copy of Music to Code By, write a comment on the website at dot net rocks dot com or on the Facebooks. We publish every show there, and if you comment there and we read it on the show, we'll send you a copy of Music to Code By. And you can definitely follow us on Twitter if you want, but we're on Mastodon. I'm on Mastodon a lot more these days. I don't know about you, Richard, but I'm here

09:15

and there, you know. Yeah, if you want to follow me on Mastodon, it's Carl Franklin at tech Hub dot social and I'm rich Campbell at mastodon dot social. Send us a toot. Toot, toot. Boot scoot, you're back there, are you? Okay, toot scoot boogie. Okay, let's bring in Amber. Amber McKenzie is head of data science and analytics at

09:39

HR tech startup Fama. Her background is in computer science and linguistics, and she's been doing data science, machine learning, and natural language processing for almost fifteen years now. Welcome back, Amber. Yeah, thanks for having me. Welcome, thanks for being here. And we had the ethics and AI conversation with you like four years ago. But I think when I wrote the email to you, I'm like, I don't know if you've noticed, but

10:03

some stuff has happened, and you pretty much picked up that thread. You're like, yeah, a few things have happened, little stuff here and there. Yeah, it's an interesting time. Are we now in a place where we're really challenging that debate? Like, four years ago we were talking about sort of the bias in the machine learning models and that kind of thing. Is ChatGPT that, writ large? You know, it's interesting to

10:28

me, and I've had this conversation a lot with people lately. There is sort of this idea that ChatGPT is like some new, way different, next level thing, and to some degree it is. But in my mind, it's not any different than some of the innovations we've had over the years. I mean, you know, we went from always sending letters to sending email. That was a huge, big different thing, right? You know, when Watson came out, everybody was like, oh, that's going

11:01

to change the game. That's going to be the new thing, right, and then it was Google and now it's ChatGPT. And what essentially is going to happen is that the opposite side, the people who are trying to regulate it, trying to create things to combat, you know, bias or combat, uh, you know, deepfakes, that type of thing, is just going to catch up, right? It's always kind of a game of cat and mouse.

11:30

Yeah, yeah, it always is. Yeah, but at least we know what we're racing against, like, I would hope. Yeah, accuracy is actually important, it is. I mean, I don't know. I liken it a lot to Google. We look to Google, and Google doesn't have all of the right things. And when it first came out, we were like, you know, oh, it's all magic information.

11:58

And then everybody's like, well, you gotta check the sources. I would argue that when we used to all look up everything in encyclopedias, that wasn't all exactly right. So, like, it's just... Clearly, though, the way people are perceiving this change is different than all of these other things. I mean, I don't remember that much kerfuffle between governments and agencies getting together and saying, what are we going to do about this? You know, we need

12:26

regulation around this. I mean I guess they did with Google and Microsoft to some extent, but you know, other AI things haven't been Yes as hitting you over the head what this is. I'm glad to see it, though, I mean it took I remember, so I worked at Sandia in an internship and one of the problems they had at the time is all the national labs and all the agencies weren't even working together for you know, cybersecurity purposes.

12:58

So you know, one person would come up with a solution for a zero day attack, and then they wouldn't share it. And that's what they were working on at the time, was just getting to a point where they would even share information. So the fact that we're already like, hey, we've got this new thing out there, we need to come together and put the guardrails in place, is for me kind

13:22

of a refreshing, sure, take on things. And it's like, oh yeah, people recognize now that it's not just a put it out there, let it run wild thing. Let's figure out the best way to, you know, especially for the sort of non technically inclined people who don't, you know, have some of the understanding that is, oh, I've got to take this with a grain of salt. I need to, you know, check this or check that.

13:50

So saying, hey, let's try and put, you know, the bowling bumpers on and make this thing, you know, go down the lane a little bit easier. Yeah, the thing that worries me about government regulation isn't the regulation itself. I think that's great, but it's that government takes so long to do things and to regulate, and this thing is evolving faster than anything we've seen before in tech, at a much more rapid pace. For example, I just read an article in The Verge that ChatGPT can

14:22

now remember who you are and what you want. With new custom instructions. You might not have to tell the chat about your life story every time you have a question, but do you know what you want the bot to know? I'll put a link to that. That doesn't surprise me. And I think in those instances, the private sector is going to be the one. I mean, there's going to be essentially the market opens up for yeah, for industry to say, hey, let's let's find a tech that combats that,

14:54

and there'll be a market for that. People looking to fill that gap. But I agree with you, the big cogs are slow. Yeah. I hope the government does big picture regulation rather than, you know, picking on these features and those features, which can easily change from under the regulations and screw everything up. That's where I like the data source one you brought up before. It's like, just at least publish where you got your data from.

15:20

Publish where that came from. Yeah, I don't know if you see that as reasonable, Amber? Oh, for sure. I mean, in terms of that, that's what's interesting about what's happened recently. For me, it is not the development of these algorithms. I mean, that's been happening. They're just taking it to a new level. It was the mass market of

15:45

it. They put marketing resources around it, they packaged it, they made it suddenly visible, and that changes the stakes of the game from something that is research based, where you put papers around, people have to review the papers, they need to have reproducibility, to, you know, now it is a product, a commodity, which puts it in a different space. Which, by the way, Microsoft just got a nice, big old fat stock price bump

16:12

on, like, the marketing spoke and you're killing it in the stock market. We talked about this on Windows Weekly. It's like, they put out at Inspire that it's thirty dollars a month per person and you're like, holy man, that's a lot. An E3 subscription's only thirty six dollars. But if you want the Copilot, now it's an extra thirty bucks. Stock price goes up five percent. Like, if I'm Satya Nadella, I'm laughing all the way to the bank. Like, that's a lot of money. What's your opinion, Amber,

16:44

on the general public's acceptance of this stuff? I mean, it's been my experience that people are kind of like, oh, that's cool, you know, and they use it when they can and they don't think too much about it. But I guess these new regulations, some of the stuff that they're floating around sounds like: we want to have some sort of watermark on anything that has been generated by AI, whether it's text or video or, you know,

17:11

or whatever. But that of course puts it in the hands of the content creators, and so you'll have the problem where everybody who's following the law and everything will use the watermark, and then those who want to publish stuff for nefarious purposes won't use it. Honey, that's the case of everything, right? Yeah, you know, for me, it's one more thing that we have to teach about. I mean, right now, even in my kids' English classes, they are taught about primary sources.

17:45

They are taught about not just finding something on the web and taking it at face value, and that they can't just take that information, roll it into an essay, and their teachers are going to take it. That's going to be just one more thing in a realm that has to be taught. But I don't see, I mean, you see degrees. It's almost like the Kevin Bacon thing. You see the degrees of separation. It's like, you know,

18:11

we're in tech, we are uniquely aware of it. And then you've got people who are pretty tech savvy who are like, yeah, this is cool, maybe, you know, like I have a friend who had it write his resignation letter, you know, stuff like that. And then you get, you know, to my parents, who probably don't even know that it's happening right now. And it's like that degree of separation requires just more

18:33

and more education about it. And I think that's what I hope the government sort of, I mean, even if it takes a little while, the sort of the guardrails, if you will, will essentially help those people who you know, aren't aren't going to go through the diligence of understanding it and understanding how to use it. And what to do with it and try and mitigate, um, you know, the damage that can be done there,

19:03

because again, it's the same thing. I liken it to Google. I mean, I have grandparents who go and Google something and they're like, oh, but this, this, this, And I'm like, right, no, that's just not accurate. That's just not the case, right, And it's the same, you know, it's the same type of thing. And then the watermark stuff. I mean, I think we're going to come out that the software is going to come out to try and spot plagiarism, fakes,

19:30

that type of thing. It's just going to have to catch up, right? It's going to get ugly. But, well, I don't know that it's going to be any uglier than it already is. It's just becoming more obvious than it is ugly. Like, I think your language there, Amber, about "they Googled this, so it must be true," right? There's just no thought to the idea of where that data came from that Google surfaced for you; it's not in consideration for most people, right? Yep, or at least for

20:00

some people anyway. Certainly, then we get into the confirmation bias thing, where it's like I look for the link that sounds most like what I want it to sound like, and that's the one I use. Absolutely. And in that respect, nothing has changed. Again, it's as you talked about earlier. It's the

20:15

scalability issue. It just makes it more accessible. In the olden days, we just had you know, snake oil people, right, and they were still doing that right, Like they're going around in their wagons and they're telling people stuff and selling stuff, and it was harder for them to get around and spread that message, but it was still happening. It's just now it's easy, it's in your face, it's accessible. It just gets further and

20:38

further that way. What if watermarks become required and then there are penalties for those who don't use them? I can just see that turning into a shite show. I mean, how, I mean, when are we going to put it? You know, if I find something to tweet and ChatGPT helped me write it, do I have to put that on my tweet? And now does that mean that somebody's going to look at that and say, oh, that didn't come from my brain, you know what I mean? And then if I don't do it, when am I going to pay a

21:14

fine? And how much is the fine going to be? Is it going to be more than Elon Musk charges me a month to use Twitter? I can't imagine it'd be a fine. I think more likely my viewer, that is to say, I'll have the option to say I only want stuff that has a watermark, and so your stuff just won't be visible if it doesn't pass that quality gate. Yeah, the market's going to dictate that.

21:33

I mean, to some degree. You're already seeing some of it, where there have been ramifications for people that have used ChatGPT in their jobs, right, or on their resumes or something like that, and you've already seen instances where it's come to kind of bite them. Those things are going to happen. The people are still going to do some of the bad stuff and get away with it, as they always do. And there, in my mind, we're gonna

21:59

end in some middle ground. I don't think it's ever going to be as far as like super regulated. That just doesn't ever really happen. They're going to try, and there's going to be some backlash and then there's you know, we're going to cut out a bulk of the easy stuff that people are trying to do, and people will find that it's not worth it to lose their job or whatever. And right, we'll end up in a space where

22:22

some people use it for good and some people use it for bad. And, well, this is that Gartner hype cycle, right, where I think we've come off the peak of inflated expectations and we're headed down into the trough of disillusionment right now, and then you sort of climb your way back out into sort of reasonable expectations of what it can and can't do. I mean, I've certainly been encouraging people: don't use this software to try and discover facts,

22:49

right, use it to discover ideas. It's a good creative tool. But any fact you want that it spits out, you need to validate, and you'll find a stunning number of them are incorrect. Yes, running is right. Absolutely, Yeah, it's the classic one. The one that I first said, okay, well here we go down the trough was the lawyers that used it to write summaries to their cases that cited cases that simply did not

23:17

exist. Didn't exist, yep, and they didn't check, yep. But you know who did check? The judge. And then they were in big trouble because it was their name on it. Yeah, you can't use the "software wrote it" excuse. Like, it's not a thing. You submitted this as you; you're on the hook for submitting inaccurate information. Yeah. And then, you know, the industry, they'll, you know, at that point they'll see and be like, oh, you know what, that's not lucrative for me to

23:45

do. It might save me time, but the risk is not worth the reward. And I gotta think that's just like a cautionary tale for lawyers, full stop. Like, how many lawyers are like, okay, well, we're not using this now because it's a risk. Although at the same time, I think that's an overreaction too. It's like, check your facts, right? That's always a good idea, by the way. You set down guidelines.

24:07

I mean, like, my company sent out, so they encouraged us to explore ChatGPT, find ways that it could kind of help us do things, but they also sent out a, you know, essentially a regulatory doc that we need to sign that's like, hey, these are the boundaries, right, like, these are the ways to use it and the ways not to use it. And I imagine a lot of companies will do that, right, and it's like, hey, this is how we're okay with you guys using this

24:34

tool. This is going to become a part of the employee handbook. Yep. Yeah, especially when you see Microsoft and presumably Google productizing it. Yeah. Absolutely. Now, if the company is paying for it for you, there better be guidelines. Yeah, you would hope. I think so. I mean, of course we're still premature here. This is coming out

24:56

in early August. We haven't actually laid our hands on M365 Copilot, even though they've announced a price. Like, you can't just buy it yet, so we still don't actually know what it's going to do for us, or whether or not it's worth thirty dollars a month per person. Wow. And that's where, you know, the market dictates, right? Yeah, they act. I mean, just with any other software, you've got to come out with the value to warrant the price. And if they

25:26

don't, yeah, then people won't pick it up. Right. Well, I think this is the other side of this, which is that we're still trying to figure out if this is a product people are actually willing to pay for. Yeah, and because the other thing we don't really know is the cost of operations. Like, we know what they're willing to charge for, but I'd be fascinated to know how much it's costing Microsoft to run those models on our

25:45

behalf essentially for free. You know, I don't know how many folks have signed up for ChatGPT Plus, much less what it's costing ChatGPT on the back end. I did. I have ChatGPT Plus, and then I went to use the API, and I found that, oh, that's a

26:03

separate charge, the GPT four API. And not only that, but you have to put your name on a waiting list, right? So yeah, if you want to use the API, even to use stuff like the playground, which I've been using on the AI Bot Show with Brian McKay, you have to be validated, and pay

26:23

extra for that because it does more than ChatGPT. I mean, I do feel a little bit good about this gatekeeping. Like, it's not like this thing is sitting out in open source and anybody can set one of these LLMs up

26:34

and running however they want. Like, the fact that you have to be an authenticated developer so they kind of know who you are, and then you're working against their instance, like, that at least gives me some sense of governance, right? Yeah, and I just don't, I mean, at this time, you know, it takes an enormous amount of resources to train, pre-train one of these, right, and then a lot of what they're doing over time is fine tuning, right, tweaking it. And I'm seeing,

27:08

you know, the job listings are out there now, right, to work on these things, to do some of the fine tuning and that sort of stuff. And that, too, is kind of what you were talking about in the beginning, about whether it's degrading or not. And I think it does so many things that trying to test a handful of things that it does does not really test the entirety of it and what it's capable of, that sort of thing. But it's going to take, I

27:41

think a good bit to try and manage that process. Right, they're fine tuning it and having to basically because they've put it out to the public. Look at how that's received just like a software product, right, you know, do you put out an update and people get all mad about it, right, and then you have to kind of go back and tweak it. It's going to take them a while to sort out that that management. I

28:08

think I got it. I got a sense that when they put ChatGPT out in November of twenty two, it's because they had run all the tests they could think of, and so now it was, well, instead of continuing to pay for people to write prompts to validate this behavior, we can get it for free by putting it out in public. Yeah, right, you know, except for that whole hundred million users signing up in sixty days thing, which was weird. You know, I don't think anybody planned on

28:36

that. I mean, their reaction to that was clearly intentional, But I don't think anybody started in November of twenty two saying, you know what, one hundred million people are going to sign up for this in two months. Well, that's a new you know, in the last few years, that's a new space for us. It didn't used to be that things like that traveled that quickly, right, Even though, right we're all digital and that

28:57

sort of thing. But now, I mean, my kids know about new things before I do, because it gets blasted out on TikTok and goes all over the place, right? And that's new, especially for us older people. Right, it's like, how does everybody in my neighborhood now know about this thing? Because TikTok, you know,

29:21

craziness. Yeah, no, it's an interesting aspect to that. But you know, I'm not a big fan of the folks saying, hey, we should put a pause on this, you know, shouldn't be productized. My instinct is against that. And yet when we talk this way, it's like, it's pretty clear you probably productized too soon, like you have not tested the edges of this product. At the same time, I also see Microsoft looking at it going they're probably spending millions a month in Azure resources operating

29:52

this thing. If you don't get customers signed up for it soon, you might have to turn it off. Like, that's just a lot of money. So it's a sticky situation. I mean, it's, you know, a product essentially that has gone viral, that you really didn't, you know, I imagine they expected some, but I'm with you. I imagine they didn't expect, you know, sort of the degree of it, and it's like, how do you

30:17

manage that with any sort of product? Right, all of a sudden, everybody has it and they all have opinions about it, and you know, how do you scale that up? And do you pull it back or do you try and tweak it? And how do you manage expectations and backlash and all of that stuff? I mean, there's a reason I work at a smaller company, because there's just a lot of stuff there that I'm just not interested in managing. Because at the same time, you've got Google with Bard saying,

30:47

hey, this isn't ready yet. We're going to hold this back. And I think there's a sense of folks who're saying, oh, do you just say your thing's not working yet? It's like, I'm pretty sure the other thing isn't working yet either. But you know, there's all these degrees of working. They're playing a game, right, and there's always a thing like, you always have to take that gamble: when do you do something, when's

31:07

the right time? Am I you know, is it going to pay off for me to wait and put out something that's later, you know, maybe more mature and received better, or maybe they're missing out because they're missing out on this initial fervor. Yeah. Again, that's not stuff that I am very good at. So also a reason I will never run a company. It's a great start up story. It's like, did you want to be

31:30

the first mover because first mover has advantages? Or do you look at it as the pioneer's the one with the arrows in his back? Right? Absolutely. Yeah, both can be true. It makes you think of that. Have you seen that F one show on Netflix? I wasn't really into F one until I saw it. It's really good. I

31:49

highly recommend it. But they... the racing cars? Yeah, yeah, and they show a lot of, like, the behind the scenes decisions about, you know, if it's raining, do you decide to put your rain tires on now or later, or try and extend the things? Or, you know, when do you maneuver and all that stuff. There's just so much, you know, intricacy that goes into that. And it's the same type of thing. It's like, when do I put my car out, or when do

32:17

I hold it back? Could we then surge ahead? And some of it pans out and some of it doesn't. Yeah, is getting a three month jump on a market going to give me more advantage than doing three more months of research? Right. At least it's a great question, and obviously we're seeing it

32:34

play out right now in two different positions. I think to a large degree, I think, oh, it'll be the source of books in a couple of years, right? There'll be so many books about it and what worked and what didn't and what you can learn from it for your own company in the future. Absolutely, the race for AI. No, I see it, absolutely. And Amber, I'm gonna interrupt for one moment for this very important message. And we're back. It's dot net Rocks. I'm Richard Campbell. That's

33:02

Carl Franklin. We're talking to our friend Amber McKenzie a little bit about... well, I just called this the ethics of large language models. But I think there is some ethical conversation going on here as we're trying to figure out the consequences of what's being built and being basically played out in front of us. I wonder how much of this also is just the culture of modern software development, where it's like, just stick it in front of the customers. We have telemetry

33:27

and we'll fix it via the internet going forward. Yeah, you don't. It's almost like you don't have to be responsible anymore because it's all fixable. It's true, but also like you put this thing out and oh you don't like how it works. Do you think it's getting worse? Well, I've got all these other people lined up to use it, you know that sort

33:45

of thing. It's like, well, we'll get the feedback. The feedback is oftentimes more valuable than anything else, right? Yeah, well, that's certainly where ChatGPT started, was just a feedback mechanism for more prompts and more data on how it was behaving and how people were reacting to it. So it's just, you know, this sudden flip to make it a product. It's interesting to see if it's going to be good and what that

34:12

even looks like. I'm still not convinced that this isn't a dead end, because my experience with software has generally been, if the only way you know how to make it better is to make it bigger, it's probably not that good. Yeah, I've been saying that about LLMs lately. I'm one of those, uh, non-jumpers, is what I call myself. So I tend to wait on a lot of things, like if my company rolls out a new software that we're going to be using, I'm like, are

34:44

we... I'm not going to dive all into it. I'm going to wait to see if we're really going to use that, or in a couple of months they're going to be like, ah, no, we made the wrong decision, right? And so I actually have not been as all in on it, you know. I think that there's probably, yeah, I could spend some time and really find ways to help my productivity or that sort of thing. Yeah, but it hasn't been this thing that I've been like,

35:07

oh my gosh, that makes all the difference. I've got to use it, you know, and I'm waiting to see where it goes. I mean, a lot of times that's true too, that if you're the second mover, you don't get the arrows in the back and you get the advantage of their hits to say, okay, this is a better way to use it. Yeah, and maybe some better tooling. Like, there's a lot of second mover advantage there. Is English the only language that ChatGPT and GPT-4 uses

35:36

and is that an issue? So I don't know the answer, but I would assume not. I mean, the machine translation space is pretty solid and we have a lot of data in other languages, but I actually don't know. Right, but is it translation at that point? I mean, because language models are that, right, it's trained on language. It's a... yeah, you know, there's actually been a... ChatGPT is available in fifty languages.

36:00

Okay, that's first and foremost right, But there have been some great papers about the tokenization strategy and how it doesn't map symmetrically to all languages because different languages have different architectural portrays, and because it was trained on English, wasn't

36:16

it? Well, fundamentally the tokenization technique was built by English speakers. Yeah, like I wouldn't even say it's just the training set, per se, in that the idiomatic structure of the tokens is based on people who were thinking in English. And so there was a really great piece I read about its mistakes in Urdu, not that I speak Urdu, and I know

36:39

it's in Pakistan. It's a very old language, but architecturally a fundamentally different language, and the token system just didn't work as well for that one. Yeah, and it just begs the question, like, if Grant were here, he would be saying, what's language anyway? You know, it changes all the time. So what's formal English language today or Urdu language today will change

37:00

in ten years. I'm reminded of that awesome show Knight Rider. Do you remember that, Knight Rider, with KITT, KITT the car, the AI car, right? And I thought one of the funniest things I ever heard about that, one of the funniest bits, was from Eddie Murphy, who's like, you know, the door is ajar. He says, you know, if that were in the hood, it'd be like, yo, man, someone stole your

37:24

battery him. It's true though, Yeah, that's actually a big I mean that speaks back to the data bias situation, where, right, the amount of data that you have in different languages or different dialects or whatever, it really does dictate sort of the outcome that you're going to get from any of these models, and most of it isn't you know, more English oriented or white oriented English? Right? Yeah, white English yep. And part of this also is now this film. It seems to be the philosophy that a

38:04

larger data set is better. Yeah, and so, like, I go over and look at Wikipedia, say, there's far more English articles than anything else. Are they more accurate because there's more participation, or are there other languages with fewer articles that are actually higher quality? Well, the argument that more data is better, it has nuances, right? So if you are doing a particular thing, having more data that speaks to the thing that you need to do, yes, that is

38:31

accurate. But say you're doing classification, you've got five classes. Having more data in the one class that you already have the most data for actually degrades the model for the other ones. So having more data in that first bucket, you know, say white English, having more data there does actually not help. It actually will bias the model towards those things.
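
A minimal Python sketch of the five-class point above. The class names and counts are invented for illustration, and the inverse-frequency weighting is one common mitigation, not something named in the conversation. It shows that always predicting the biggest class scores well on overall accuracy while getting nothing right for the other four classes, and how weights of n / (k * count) push back against that.

```python
from collections import Counter

# Hypothetical label counts: five classes, one of which holds most of the data.
labels = ["a"] * 900 + ["b"] * 40 + ["c"] * 30 + ["d"] * 20 + ["e"] * 10


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def majority_baseline_recall(labels):
    """Per-class recall of that same always-predict-the-majority 'model'."""
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    return {cls: (1.0 if cls == majority else 0.0) for cls in counts}


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count) so rare classes count for more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


print("accuracy:", majority_baseline_accuracy(labels))
print("recall:  ", majority_baseline_recall(labels))
print("weights: ", inverse_frequency_weights(labels))
```

Adding another thousand examples to class "a" would raise the baseline accuracy and lower the weight on "a" further, without doing anything for classes "b" through "e".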

38:57

Right. So you need more data in a diverse area. You need more data in the places that you don't have a lot of data, or if you don't have data about a certain subject, or if you don't have, you know, data in a certain thing you're trying to do. That helps, but not across the board. Yeah, I mean, I guess when tokenizing, you have to determine, okay, does this apply to every culture, every language, or is this a language or culture specific thing?

39:25

Right? Yeah, I don't know the answer. I'm not smart enough to know that. Well, and it's twofold. It's the linguistic part that is sort of the translation, or the, how do I, you know, when I'm setting up questions, when I'm setting up prompts and I'm translating those into other languages, or even if I have a model that'll take it from other languages, if I don't have enough data to support that,

39:47

then it's not going to be as good. But then the cultural differences. So even if it's the same English, if other cultures ask questions in a different way, if they speak in a different way, if they provide their thoughts in a different way, then it's not going to be as effective.

40:07

Yeah, yeah, yeah. And I think this is where, like, you can go to the platform tokenizer on OpenAI and actually see how it tokenizes English for GPT-3, just to get an idea of what they're talking about, and then recognize how English centric that actually is. Like, again, I don't know enough languages to say. Now, if I was fluent in, and I would prefer not even a Romance language, something very different, do the tokens actually make sense? I don't know how much, really, how much work has been done

40:39

in that area. Richard, when you say you can see how it's tokenized, is it tokenizing your questions or the data that it's trained on? You can literally just write some text and it'll show you. What about the data that it's trained on? That's my... No, it doesn't do that. But I'm just saying, you know, folks are asking about tokenization, and look, I'll include the link in the show notes, but you can literally take text, plug it into that, and it'll show you how the tokenizer

41:07

represents it. Well, and this ends up being a problem. Actually, the productization of it becomes a problem there because products are oriented towards their users. So you know, if you're making it a product and you're trying to sell it, you're going to make it work best for the majority of the people who want to buy it, and that oftentimes is not the marginalized communities.
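
A rough Python sketch of why the tokenizer point from a moment ago matters for a language like Urdu. This is not the OpenAI tokenizer and makes no claim about its actual output. It assumes only that byte-level BPE tokenizers start from UTF-8 bytes before any merging, and it compares how many bytes per character two sample words need at that starting point. The sample words are illustrative choices, not taken from the conversation.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Illustrative samples: a Latin-script word and an Arabic-script word.
samples = {
    "English": "hello",
    "Urdu": "سلام",
}

for name, text in samples.items():
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    print(f"{name}: {n_chars} chars, {n_bytes} bytes, "
          f"{utf8_bytes_per_char(text):.1f} bytes per char")
```

A tokenizer whose merges were learned mostly from English text has fewer learned merges to compress those extra bytes, so the same amount of meaning can end up costing more tokens. How large that effect is for any real model would have to be measured with that model's own tokenizer.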

41:34

Well, and yeah, and you get into, now you have all the market forces, which is, what's the appropriate demographic for me to optimize return, especially in the early days of a product like this where they're desperate to get some returns quickly. Right, they're not looking at diversity or any of those things. They're like, who's got the money that we know how to sell to now? Right? Absolutely. I mean, and again I go back to M three sixty five Copilot, which seems to be the first one really

41:59

product since GitHub Copilot, where they're targeting large American corporations. Yep, because they're already customers, they're going to use it en masse, like, there's a lot of reasons. I totally get why they would do that, but it has nothing to do with any kind of sense of diversity to the data. That's probably a bigger concern in my mind than, you know, some

42:22

of the misinformation stuff. The misinformation has happened, and it will happen, and people will make things to combat that, but there isn't a market to combat the lack of diversity and bias. There are smaller groups who work towards it. But again, if it's a product and it's somebody keeps it,

42:42

you know, privately held and that sort of thing, there isn't an impetus for them to fix that, and that widens the tech gap, widens the gap for people having, you know, access to things that other

42:55

people have access to. It sort of heads us further in a direction that we have been going. And that's where you really have the argument about making a product in this particular scenario, is that because of the need to show returns quickly, you're going to bias towards where those returns are likely to come from. Yeah, although, Amber, in our past conversation you said, like, step one of dealing with any bias is know it's there. Yeah, what's step two? Well, in this case, you have to care about

43:25

it. Yeah, but I think we always did, or we were always supposed to. But that's again where it comes into the product side, is like, yeah, when you take it out of the research, right, the research arena oftentimes will penalize for bias, for things like that, because you have a diverse group of researchers who were like, hey, did you think

43:47

about this thing? And we care about that, we're not going to, you know, approve your paper or whatever it is, if you haven't done your due diligence. When you take it out of that realm and put it into, you know, a commercial product, what do they get for increasing the diversity there? What benefit does it do them? In fact, it's probably worse for them because they have to put more time and resources into something that

44:09

they're not going to get a return from. So yeah, it's to me, you know, and again I've done enough product work to be pretty comfortable with these terminologies. It is like you go get your pioneering group and you kind of want them to all be the same because you've only got so many resources. That diversity quotient only comes in when we're trying to broaden the customer

44:32

base. It's literally the second half before that even becomes a question. Absolutely. Where it's like, we are now limiting our market by not being broad enough. And if you think about the people who are going to pay for ChatGPT, it's often not going to be, you know, a diverse group. There's a whole bunch of this world that is not going to be able to pay for ChatGPT. It's already priced at the monthly income of a true

44:59

non trivial chunk of this world. So, uh, and that's a whole other can of worms, right, is that, you know, we already have multinationals where pricing varies country by country, and that we could get there. I don't know that that's true. And I would also say we're arguing over a product that's, I think, yet to show profound value. Yeah, it's certainly created profound hype, but show me that, you know, the

45:23

company with a huge competitive advantage because they've been using this technology. Lots of people are talking about it. They're certainly using excuses to lay people off over it. But you know, I want to see two quarters after that and say, did you actually get better? Were your customers happier? Like, I don't know that any of us know the answer to those

45:43

things yet. Yeah, I think there's going to be opportunities, especially in the sort of data science and machine learning consulting realm, to really speed up those consulting opportunities, right, to take them from something that, you know, you have to kind of train a new thing every time and tailor it to your client. But then to get ahead of some of that competition, say

46:05

hey, we can significantly cut down on that time. But for all the companies that don't necessarily have in house data science, machine learning, that sort of thing, you know, are they suddenly going to be able to pick up ChatGPT and have it make a big difference in that area? Yeah, I don't know about that. Yeah, I think it's a great question. And funny thing is, like, this is what Microsoft has been known for, is the commoditization of certain technologies, sure, making it very easy

46:35

to get into things that typically needed experts. You could use these tools without the experts, and you wouldn't get a superior experience. Like, arguably we'd constantly get an inferior experience, but perhaps good enough that it was worthwhile. Yeah. I think it goes back to them not anticipating how big it would get immediately, right? Yeah. I don't want to present the idea that anyone had a plan here because I don't think it's true. I think they're

47:07

all making it up as they go right now. The hundred million users wowed them, so they pointed it at Bing of all places, and Bing got a hundred million users, you know, like, this is a hundred million user machine. Let's go. Yeah, it's an interesting space right now. Well, and I'm excited to talk to you because you come from the science

47:32

side of this and the academic knowledge side of this. And I can see, because we could see each other while we're recording this, like, the productization of this stuff is the worst, I think, for someone from your space. It's like, you're doing what with my thing? Where? How? And literally it's like that sort of grotesque. And then they took the sledgehammer to it to get it to fit into the square, like just hammering it through a box to get

47:59

it to customers. There have been some musical artists who are demanding that, or who are offering, um, royalties or something like that, or demanding royalties. What's the story there, Richard? I'm confused about this, but I remember talking to you about it. I mean, there's certainly a question about, uh, yeah, rights to data, rights to the data used in the training

48:22

models. And you can see the thought that as a developer running these models, because they're all, you believe, they're all behind the scenes, so it doesn't really matter. I'm not taking a copy of your thing, I'm not using your thing per se. It's just being munged in a big neural net, like, why should I care? And everything was fine until DALL-E started spitting out Getty logos. Then it wasn't fine. Yes, right, this is

48:51

there was no intent here, there was no plan here. I think devs have always been doing what devs were doing until the logo started appearing in the pictures, and then it's like, huh. And I love that it was Getty because Getty is the greatest pursuer of copyright you've ever seen. Heaven help you if you use a Getty picture

49:14

without permission. Ah, it's all a big... I mean, it just opened up a lot of stuff that I think people have been talking about for a while, but the need wasn't so great that people had to really pursue, you know, some of these. I mean, we had to do some of that, right, in the Napster days, and then there's, you know, the Spotify days and all of that stuff. Everything's always changed and it's just one

49:39

more time that those things are gonna have to change. But just the speed with which it came out, it's going to take a little bit for the extra things to catch up. And be like, all right, we have been having this conversation for years, including the three of us. Yeah, like, in some ways this is a useful forcing function. Okay,

50:05

I've now finally made a thing that's showing up in the public gestalt. Now we've been talking about treating data maturely, respecting copyright, considering bias, like all of those things. Now ya gotta because we're now starting to stick it in front of regular people who don't know how problematic these data collections and utilization methods have been. Yeah, I mean that's what normally drives innovation, right,

50:30

every time, other innovation. There were no driver's licenses until cars started killing people. Yeah, true, right. Just hopefully we can get to a point of some moderation with this before folks die. I would rather not have folks die over this crazy bit of software. Yeah, crazies, right, man, people are going nuts. It's true. And the same thing comes to fruition on the environmental impact side, because these models are

51:00

very expensive and very impactful to train. Yeah, and now that they are out there and people are using them, that likely means that more of them are going to be trained, and more of them are going to be deployed, and it's like, okay, how are we going to then also mitigate that impact as well? Well, the one good thing that comes out of it is now people won't be complaining so much about how much water it

51:20

takes to grow a steak. That's true. I was thinking more in terms of, you know, just when crypto was winding down and giving back those resources, we found a new place to put them. Yeah, exactly, absolutely. Yeah. And if you want to make an AI person angry in a big hurry, just call it the new crypto. That's a good way to do it. Oh yeah, we will not get a Christmas card from that

51:45

person. That's not a happy thing to say. That's awesome. And I also think it's not true, like, it doesn't seem to be trying to separate people from their money quite as vociferously. But the bill, you know, the prices are starting to come, man. I guess we'll see what happens. Yeah, it's true. It'll be an interesting back half of the year. Yeah. So Amber, what's next for you? What's in your inbox? Ah, well, I am doing good things at Kauma.

52:13

I'm happy to be there. And for me, I'm finally just in a steady state, which is great, and looking forward to kind of just seeing where things go. I've been throwing a lot of energy into sort of

52:29

mentoring and growing the next data science leaders. I've noticed that they teach a lot about the data science and then they find people who are like, oh, well, you look like you could lead, we're going to throw you into leadership, but they don't do a lot of teaching about data science for business value or data science, you know, in the bigger

52:52

picture, and how to lead a team and that sort of thing. So I've been doing a lot of mentoring some of the younger data scientists, which I've been really enjoying. Wow, that's great. Well, yeah, you're doing the people's work for sure, because goodness knows, we need more knowledgeable people in this space. Lots of people using data, not enough data scientists. Yep, that's true. Yeah, well, Amber,

53:15

thank you so much for coming on the show and talking about this. It's not a topic that's going away soon, and I'm sure it's going to change by the next time we have you on. Absolutely. Thanks for having me. Great to have you. Yeah, you bet, and we'll talk to you, dear listener, next time on dot net Rocks. Dot net Rocks is brought to you by Franklin's Net and produced by PWOP Studios, a full service audio, video and post production facility located physically in New London, Connecticut, and

54:06

of course in the cloud online at pwop dot com. Visit our website at dot net rocks dot com for RSS feeds, downloads, mobile apps, comments, and access to the full archives going back to show number one, recorded in September two thousand and two. And make sure you check out our sponsors. They keep us in business. Now go write some code. See you next time.

Transcript source: Provided by creator in RSS feed
