What exactly is Open Source AI? (Changelog Interviews #578)

Feb 16, 2024 · 1 hr 17 min

Episode description

This week we're joined by Stefano Maffulli, the Executive Director of the Open Source Initiative (OSI). They are responsible for representing the idea and the definition of open source globally. Stefano shares the challenges they face as a US-based non-profit with a global impact. We discuss the work Stefano and the OSI are doing to define Open Source AI, and why we need an accepted and shared definition. Of course we also talk about the potential impact if a poorly defined Open Source AI emerges from all their efforts. Note: Stefano was under the weather for this conversation, but powered through because of how important this topic is.

Transcript

Welcome back, friends. This is the Changelog. I'm Adam Stacoviak, and this week we're joined by Stefano Maffulli, the Executive Director of the Open Source Initiative, the OSI. The Open Source Initiative is responsible for representing the idea and the definition of open source globally. Stefano shares the challenges they face as a US-based organization with a global impact.

We discuss the work Stefano and the Open Source Initiative are doing to define open source AI, and why we need an accepted and shared definition. Of course, we also talk about the potential impact if a poorly defined open source AI emerges from their efforts. I also want to mention that Stefano was feeling under the weather for this conversation, but he powered through because of how important this topic is.

A massive thank you to our friends and our partners at Fly.io, the home of Changelog.com. It's simple: launch apps near users. They transform containers into micro VMs that run on their hardware in 30-plus regions on six continents. Launch an app for free at fly.io. What's up, friends? This episode of the Changelog is brought to you by our friends over at Vercel, and I'm here with Lee Robinson, VP of Product.

Lee, I know you know the tagline for Vercel — Develop, Preview, Ship — which has been perfect, but now there's more after the ship process. You have to worry about security, observability, and other parts of just running an application in production. What's the story there? What's beyond shipping for Vercel? Yeah, you know, when I'm building my side projects or when I'm building my personal site, it often looks like Develop, Preview, Ship. You know, I try out some new features.

I try out a new framework. I'm just hacking around with something on the weekends. Everything looks good. Great. I ship it. I'm done. But as we've talked to more customers, as we've grown as a company, as we've added new products, there's a lot more to the product portfolio of Vercel nowadays to help beyond that experience. When you're building larger, more complex products, and when you're working with larger teams, you want to have more features, more functionality.

So, tangibly, what that means is features like our Vercel Firewall product, to help you be safe and have that layer of security; features like our logging and observability tools, so that you can understand and observe your application in production — understand if there are errors, understand if things are running smoothly, and get alerted on those.

And also, really, an expansion of our integration suite as well, because you might already be using a tool like Datadog, or you might already be using a tool at the end of the software development lifecycle that you want to integrate with to continue to scale and secure and observe your application. And we try to fit into those as well. So we've kind of continued to bolster and improve the last mile of delivery. That sounds amazing.

So, who's using the Vercel platform like that? Can you share some names? Yeah, I'm thrilled that we have some amazing customers like Under Armour, Nintendo, The Washington Post, and Zapier, who use Vercel's frontend cloud to not only help scale their infrastructure, scale their business and their product, but also enable their team of many developers to iterate on their products really quickly and take their ideas and build the next great thing. Very cool.

With zero configuration for over 35 frameworks, Vercel's frontend cloud makes it easy for any team to deploy their apps. Today, you can start with a 14-day free trial of Vercel Pro, or get a customized enterprise demo from their team. Visit vercel.com/changelogpod to get started. That's V-E-R-C-E-L dot com slash changelogpod. Well, Stefano, it's been a while. Actually, never, which is a good thing, I suppose, but now we're here. Fantastic.

We were at All Things Open recently, and we tried to sync up with you. We missed the message, and so we had to get you on the podcast. Obviously, this show, the Changelog, was born around open source. I find it strange and sad that we've never had anybody from the Open Source Initiative on this podcast. I'm glad you're here to change that, so welcome. Thank you for having me. It's a pleasure. Sorry we missed each other at All Things Open in North Carolina. It was a great event. Oh, man. We love All Things Open.

We love Todd and their team there. We think All Things Open is the place to be at the end of the year if you're a fan of open source, if you're an advocate of open source, and just the way that it's permeating all of software — it's won. Open source has won, and now we're just living in a hopefully mostly open source world. Absolutely. I mean, just last week an article was published that estimated the value of open source software as a whole. The numbers are incredible.

The researchers from Harvard Business School went and looked at the value of open source as it is consumed or produced, and put dollar numbers on it. I envy those people, because I don't know how. I'm not an analyst. Jerod, maybe you're somewhat of an analyst, right? You have an analytical brain, from how I know you. Okay. I don't know how you would quantify the value of open... I know it's quite valuable, but literally, how do you quantify the value of open source?

What do they do? What are the metrics they key off of? They count the lines of code. They counted the hours — they estimated the hours that it would take to rewrite from scratch all the software that is in use. They used data sets that are already available with some of those counts. Having those two data sets, they estimated the cost it would take to replicate all of the open source software that is available. They put the number at around $8.8 trillion.

Wow. I would actually just say all the dollars, really. Personally, I would just say all the dollars. Yeah. I mean, it's a huge number. All the dollars. Doesn't every dollar today really depend on open source at some layer? It really could be just all the dollars. It's an impressive number. It's really hard to picture how big it is. I had to look it up. It's three times as much as Microsoft's market cap. It's larger than the whole of the United States budget.

Like, the 2023 budget of the United States, and that includes Medicare. It's hard to believe. Medicare and everything — 6.3 trillion. Yeah. A lot of trillions there. Right. More trillions than I've got, Jerod, of anything. Yeah, I haven't got trillions of anything, really. Maybe not even in cents. Have I got a trillion cents? I don't think so. Not even in a bucket. I'd have to ask Siri to tell me. Turn those in at the bank and see what they'll give you. That's fun to think about, really.

Well, hearing a number like 8.8 trillion, I start to think, why don't you round that up to nine? And then I realize that's like a fifth of a trillion dollars if you're going to round it. That's a lot of money to round. That is a nice rounding error in your favor if it was your own dollars, right? Oh, yeah. You wouldn't mind that. For sure. Yeah, round it off. Hand it out to some folks. Hand it out to some maintainers. That would be nice.

Yeah. Well, I don't know if everybody listening to this podcast will be — I think a lot of them will be. But in light of recent feedback, Jerod, I don't want to assume that our listenership is super informed on what the Open Source Initiative is. I can kind of read from the about page, definitely, but I'd prefer that you give us a taste of what the OSI is really about. What is the organization? It's a 501(c)(3), you know, a public benefit corporation in California.

But what exactly is the Open Source Initiative, for all that value we just talked about? What is it? Oh, yeah. In a nutshell, we are the maintainers of the open source definition. The open source definition is a 10-point checklist that has been used for 26 years — we celebrated 25 years last year. It's the checklist that has been used to evaluate licenses.

That is, legal documents that come together with software packages, to make sure that the software comes with the freedoms that can be summarized as the four freedoms that come from the free software definition. It is the freedom to use the software without having to ask for permission; the freedom to study it, to make sure that you know and understand what it does and what it's supposed to be doing, and nothing else — and for that, you need access to the source code.

And then the freedom to modify it, to tweak it, to increase its capabilities or to help yourself. And the freedom to make copies, whether for yourself or to help others. Those freedoms were written down in the '80s by the Free Software Foundation, and the Open Source Initiative started a couple of decades after that, picking up the principles and spreading them in a somewhat more practical way.

At a time when a lot of software was being deployed and powering the internet, basically, this definition and these licenses give users and developers clarity about the things that they can do; it provides that agency and independence and control. And all of that clarity is what has propelled and generated that huge ecosystem that is worth 8.8 trillion. So who formed the initiative, and then how did it sustain and continue?

Seems like the definition is pretty set, but what is the work that goes on continually? Yeah, well, the work that goes on continuously, especially recently, is policy — the monitoring of policy work and everything that goes around it.

The concept of open source seems to be set, but it's constantly under threat, because the evolution of technology, changes in business models, and the rise in importance and power of new actors constantly shift and tend to push the definition of open source — the meaning of open source — in different directions. And regulation also tends to introduce hurdles that we need to be aware of.

As for what the organization does, we have three programs. One is called the legal and licenses program. That's where we maintain the definition, we review new licenses as they come up for approval, and we also keep a database of licensing information for packages — because developers don't always use the right words or they miss some pieces, and a lot of packages don't have the right data — so we maintain the community that maintains this project called ClearlyDefined.

On the policy front — that's another program, the policy and standards program — we monitor the activity of standard-setting organizations and the activity of regulators, in the United States and Europe mostly, to make sure that all the new laws and rules and standards can be implemented with open source code, and that regulation doesn't stop or block the development and distribution of open source software.

And then the third program is advocacy and outreach, and those are the activities we do maintaining the blog, handling communication, running events. In this program, we're also hosting the conversations around defining open source AI, which is a need that came up especially a couple of years ago and has been rapidly getting hotter for us.

So we were basically forced to start this process, because AI is a brand new system, a brand new activity; it forces us to review the principles to see if they still apply, how they need to be modified, or whether they can apply to AI systems as a whole. And we are a charity organization, you mentioned that. So our sponsors are individuals who donate to become members, and they can donate any amount from $50 a year up to whatever. We have a few hundred of those, almost a thousand.

And then we have corporate sponsors who also give us money — donations — to keep this work going. It's in their interest to have an independent organization that maintains the definition. And having multiple of these corporate donors makes the organization stronger, so we don't depend on any single one of them individually. So despite the fact that we get money from Google or Amazon or Microsoft and GitHub, we don't have to surrender our agency to them.

Do you also defend the license, so far as going to court with people who would misuse it, or no? It hasn't happened — I mean, not under my watch — but we do have experts on our board and in our circle of licensing experts; we do have lawyers who go to court constantly to defend the licenses and the trademarks and protect the producers. And there they act as expert witnesses. Exactly.

And we do provide — we have provided — briefs for courts, opinion pieces for regulators, and responses to requests for information on various pieces of legislation. How challenging is it to be a US-founded, US-based organization that represents and defends this definition, which really — you know, going back to the trillions, I mean, all the money, all the dollars — is a world concern, not just a United States concern? How does this organization operate internationally?

Which challenges do you face as a US-based nonprofit that is representative of the idea of open source, which really impacts everyone globally? Yeah, that's a very good question, and it is quite challenging. So I started at the organization only a little over two years ago, and I'm Italian, so I do have connections to Europe and knowledge about Europe. We have board members that are based in Europe and other board members in the United States.

And it is actually quite challenging to be involved in these global conversations, because now, a little bit like maybe in the late '90s, open source is increasingly at the center of geopolitical challenges — not because of open source per se, but because software is so incredibly pervasive, and most of the software that exists is open source.

So there have been a lot of challenges — the trade relationships with other actors like Russia and Ukraine, now the war in Israel and Gaza, and the trade wars between China and the United States. There are a lot of geopolitical issues that we are at the center of, and we're finding it really complicated. In fact, we have raised money to increase our visibility on the policy front.

Right now, at the moment, we have two people working on it — one in Europe and one more focused on the United States; both of them are part-time — but we do have budget to hire at least one more, if not two, policy analysts to help us review the incredible amount of legislation that is coming, and that's just talking about the United States.

I guess even one more layer than that: I don't know if it's a self-appointed defendership of the term open source — I understand where it came from to some degree — and I wonder, how do you all handle the responsibility of not so much owning the term open source as a trademark, but defending it? So in a way, you kind of own it by defending it, because you have to defend it. It's some version of responsibility, which is maybe a byproduct of ownership, right?

There's a pushback happening out there. There's even a recent conversation where somebody can't describe their software as open source because the term means something, and we all agree on that, right? We understand that. I'm not trying to debate that, but how do you operate as an organization that defends this term? Yeah, I mean, this is really funny, because we don't have a trademark on the term open source for software.

We have a soft power that is given to us by all the people who, just like you said, have recognized that the term open source is what we define — we maintain the definition. It's kind of recursive, if you want, but corporations, individual developers, other institutions like academia, researchers — they recognize that open source means exactly that: the list of licenses, those ten points, which embody the four freedoms that are listed.

And we maintain that, and it has become quite visible, even in courts, where they do understand it. For example, there was a recent case involving the company Neo4j, and that litigation is quite complicated and entrenched — I'm not a lawyer, I'm not going to dive into the legal details.

But the one key takeaway that is easy for me to communicate is that the judge recognized that the value of open source is in the definition that we maintain, and that calling something open source when it is not under a license we have approved is false advertising. And that held up in court. Oh, yeah. And so what would you say to people who are perhaps — maybe nonchalant isn't the best word, but — unimpressed by open source as a definition?

And they think it's stodgy and tight, and the thing that they're doing is close enough, and they like the term, so they're going to use the term — and they've got open-ish code, or source available, or business source. Because there are a lot of people who are kind of pushing, not just against the definition itself, but against the idea that we need a definition, or that you guys get to have the definition. What do you say to them? Yeah. You know, they're self-serving.

They try to be self-serving, and they're trying to destroy the commons that way. Quite visibly, I think that users see through them. And it's not even in their interest, but you know how it works. Sometimes corporations, in their greed, care only about the next quarter. And who cares about what happens next? You know, maybe the next CEO will have to take care of it; meanwhile, they're just going to laugh all the way to the bank.

And that is the approach that I see in many of these people who complain, or who try to redefine open source because it doesn't serve their purpose — what we maintain doesn't fully serve their purpose. So instead of respecting the commons and the shared ideas, they, you know, act like bullies and find all sorts of excuses to redefine it. We've seen it happening. I've been in free software and open source for most of my career, since I was in my 20s.

And I've seen what was happening in the early days with the proprietary Unix guys that were going around telling us, "This Linux thing is never going to work. You're joking. You're giving it away." Then they started to get scared and started saying, "Yeah, you're giving away your crown jewels. Why are you doing this? You're depriving us of our livelihood; our families are going to be begging on the street." I remember having these conversations with sales guys back then.

And Microsoft, you know, coming up with a program in the early 2000s, the Shared Source program, because they just could not wrap their heads around the thought that you could make money sharing your source code. But they were forced by the market to show at least a little bit of what was happening behind the scenes when they were doing deals. So we've seen it already.

They're going to keep on going like this, but there is plenty of interest — plenty more forces on the other side — to maintain it, to keep the bar straight, to keep going where we're going, because that clarity is such a powerful instrument: to be able to say, this is open source, therefore I know what I can do, I know what I cannot do, and I know how that collaboration is structured.

You know, the legal departments, the compliance departments, the public tenders — they all tend to have very clear and speedy review processes. Instead, if everyone has a different understanding of what open source means... yeah, we go back to the brand, right? I mean, in Italy now, I'm surprised to see a lot of Starbucks stores opening, and I'm absolutely baffled. Why is this happening? This country has a place with decent coffee on every corner.

Why do you need a brand? Because people have been traveling the world; they see the brand, they recognize it, they know what they're going to get, and they go in there. And it's the same with open source. What's up, friends? This episode is brought to you by our friends at Synadia. Synadia is helping teams take NATS to the next level via a global, multi-cloud, multi-geo, and extensible service.

Fully managed by Synadia, they take care of all the infrastructure, management, monitoring, and maintenance for you, so you can focus on building exceptional distributed applications. And I'm here with VP of Product and Engineering, Byron Ruth. So Byron, in the NATS versus Kafka conversation, I hear a couple different things. One I hear out there: "I hate Kafka with a passion" — that's quoted, by the way, on Hacker News. I hear "Kafka is dead, long live Kafka."

And then I hear Kafka is the default, but I hate it. So what's the deal with NATS versus Kafka? Yeah, so Kafka is an interesting one. I've personally followed Kafka for quite some time, ever since the LinkedIn days. And I think what they've done in terms of transitioning the landscape to event streaming has been wonderful. I think they definitely were the first to market for persistent data streaming. However, over time, as people have adopted it — they were the first to market, they provided a solution.

But you don't know what you don't know, in terms of: you need this solution, you need this capability. But inevitably, there's also all this operational pain and overhead that people have come to associate with Kafka deployments. Based on our experience and what users and customers have come to us with, they would say: we are spending a ton of money on a team to maintain our Kafka clusters, or on managed services, or something like that.

The paradigm of how they model topics, how you partition topics, and how you scale them is not really in line with what they fundamentally want to do. And that's where NATS can help, with what we refer to as subject-based addressing, which is a much more granular way of addressing messages, sending messages, subscribing to messages, and things like that — very different from what Kafka does.
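(For readers who want a rough picture of the subject-based addressing and JetStream persistence Byron describes, here is a minimal sketch using the Go client, github.com/nats-io/nats.go. The subject names and stream configuration are hypothetical, invented for illustration, and a NATS server is assumed to be running locally on the default port.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to a locally running NATS server (default URL: nats://127.0.0.1:4222).
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Subject-based addressing: subscribe with a wildcard instead of
	// pre-partitioned topics; "orders.*.created" matches any region.
	if _, err := nc.Subscribe("orders.*.created", func(m *nats.Msg) {
		fmt.Printf("received on %s: %s\n", m.Subject, string(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	// JetStream layers persistence on top of core pub/sub.
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Capture every subject under "orders." into a persistent stream
	// so later consumers can replay messages.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
	}); err != nil {
		log.Fatal(err)
	}

	// A published message is delivered to live subscribers and also
	// stored in the ORDERS stream.
	if _, err := js.Publish("orders.eu.created", []byte("order 42")); err != nil {
		log.Fatal(err)
	}
}
```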

And the second that we introduced persistence with our JetStream subsystem, as we refer to it, a handful of years ago, we literally had a flood of people saying, can I replace my Kafka deployments with this NATS JetStream alternative? And we've been getting constant inbound, constant customers asking, hey, can you enlighten us on what NATS can do?

And oh, by the way, here are all these other dependencies, like Redis and other things, and some of our services-based things, that we could potentially migrate and evolve over time by adopting NATS as a core technology in our systems and platforms. So this has been largely organic. From day one with our persistence layer, JetStream, the intention was never to say we're going to go after Kafka.

But because of how we layered the persistence on top of this really nice pub/sub core NATS foundation, and then we promoted it and said, hey, now we have the same semantics, the same paradigm, with these new primitives that introduce persistence in terms of streams

and consumers — the floodgates opened, and everyone was frankly coming to us wanting to simplify their architecture, reduce operational costs, and get all of these other advantages that NATS has to offer that Kafka does not whatsoever, nor do any of the other similar offerings out there. So there's someone out there listening to this right now. They're the Kafka cluster admin, the person in charge of whether this cluster goes down or not.

They manage the team, they feel the pain, all the things. Give them a prescription — what should they do? What we always recommend is that you go to the NATS website, download the server, look at the clients, and model a stream; there are some guides on doing that.

We also have — Synadia has provided, basically, a packet of resources to inform people, because we get, again, so many requests about how to compare NATS and Kafka, and we're like, let's actually just put a thing together that can inform people on how to compare and contrast them.

So we have a link on the website that we can share, and you can basically go get that set of resources. It includes a very lengthy white paper from an outside consultant that did performance benchmarks and things like that, and discusses the different trade-offs that are made; they also do a total cost of ownership assessment between organizations running Kafka versus running NATS for comparable workloads. Well, there you go. You have a prescription.

Check for a link in the show notes to those resources. Yes, if Kafka is not cutting it, try NATS, powered by the global, multi-cloud, multi-geo, and extensible service that is fully managed by Synadia. It's the way of the future. Learn more at synadia.com/changelog. That's s-y-n-a-d-i-a dot com slash changelog. So last year, around this time, Meta released Llama, their large language model, to much fanfare and applause, and they announced it as open source.

We know a lot has transpired since then, but at the time, what was your response to that, even personally or as the Executive Director of the OSI? Like, what were you thinking? What were you doing in the wake of that announcement? Well, we were already looking at open source AI in general. We were trying to understand what this new world meant and what the impact was on the principles of open source as they applied to the new artifacts being created in AI.

We had already come to the conclusion that open source AI is a different animal than open source software. There are many, many differences. So immediately — two years ago, over two years ago — one of the first things that I started was to really push the board and the community to think about AI as a new artifact that required and deserved a deep understanding and a deep analysis, to see how we could transport the benefits of open source software into this world.

And so Llama 2 kind of cemented that idea. It is a completely new artifact, because sure, they have released a lot of information and a lot of details, but, for example, we don't know exactly what went into the training data. And Llama 2 also came out with a license that really has a lot of restrictions on use.

Having restrictions on use is one of the things that we don't like — I mean, the open source definition forbids it, no matter how few the restrictions are. And you know, at surface value, the license for Llama 2 seems innocent, right? One of the clauses says, well, you cannot use Llama 2 for commercial applications if you have more than a few hundred million — I don't remember exactly how many — monthly active users.

Okay, you know, maybe that's a fair limitation. But in my mind I was like, so what does it mean — that the government of India cannot use it? The government of Italy, maybe, if you want to embed this into something? So that's already an exclusion, and you have to start thinking about it.

You have to start thinking about what happens when you get to that many users, when all of a sudden you have to lawyer up and completely change your processes. And then there are a couple of other restrictions inside that license. They seem innocent on the surface, but there's even more when you start diving deeper — like, you cannot do anything illegal with it. Okay. All right.

So, let's say I help someone decide whether they can or should have an abortion; or say you want to have this tool used in applications that help me, I don't know, get refugees out of war zones into another place, and maybe I'm considered a terrorist organization by the government there. So whether we're doing something illegal depends on whose side you're on — on who gets to evaluate that.

It's these licensing terms that the Open Source Initiative really doesn't think are useful or valuable; they should not be part of a license, they should not be part of a contract in general, and they need to be dealt with at a separate level. So that's what I was looking at. I was like, oh, Llama 2 — oh my God, it's not open source, because clearly this license would never pass our approval. And at the same time, we don't even know exactly what open source AI means.

Why are you polluting the space? So I got really upset. Yeah. So then do you spring into action? Like, what does the OSI do? You're the defenders of the definition, and here's a huge public misuse. Do you write a blog post? Do you send a letter from a lawyer? What do you do? Luckily, we were already into this two-year process of defining open source AI.

So actually, I was already in conversations with Meta to have them join the process and support the process to find a shared definition of open source AI. And in fact, they're part of this conversation.

The conversation I'm having is not just with corporations like Google, Microsoft, GitHub, Amazon, et cetera; we also invited researchers in academia, creators of AI, experts in ethics and philosophy, and organizations that deal with openness in general — open knowledge, open data — like Wikimedia, Creative Commons, the Open Knowledge Foundation, and the Mozilla Foundation.

And we're also talking with a bunch of experts in ethics, as well as organizations like digital rights groups — the EFF and other organizations around the world — who are helping in this debate. We had to first go through an exercise to understand and come to a shared agreement that AI is a different thing than software.

And we went through an exercise to find the shared values that we want represented, and why we want the same sort of advantages that we have for software ported over to AI systems. Then we identified the freedoms that we want to see exercised. And now we're at the point where we are making the list of components of AI systems, which is not as simple as with software — binary code, compiler, and source code. It's not as simple as that.

It's a lot more complicated. So we're building this list of components for specific systems. And the idea is, by the end of spring or the summer, to have the equivalent of what we have now as a checklist for legal documents for software — the equivalent for AI systems and their components — so that we will basically have at least a candidate for an open source AI definition.

You mentioned that, and there's — I think you posted this eight days ago — a new draft of the open source AI definition, version 0.0.5, available. I'm going to read from what I think you might be alluding to, which is exactly this: what is open source AI? It links out to the HackMD document.

It says: what is open source AI? To be open source, an AI system needs to be available under legal terms that grant the freedoms to: one, use the system for any purpose and without having to ask for permission; two, study how the system works and inspect its components; three, modify the system for any purpose, including to change its output; and four, share the system for others to use, with or without modifications, for any purpose.

So those seem to be the four hinges that this "what is open source AI" hinges upon, at least in its current draft. Is that pretty accurate, considering it's recent — eight days ago? Yeah, those are the four principles that we want to have represented. Now, the very crucial question is: what are the components? You are familiar with the four freedoms for software, those set by the Free Software Foundation in the late eighties.

Those freedoms have one little sentence attached to them — to the freedom to study and the freedom to modify; both say that access to the source code is a precondition for this. That little addition is meant to clarify the fact that if you want to study a system, if you want to modify it, you need to have a way to make modifications to it, and that it's the preferred form to make modifications from a human perspective.

It's not that you give me a binary and then I have to decompile it or try to figure out how it works through reverse engineering. Give me the source code — I need the source code in order to study it. For AI systems, we haven't yet found a shared understanding, or a shared agreement, on what one needs to have access to — the preferred form to make modifications to an AI system. That's the exercise we're running now. Yeah.

That's interesting — the preferred form of modification. Really interesting. Like you said, you don't want to be given a binary and expected to reverse engineer it. That's possible, right? That's possible, maybe, for a small subset of people, but it's not the preferred route to get to Rome. It's just like, that's not the route I want to go down, right? I want a different way.

And you want to have a simple way. So, you know, some licenses even have more specific wording around defining what source code actually means — the GNU GPL is one of those, with very clear descriptions and prescriptions about what needs to be given to users in order to exercise those freedoms, their rights as users.

So for AI — yeah, for AI it's complicated, because there are a few new things for which we don't even have court cases yet. You know, I keep repeating the same story: when software came out for the first time, when it started to come out of the research labs and started to become a commercial artifact that people could just sell, there was a conscious decision to apply copyright to it. It was not a given that it was going to be covered by copyright law.

That decision — whether it was a lucky one, honestly, or a well-thought-out one, I don't know which of the two — matters, because copyright as a legal system is very similar across the world, and building the open source definition, the free software definition, and the legal documents that go with open source software and free software on top of copyright means that they're applied very, very similarly pretty much everywhere around the world.

The alternative at the time — there were conversations around treating software as an invention and therefore covered by patents. Patent law is a whole different mess around the world: it differs in application, it has different terms, and it's much more complicated to deal with. So for AI we're pretty much at the same stage, where there are some new artifacts, like the model — after you train a model, that produces the weights and parameters that go into the model.

For those models, it's actually not clear what kind of legal frameworks apply, and we might be at the same moment in history where we have to imagine and think — maybe suggest and recommend — what the best course of action would be: whether it makes sense to treat them as copyrightable artifacts, or nothing at all, or inventions, or, you know, covered by some other exclusive right.

And the same goes for the other big conversation that is happening already, but for which I don't have a clear view of where it's going to end: the conversations around the right to data mining. If you follow the lawsuits — OpenAI being sued by The New York Times, Stability AI being sued by Getty Images, GitHub being sued by anonymous developers, et cetera, et cetera — a lot of those lawsuits hinge on what's happening: why are these powerful corporations going around crawling the internet, aggregating all of this information and data that we have provided and uploaded? We — society, some commercial actors, some non-commercial actors — have created this wealth of data on the internet, and they're going around taking it and basically making it proprietary, building models that they keep for themselves. And on top of that, you can already start seeing, like, oh my god, they're eventually going to make a lot of money out of the things that we have created.

Or even more scary — sometimes I think about this myself — I've been uploading my pictures for many years without thinking too much, so there is a lot of data out there. I'm sure that someone has built a database of my pictures as I was aging, and now those pictures can be used — could be used — by a new government or a new actor to recognize me on the streets at any time, and I don't know how. Of course, is that fair? Is that not fair? Those are big questions, and there isn't an easy, simple answer.

Yeah. So, did you enumerate — and I missed it — or can we enumerate the components that you have decided so far are part of an AI system? The code, I heard; the training data; et cetera. Yeah, there are three main categories, maybe four.

One is the category of data, one is the category of code, another category is models, and there is a fourth category that goes into other things, like documentation — for example, instructions on how to use it, or scientific papers. In the data part go some of the components like the training data and the testing data. In the code part goes the tooling — the architecture, the inference code to run the model, anything that is written by a human in general; you also have in there the code to filter and set up the data sets and prepare them for the training. And then in the models you have the model architecture, the model parameters — including weights, hyperparameters, and things like that; there might be intermediate steps produced during the training. And the last bit is documentation: how-tos, sample outputs.

So there is an initial list of all of these components. We worked with — or actually, the Linux Foundation worked on creating this list, specifically for generative AI and large language models, and we're working with them. I mean, we're using this as a backdrop, or as a starting point, to move this conversation forward. Now, the question that we need to ask, having this list — and if you go to draft five, you will see a basically empty matrix of these components; there are 16, if I remember correctly, or 17 — is, for each component, on the row next to it, there is a question: do I need this component to run it — I mean, to use it? Do I need it to copy it? Do I need it to study it? Do I need this component to modify the system?
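(To make that matrix a little more concrete, here is a rough sketch in Go of how one might represent it. The component names are taken loosely from the conversation, and the answers are deliberate placeholders — filling them in is exactly the exercise the working groups are doing, so none of this reflects their actual conclusions.)

```go
package main

import "fmt"

// Freedom is one of the four freedoms in the draft Open Source AI definition.
type Freedom string

const (
	Use    Freedom = "use"
	Study  Freedom = "study"
	Modify Freedom = "modify"
	Share  Freedom = "share"
)

// requiredFor maps candidate components of an AI system to the freedoms for
// which access to that component would be required. The answers below are
// illustrative placeholders, not the working groups' decisions.
var requiredFor = map[string][]Freedom{
	"training data":          {}, // undecided: original data vs. description and samples
	"data preparation code":  {Study, Modify},
	"model architecture":     {Use, Study, Modify, Share},
	"model weights":          {Use, Study, Modify, Share},
	"inference code":         {Use, Study, Modify, Share},
	"documentation / papers": {Study},
}

func main() {
	// Print the matrix: one row per component, listing the freedoms
	// for which that component is (hypothetically) needed.
	for component, freedoms := range requiredFor {
		fmt.Printf("%-24s needed for: %v\n", component, freedoms)
	}
}
```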

And we're referring to "the system" — this is one of the important things: the open source definition refers to "the program," and the program is never defined, but we pretty much know what a program is. For AI — yeah, and again, this is a very complicated question; it looks very simple on the surface, but when you start diving a little bit deeper it becomes complicated, because what is an AI system, right?

So we started using the definition that has become quite popular in regulation around the world. It's work done by the Organisation for Economic Co-operation and Development, the OECD, and they have defined "AI system" in very broad terms. This definition is being used in many regulations — the United States executive order on AI, for instance; NIST also uses it; in Europe, the AI Act uses it, albeit with a very small, minor variation. It seems to be quite popular, but there are detractors, and indeed it is quite generic, too — sometimes, when you read it carefully, it could even cover a spreadsheet. It's really bizarre.

So let's say that, hypothetically, I'm like a medical company that has been working on a large language model, and I have proprietary data — like readings and reports and stuff that we've accumulated over years — and I create an LLM based on that data that ultimately can answer questions about medicine or whatever, and I want to open source that. I need to be able to make it so it's usable, studyable, modifiable, and shareable. And it seems like the training data — even though that's the most proprietary part, and perhaps the most difficult part to actually make available, or sometimes impossible — is necessary, not to use, but to study and modify, it seems like. So if I release the model, the code, all the parameters, everything we used to build the model — everything except for, like, the original source data — under what you're currently working on, that would not be open source AI, would it?

Honestly, that is a very good case example of why I think we need to carefully reason around what exactly I need in order to study it. What kind of access, what sort of access do I need? Is it the original data set? Because if it is the original data set, then we're never going to have an open source AI. Right, that's where I'm getting to. Yeah, it's not going to happen. It's not going to happen, yeah. So maybe — and this is why we're working on the policies that are out there — maybe what we need is a very good description of what that data is, maybe samples, maybe instructions on how to replicate it. Because, for example, there might be data that is copyrighted; you might have the right, under fair use or under the different exclusions of copyright, to create a copy and create a derivative — like, I run the training — but not to redistribute it. If you redistribute, then you start infringing. So I think we need to be thinking carefully about that also.

And the reason why I became more and more convinced that we don't need the original data set is because I've seen wonderful mixing, wonderful remixing of models, even splitting of models and recombination of models, creating whole new capabilities — new AI capabilities — without having to retrain a single thing. So I'm really starting to believe that the AI weights — the machine learning weights and the architecture — are not like binary code; it's not a binary that you have to reverse engineer. If you have sufficiently detailed instructions on how it's been built and what went into it, you should be able — you might be able — to create new systems, reassemble it, study how it works, execute it, and modify it. So the preferred form to make modifications is not necessarily going through the pipeline, or rebuilding the whole system from scratch, which for many reasons may be impossible.

I do like the idea of a small subset of the data set, you know, that's anonymized or sanitized in some way, shape, or form — like, this is the acceptable sample amount required for the study portion or the modification portion. Yeah.

You know, it could be the schema, for example; it could be "provide your own data in here," if you can — and you can obviously find other ways to use artificial intelligence to generate more data, so that's a whole thing, right? But I feel like that's acceptable to me. Yeah, to provide some sort of sampling, or, as you said, the schema — I think that makes sense to me. Yeah.

Yeah, the research is also going in this direction, with data cards and model cards, lots of metadata specifications. I do think that might be a valuable option. I would love to have it — I mean, we'll see in the next few weeks and months how that conversation goes — but I do believe that's one way we can get out of this process with a definition that is not just an ideal, something beautiful that you put up in a picture in a museum and nobody can do anything with. It needs to be practical. Like I keep repeating, the open source definition had success because it enabled something practical, and it has success because other people have adopted it, other people have decided to use it. If you keep on insisting from your pedestal that people should do this and that, it may not find a crowd that follows it, right? Yeah, and then if no one's using it, what's the point, right? What's the point? You've lost the thread.
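(As a loose illustration of the "schema and samples instead of the full data set" idea — the data card / model card direction mentioned above — here is a hypothetical sketch in Go. The struct fields and the example values are invented for illustration and do not correspond to any actual specification.)

```go
package main

import "fmt"

// DataCard is a hypothetical, minimal description of a training data set that
// could accompany a model in place of the raw data itself: a schema, a tiny
// sanitized sample, and instructions for replication.
type DataCard struct {
	Name        string   // human-readable name of the data set
	Description string   // what the data contains and how it was collected
	Schema      []string // field names and types, so others can supply their own data
	SampleRows  []string // a small, anonymized sample rather than the full corpus
	Replication string   // how an equivalent data set could be rebuilt
	License     string   // terms covering the card and the samples
}

func main() {
	card := DataCard{
		Name:        "clinical-notes-demo",
		Description: "De-identified clinical notes and reports accumulated over several years (illustrative).",
		Schema:      []string{"patient_age:int", "note_text:string", "diagnosis_code:string"},
		SampleRows:  []string{`47,"presents with elevated blood pressure...",I10`},
		Replication: "See the accompanying paper for collection and filtering criteria.",
		License:     "CC-BY-4.0 (applies to the card and samples only)",
	}
	fmt.Printf("%+v\n", card)
}
```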

what's up friends I'm here with one of my new friends Zane Hamilton from C.I.Q so Zane we're coming up on a hard deadline with the centa's end of life later this year in July and there are still folks out there considering what their next move should be then last year we had a bunch of change

around red hat enterprise Linux that makes it quote less open source in the eyes of the community with many saying rello's open source but where is the source and why can't I download and install it now rocket Linux is fully open source and C.I.Q is a founding support partner that offers paid

support for migration installation configuration training etc but what exactly does an enterprise or a Linux to sab and get when they choose the free and open source rocky Linux and then ultimately the support from C.I.Q if they need it there's a lot going on in the enterprise Linux space today

There's the end of life of CentOS; people are making decisions on where to go next. The standard of what an enterprise Linux looks like tomorrow is kind of up in the air. What CIQ is doing is trying to help those people that are going through these different decisions that they're having to make, and how they go about making those decisions. And that's where our expertise really comes into play — a lot of people here have been through very complex Linux migrations, be it from the old days of migrating from AIX or Solaris onto Linux, or even going from version to version, because, to be honest, enterprise Linux version to version has not always been an easy conversion. It hasn't been, and you will hear that from us. Typically the best idea is to do an in-place upgrade — not always a real easy thing to do. But what we've done is we have started looking at and securing a path for how we can actually go through that. How can we help a customer who's moving from CentOS 7 because of the end of life in July of this year? What does that migration path look like, and how can we help? And that's where we're looking at ways to help automate.

From an admin perspective, if you're working with us — we've been through this — we can actually go through and build out that new machine and do a lot of the back-end manual work for you, so that all you really have to do at the end of the day is validate your applications are up and running in the new space, and then we automate the switch-over. So we've worked through a lot of that. There are also the decisions you're making around "I'm paying a very large bill for something I'm not necessarily getting the most value out of; I don't want to continue down that path." We can help you make that shift over to an open source operating system, Rocky Linux, and help drive what's next — help you be involved in a community, and help make sure that the environment you have is stable and is going to be validated by the actual vendors that you're using today. And that's really where we want to be a partner, not just from an end-user perspective but from an industry perspective. We are working with a lot of those top-tier vendors out there on certifying Rocky,

making sure it gets pushed back to the RESF, and making sure we can validate that everything that needs to be there is there and secure, and helping you on that journey of moving. And that's where we at CIQ really show our value on top of an open source operating system: we have the expertise, we've done this before, we're in the trenches with you, and we're defining that path of how to move forward. Okay, ops and sysadmin folks out there, what are you choosing? CentOS is end of life soon — you may be using it — but if you want a support partner in the trenches with you, in the open source trenches with you, check out CIQ, the founding support partner of Rocky Linux. They've stood up the RESF, which is the home for open source enterprise software — the Rocky Enterprise Software Foundation, that is. They've helped to orchestrate OpenELA, a collaboration created by CIQ, Oracle, and SUSE. Check out Rocky Linux at rockylinux.org, the RESF at resf.org, and of course, if you need support, check out our friends at CIQ at ciq.com.

Fully acknowledging that it's a work in progress and you're not done — given your current mental model of the definition as it is working, are there systems out there today that you would rubber-stamp and say, like, this is open source AI? I'm thinking perhaps of Mistral; they have a bunch of stuff going on, and they're committed to open and transparent, but I don't know exactly what that means for them. Have you looked at anything? Do you have, like, things you're comparing against as you build, to make sure that there's a set of things that exist, or could exist, that are practical?

Not yet. I know that there is — we have an affiliate organization called EleutherAI; they're a group of researchers, and they recently incorporated as a 501(c)(3) nonprofit in New York State. From the very beginning they've been doing a lot of research in the open, releasing data sets and infrastructure, and then research papers, models, weights, and everything like that. So I'm really leaning a lot on them to shine a light on how this can be done. But I don't want to be too restricted in my mind — they are very open, with an open science and open research mentality. I think that there can be an open source AI that is not necessarily equally open, but it can still practically have meaningful impact; it can generate that positive reinforcement of innovation, permissionless collaboration, et cetera. So yes, I lean on EleutherAI, but I'm also very open, and I'm sure there will be other organizations, other groups out there. As we go and elaborate more on what we actually need — what is the preferred form to make modifications to an AI system — we're going to discover more.

So, no open source AI yet? So there's no rubber stamp for anything out there currently? Well, I mean, I said I could rubber-stamp the EleutherAI stuff, but I don't want to say that that's necessarily the only thing, right? There's way more stuff out there. And again, those are the ones — the folks that I mention because I know how they work. Just yesterday, or the day before, OLMo was released by the Allen Institute, and that seems to be also quite openly available — source, models, weights, the science behind it, etc. I haven't looked at their licenses carefully, so I can't really tell; it might very well be an open source AI system. I'm trying to get to a definitive: really, is there or is there not a rubber-stamped open source AI out there yet? You know, I can tell you what is not — I mean, Llama 2 is not. Okay. OpenAI is not, for sure. All right — a deny list more than a permit list. Yes.

So I suppose one of the questions — which maybe is obvious, but I've got to ask it — is: what is the benefit? If I'm building a model and I'm releasing a new AI, what is the benefit of it being open source, of meeting this open source AI definition? Like, what is the benefit to its originator? And then obviously to humanity — I kind of get that. But like, what's the benefit? It's pretty easy to kind of clarify that with software, right? We see how that's working, because we've got, you know, 30 years of history, or more in a lot of cases — we've got track record there. We don't have track record here; it's still early pioneer days. What's the benefit? That is a very good question, and I don't know an answer for it. I mean, I do — I know the benefit for humanity, I know the benefit for the science of it.

And those benefits are really what triggered the internet. Like, if software had started to come out of the labs without the definition of free software, without the GPL license, without the BSD licenses, I don't think we would have had such a fast evolution of software and computer science. We would not have the internet that we see today if everyone had to buy a license for Solaris from Sun, from Oracle, etc., etc. — if, to build a data center, you had to go and call Sun Microsystems' or IBM's sales team before you could build it, instead of just using commodity boxes and slapping Linux and an Apache web server on them. We would have had a completely different history of the digital world — a completely different past. So I can see the benefit for society and science.

For some of these corporations, I'm assuming that they have made some of their calculations on stopping the competition, or fueling competitive advantages, or maybe the pure Silicon Valley approach: get more users, we'll figure out the business model later. There is some of that going on, likely — most likely — but I haven't had that conversation yet with any of the smart people I know about the business models behind this: the possible ways to privatize, or to find revenue streams and things like that, from these open source models.

Yeah. Do you think that they're becoming commoditized — if we specifically talk about these large language models, if we call AI that for now, recognizing it as an umbrella term for those other things it also represents — do you think that they are becoming commoditized, and will continue to, enough so that open source can keep up with proprietary in terms of quality, or even surpass it, just because of the number of people releasing things? I don't know — I'm asking honestly. What are your thoughts on it? Obviously... recently I saw this new system — a text-to-speech system, I think — and this team of developers from a company called Polybora built a system by splitting a system from OpenAI, or another one — I don't remember exactly — but they split that AI system, they took it and they flipped its inputs for outputs, and they attached another model of their own, trained with small data sets, and they built a brand new system for music, I think. I mean, this is the kind of stuff that is inspiring. At one point, I'm sure the quick evolution of this discipline will make it so that smaller teams with smaller amounts of data will be able to create very powerful machines.

And maybe the advantage for these large corporations that are now deploying, delivering, and distributing openly accessible AI models is, in their mind, having optimized hardware and cloud resources that they can sell — maybe that's where they're going. Well, there are many revenue streams they imagine it could be coming from. Yeah, that is exciting. I did see — I think it was CodiumAI — just recently announced a model that beats DeepMind on code generation, you know, according to benchmarks that I haven't looked at, as well as Copilot, and that's from a smaller player. I'm not sure if that's open or closed or what, but it is kind of pointing towards, like, okay, there's significant competition — and like you said, remixing, and the ability to combine and change, and even in some cases swap out and take the best results — that we will have a vibrant ecosystem of these things. And I think open source is the best model for vibrant ecosystems, so that rings true with me — not because that means it's right, but it sounds right.

Yeah, this is a tough one. This is really a tough nut to crack, really. I mean, even the forums you have — I believe you're calling it the Deep Dive, right? It's "Deep Dive: AI." And this is the place where you're hoping that many folks can come and organize. You say it's the global multi-stakeholder effort to define open source AI, and that you're bringing together various organizations and individuals to collaboratively write a new document, which is what we've been talking about, directly and indirectly. Who else is invited to this? Like, how does this get around? How do people know about this? Who is invited to the table to define, or help define? Is this being done in an open way? What is happening? Yeah — who's participating? At this point it's now public, so anyone can really join the forum and can join me in the biweekly town hall meetings. So that part is public and everybody is welcome to join.

Yeah. At this point it's now public, so anyone can really join the forum and can join me in the biweekly town hall meetings. That part is public, and everybody is welcome to join. We're going to keep going with public reports, and small working groups with people that we're picking, but only because of the agility we want to have in the collaborations. We're picking people that we know, or that we have been in touch with, coming from a variety of experiences. Say, we're talking to creators of AI in academia, large corporations, small corporations, startups, lawyers, people who work with regulators, think tanks and lobbying organizations. We're talking to experts in other fields, like ethics and philosophy, that we keep chatting with. We have identified six stakeholder categories, and we're trying to have representation that is also geographically distributed: North America, South America, Asia Pacific, Europe, Africa.

Last year we had conversations with about 80 people, representatives of all these categories, in a private group, just to get things kick-started, and we have had meetings in person, starting in June in San Francisco, and in July in Portland, and other meetings in Bilbao, in Europe. We had meetings in person with some of these people during different conferences. But starting this year, in the first half of the year, we're going to be super public. We're going to be publishing all the results of the working groups, and we're going to be taking comments on the forums. Then we're going to have an in-person meeting, we're aiming at maybe May or early June, with at least two representatives for each of the stakeholder categories, to get in a room and produce, you know, the latest release candidate of the definition. We'll go over the comments and come out of that meeting with a release candidate, something that we feel has endorsement from all of these different organizations across the world and across the categories.

Then we're going to use it, and we're laying out plans to have at least four events in different parts of the world between June and the end of October. One of these events will definitely be at All Things Open, where we're going to gather more potential endorsements. As soon as we get to five endorsements from each of the different categories, I think we're going to be able to say "this is version one", and we can start working with it and see what works and what doesn't. And maybe by that time next year the board will also have the process for the maintenance of this definition, because most likely we're going to have to think about how to maintain it, how to respond to challenges, whether they're technological or regulatory challenges, or we just missed the mark and we'll realize later that we have to fix it.

Yeah, I kind of want to backtrack slightly, I guess, as I hear you talk about this and about coming to, you know, a version one at last sometime this year, based upon certain details... Like, when I asked you, and I know this is your response and not so much a corporate response, in terms of what's the benefit of being open source artificial intelligence, what's the benefit of being open source AI... All this effort to define it, and then what if there's not that many people who really want to be defined by it? I guess that's an interesting consideration: all this effort to define it, but maybe there is no real benefit, or the benefit is unclear, and then folks just... It's almost like the definition draws a line, right? It's like, well, okay, everything is basically not open source AI, and there are very few that basically are, at least initially, and maybe as iteration and progress happen, more and more will see the benefit, and maybe that benefit permeates more clearly than we can see it now.

Yeah, I don't want to think about that. Okay? I don't want to think about that... No, it's one of those things where, if you start anything by thinking about failing, you're probably going to fail, right? So it's not one of the outcomes that I see. There is a tremendous amount of pressure, I mean... It's unlikely that that's going to happen, that's what I... But what I want to say is, I have had a lot of pressure from corporations, regulators... Like, the AI Act has a provision in there; the text says that it provides some exclusions to the mandates of the law for open source AI, but there's no definition in there. So, you know, regulators need that, the largest corporations need it, researchers need some clarity. I hear from a lot of researchers: they want data. It doesn't mean that it necessarily has to be the original data, for some of them at least, but they do want to have good data sets, and that only comes if there is clarity about the boundaries of what they are allowed to accumulate, because data becomes very, very messy very quickly. Privacy law, copyright law, trade secrets, illegal content... You know, content that is illegal in some parts of a country, or in some countries and not in others... It becomes really, really messy very quickly, and researchers don't have a way to deal with it right now. They need help.

I agree that you should keep doing it. I didn't mean to make it sound like it should be a failure. Sometimes I think it might be beneficial to think about failure at the beginning, because it's like, well, you've got to consider your exit before you go in, in a way. I'm not saying you should do that, but I'm glad you are defining it; it does need to be defined. I didn't mean to be, necessarily, like "what if", but you know, there's a lot of effort going into this. I can see how a lot of your attention is probably spent simply on defining this and working with all the folks, all the stakeholders, all the opinion makers, etc. that are necessary to define what it is. It's a lot of work.

It is a lot of work, and you're absolutely right, this is taking most of my attention. And yes, I do see a couple of ways this can fail: we can fail if we're late, and if we get it wrong. But for getting it wrong, the fact that it's defined with a version number... I think we can fix it over time, and we really shouldn't expect to have it perfect the first time, because the whole landscape is changing too quickly. And the risk of being late is also part of the reason why I'm pushing to get something out the door, because a lot of pressure exists in the market to have something, and everyone is calling their models open source AI, recognizing implicitly that there is value in that term; but if there is no clarity, it's going to be diluted very, very rapidly.

Before Jared and I got on this call, one thing... We had a loose discussion, then I quickly stopped talking, because we have a term, I think it's pretty well known in broadcasting and podcasting, which is "don't waste tape", right? And I didn't want to share my deep sentiment, although I loosely mentioned it to Jared in our pre-call, just kind of ten minutes before we met up. It was basically: what is at stake? I know we talked, you know, just loosely here about failure as an option, and what is failure, and is it iterative on the version numbers you just mentioned... But is there a bigger concern at stake if the definition that you come up with collectively is not perfectly suited? Like, does the term "open source" in software... is the term now fractured, because the arbiter of the term open source has not been able to carefully and accurately define open source AI? Is there a bigger loss that could happen? And I'm sorry to have to ask that question, but I have to.

Yeah... you don't want me to sleep tonight. I think... I mean, I think so far we've been able to "win", in quotes, in the public, when we push back on the term open source, because it's pretty well accepted, right?

Yeah. And, I want to say this: whether we like it or not, OSI has been the guardian, so to speak, of that term. Some say you've taken that right; I think you've been given that right, over decades of trust, and in some cases there's some mistrust. And that's not so much me, it's just out there. Not everybody's been happy with every decision you come up with, and that's going to be the case, right? If you're not making some enemies, you're probably not doing something right, I suppose, because not everyone in the world is going to like your choices, right?

Right.

But I wonder... I personally wonder, if you can't define this well, does the term open source change, or does it become open to change?

That is a concern, I'm aware, but that's one of the reasons why I'm being extra careful to make sure that everyone is involved and has a voice, and has the chance to voice their opinion, and all of these opinions are recorded publicly, so we can go back and point out the places where we messed up, by choice or not, and be able to correct it, or not.

Yeah. Stefano, real quick, what's the number one place people should go if they want to get involved? Like, the URL: here's how you can be part of that discussion.

discuss.opensource.org. That's where we're going to be having all our conversations.

There we go. Alright, you heard it; that'll be in the show notes. So if you are interested in this, even if you just want to listen and be lurking and watching as it makes progress, definitely hit that up. And if you want your voice heard, and you want to help Stefano and his team make this definition awesome and encompassing and successful...

Yes, I think the more voices the better, and the earlier on the better, so that we can have a great Open Source AI Definition. Thank you.

Thanks, Stefano. Appreciate your time.

Thank you so much. Thank you.

It's a big question mark what the future of the Open Source AI Definition will be. Well, the first draft of the Open Source AI Definition is linked in the show notes. I highly encourage you to check it out, dig in, learn about what's happening here, voice your opinion if you have a strong opinion, but definitely pay attention. As you can hear, with some of the discomfort around the questions we asked about what happens if the Open Source AI Definition falls a little short, or what the ramifications or potential impact might be, I think we all need to pay close attention to how this definition evolves and lands. Links are in the show notes, so check them out. And again, thank you to Stefano, because he did have a cold during this conversation, and he powered through because he knew this was an important conversation to have here on this podcast and to share with you. So thank you, Stefano.

Up next on the pod is our friend Jamie Tanna, coming up on Changelog & Friends, and next week it's about making your shell magical with Ellie Huxtable, talking about Atuin. Check it out at atuin.sh.

Okay, once again, a big thank you to our friends and our partners at fly.io, our friends at typesense.org, and of course our friends at sentry.io. Use the code CHANGELOG to get a hundred dollars off the team plan. You can do so at sentry.io.

Okay, BMC, those beats are banging. We have that album out there, Dance Party. I don't know about you, but I've been dancing a lot more, because that album has been on repeat in all the places I listen to music. So I've been dancing a lot. Dance Party is out there, check it out at changelog.com/beats. That's it, this show's done. Thank you for tuning in. We'll see ya soon.
