Why ML Needs a New Programming Language with Chris Lattner

Sep 03, 2025 | 1 hr 13 min | Season 3 | Ep. 25

Summary

Chris Lattner, creator of LLVM and Swift, discusses his latest venture, Modular, and the programming language Mojo. He highlights the fragmentation in the AI compute landscape and the need for a unified, high-performance, and portable software stack for modern accelerators. Mojo aims to provide Pythonic usability with the low-level control and performance typically found in C++, leveraging advanced metaprogramming and a robust type system to address the challenges of heterogeneous hardware. The conversation also explores Modular's business model and the future of Mojo as a Python extension and potential Rust replacement.

Episode description

Chris Lattner is the creator of LLVM and led the development of the Swift language at Apple. With Mojo, he’s taking another big swing: How do you make the process of getting the full power out of modern GPUs productive and fun? In this episode, Ron and Chris discuss how to design a language that’s easy to use while still providing the level of control required to write state-of-the-art kernels. A key idea is to ask programmers to fully reckon with the details of the hardware, while making that work manageable and shareable via a form of type-safe metaprogramming. The aim is to support both specialization to the computation in question as well as to the hardware platform. “Somebody has to do this work,” Chris says, “if we ever want to get to an ecosystem where one vendor doesn’t control everything.”

You can find the transcript for this episode on our website.

Transcript

Introduction and Early Computing Passion

Welcome to Signals and Threads, in-depth conversations about every layer of the tech stack from Jane Street. I'm Ron Minsky. It is my great pleasure to have Chris Lattner on the show. Typically, on Signals and Threads, we end up talking to engineers who work here at Jane Street.

But sometimes we like to grab outside folk. And Chris is an amazing figure to bring on because he's been so involved in a bunch of really foundational pieces of computing that we all use: LLVM and Clang and MLIR and OpenCL and Swift, and now Mojo. And this has happened at a bunch of different storied institutions: Apple and Tesla and Google and SiFive and now Modular. So anyway, it's a pleasure to have you joining us, Chris.

Thank you, Ron. I'm so happy to be here.

From BASIC to Compiler Engineering

I guess I want to start by just hearing a little bit more about your origin story. How did you get into computing and how did you get into this world of both compiler engineering and programming language design?

So I grew up in the 80s, and back before computers were really a thing, I mean, we had PCs, but they weren't considered cool. And so I fell in love with understanding how the computer worked. And back then, things were way simpler. I started with a BASIC interpreter, for example, and get a book from the store. Remember when we had books? And you learn things from books.

Did you do the thing where you get the hobbyist magazine and copy out the listing of the program from it?

That's exactly right. And so we didn't have vibe coding, but we did have books. And so just by typing things in, you could understand how things work. And then when you broke it, because inevitably you're typing something in and you don't really know what you're doing, you have to figure out what went wrong. And so it encouraged a certain amount of debugging.

I really love computer games. Again, back then, things were a little bit simpler. Computer games drove graphics and performance and things like this. And so I spent some time on these things called bulletin board systems and the early internet reading about how game programmers were trying to push the limits of the hardware. And so that's where I got interested in performance and computers and systems.

I went on to college and had an amazing professor at my school. Shout out to University of Portland in Portland, Oregon. He was a compiler nerd, and so I think that his love for compilers was infectious. His name was Steven Vegdahl. And that caused me to go on to pursue compilers at University of Illinois. And there again, I continued to fall down this rabbit hole of compilers and systems and built LLVM.

And ever since I got into the compiler world, I loved it. I love compilers because they're large-scale systems. There are multiple different components that all work together. And in the university setting, it was really cool in the compiler class, just because unlike most of the assignments where you do an assignment, turn it in, forget about it, in compilers you would do an assignment, turn it in, get graded, and then build on it.

It felt much more realistic, like software engineering, rather than just doing a project to get graded.

Yeah, I think for a lot of people, the OS class is their first real experience of doing a thing where you really are building layer on top of layer. I think it's an incredibly important experience for people as they start engineering.

It's also one where you get to use some of those data structures. I took this almost academic, here's what a binary tree is, and here's what a graph is. And particularly when I went through it, it was taught from a very math-forward perspective, but it really made it useful. And so that was actually really cool. I'm like, oh, this is why I learned this stuff.

Compiler Engineering and Language Design

So one thing that strikes me about your career is that you've ended up going back and forth between compiler engineering and language design space. Whereas I feel like a lot of people are on one side or the other, you know, they're mostly compilers people and they don't care that much about the language and just how do we make this thing go fast. And there are some people who are really focusing on language design, and the work on the compiler is a secondary thing towards that design. And you've popped back and forth between both. And then also a lot of your compiler engineering work, really starting with LLVM, is in some sense itself very language forward. LLVM, there's a language in there that's this intermediate language that you're surfacing as a tool for people to use. So I'm just curious to hear more about how you think about the back and forth between compiler engineering and language design.

The reason I do this is that effectively my career is following my own interests.

My interests are not static. I want to work on different kinds of problems and solve useful problems and build into things. The more technology and capability you have, the higher you can reach. With LLVM, for example, I built and learned a whole bunch of cool stuff about deep code generation for an x86 chip, that category of technology with register allocation and stuff like this. But then it made it possible to go say, let's go tackle C++ and let's go use this to build the world's best implementation of something that lots more people use and understand than deep backend code generation technology.

And then with Swift, it was build even higher and say, okay, well, C++, maybe some people like it, but I think we can do better and let's reach higher. I've also been involved in AI systems, been involved in building an iPad app to help teach kids how to code. And so lots of different things over time. And so for me, the place I think I'm most useful and where a lot of my experience is valuable ends up being at this hardware-software boundary.

Swift's Genesis and Design Philosophy

I'm curious how you ended up making the leap to working on Swift. From my perspective, Swift looks from the outside like one of these points of arrival in mainstream programming contexts of a bunch of ideas that I have long thought are really great ideas in other programming languages. And I'm curious, because it's in some ways a step away from like, oh, I'm going to work on really low-level stuff and compiler optimization, and then we'll go much higher level and do a C++ implementation, which is still pretty low level. How did the whole Swift thing happen?

Great question. I mean, the time frame for people that aren't familiar is that LLVM started in 2000. So by 2005, I had exited university and I joined Apple. And so LLVM was kind of an advanced research project at that point. By 2010, LLVM was much more mature, and we had just shipped C++ support in Clang. And so it could bootstrap itself, which means the compiler could compile itself. It's all written in C++. It could build advanced libraries like the Boost template library, which is super crazy advanced template stuff. And so the C++ implementation that I and the team had built was real. Now, C++, in my opinion, is not a beautiful programming language. And so implementing it is a very interesting technical challenge for me.

For me, a lot of problem solving ends up being, how do you factor the system the right way? And so Clang has some really cool stuff that allowed it to scale and things like that. But I was also burned out. We had just shipped it. It was amazing. I'm like, there has to be something better.

The Academic vs. Utility Divide

And so Swift really came starting in 2010. It was a nights and weekends project. It wasn't like top-down management said, let's go build a new programming language. It was Chris being burned out. I was running a 20 to 40 person team at the time, being an engineer during the day and being a technical leader, but then needing an escape hatch. And so I said, okay, well, I think we can have something better. I have a lot of good ideas. Turns out, programming languages are a mature space. It's not like you need to invent pattern matching at this point. It's embarrassing that C++ doesn't have good pattern matching.

We should just pause on this. I think this is a small but really essential thing. I think the single best feature coming out of languages like ML in the mid-70s is, first of all, this notion of an algebraic data type, meaning every programming language on Earth has a way of saying this and that and the other: a record or a class or a tuple. A weird programming language, I think it was... Barbara Liskov?

Yeah, and she did a lot of the early theorizing about what are abstract data types. But the ability to do this or that or the other, to have data types that are a union of different possible shapes of the data, and then having this pattern matching facility that lets you basically, in a reliable way, do the case analysis so you can break down what the possibilities are, is just incredibly useful. And very few mainstream languages have picked it up. I mean, Swift, again, is an example, but languages like ML, SML, and Haskell, and OCaml.

That's right, SML, Standard ML. It's been there for a long time. I mean, pattern matching, it's not an exotic feature. Here we're talking about 2010. C# didn't have it, C++ didn't have it, obviously Java didn't have it. I don't think JavaScript had it. None of these mainstream languages had it, but it's obvious. And so part of my opinion about that, and so by the way, I represent as an engineer. I'm not actually a mathematician. And so type theory goes way over my head. I don't really understand this. The thing that gets me frustrated about the academic approach to programming languages is that people approach it by saying there's sum types and there's intersection types and there's these types and they don't start from utility forward.

With pattern matching, when I learned OCaml, it's so beautiful. It makes it so easy and expressive to build very simple things. And so to me, I always identify with the utility. And then, yes, there's amazing formal type theory behind it, and that's great, and that's why it actually works and composes.

Bringing that stuff forward and focusing on utility and the problems it solves and how it makes people happy ends up being the thing that I think moves the needle in terms of adoption, at least in mainstream.

Design Process and Learning from Complexity

Yeah, I mean, I think that's right. My approach also, and my interest in languages, is also very much not from the mathematical perspective. Although, you know, my undergraduate degree is in math. I like math a lot, but I mostly approach these things as a practitioner. But the thing I've been struck by over the years is that the value of having these features have a really strong mathematical foundation is they generalize and, as you were saying, compose much better. If they are in the end mathematically simple, you're way more likely to have a feature that actually pans out as it gets used way beyond your initial view as to what the thing was for.

That's right. Well, and see, this is actually a personal defect because I don't understand the math in the way that maybe theoretically would be ideal. I end up having to rediscover certain truths that are... the cliche of the Russian mathematician invented it 50 years ago. And so a lot of what I find is that I can find truth and beauty when things compose and things fit together. And often I'll find out it's already been discovered because everything in programming languages has been done. There's almost nothing novel. But still that design process of saying, let's pull things together. Let's reason about why it doesn't quite fit together. Let's go figure out how to better factor this. Let's figure out how to make it simpler these days. That process, to me, I think is kind of like working on physics, I hear. The simpler the outcome becomes, the closer to truth it feels like it is. And so I share that. Maybe it's more design gene or engineer design combination, but it's probably what you mathematicians actually know inherently, and I just haven't figured it out yet.

Do you find yourself doing things after you come to it from an engineering perspective, trying to figure out whether there are useful mathematical insights? Do you go back and read the papers? Do you have other PL people who are more mathematically oriented who you talk to? How do you extend your thinking to cover some of that other stuff?

The problem is math is scary to me. I see Greek letters and I run away. I do follow arXiv and things like this, and there's a programming language section on that. I get into some of it, but what I get attracted to in that is the examples, the results section and the future-looking parts of it. And so it's not necessarily the how, it's what it means. And so I think a lot of that really speaks to me. The other thing that really speaks to me when you talk about language design and things like this is blog posts from some obscure academic programming language that I've never heard of. You just have somebody talking about algebraic effect systems for this and then the other thing or something really fancy, but they figure out how to explain it in a way that's useful. And so when it's not just, let me explain to you the type system, but it's, let me explain.

I think there's a lot of value in the work that's done in papers of really working out in detail the theory and the math and how it all fits together. But yeah, I think the fact that the world has been filled with a lot of interesting blog posts from the same people has been great because I think it's another modality where it often encourages you to pull out the simpler and easier to consume versions of those ideas. And I think that is just a different kind of insight and it's valuable to surface that too.

And also when I look at those blog posts, sometimes they're a design smell. Particularly in the C++ community, there's a lot of really good work to fix C++. They're adding a lot of stuff to it, and C++ will never get simpler. You can't really remove things, right? And so a lot of the challenge there is it's constrained problem-solving.

And so when I look at that, often what I'll see when I'm reading one of those posts, and again, these are brilliant people and they're doing God's work trying to solve problems with C++. Best of luck with that. But you look at that and you realize there's a grain of sand in the system that didn't need to be there. If you remove that grain of sand, then the entire system gets relaxed. And suddenly all these constraints fall away and you can get to something much simpler. And Swift, for example, it's a wonderful language and it's grown really well and the community is amazing. But it has a few grains of sand in it that cause it to get a lot more complicated. And so this is where I'm not just happy with things that got built.

is amazing. It's very practical, but it has lots of problems. That's why, when I get a chance to build a next-generation system, I want to learn from that and actually try to solve these problems.

Modular and Mojo: Unifying AI Compute

So this is the great privilege of getting to work on a new language, which is a thing you're doing now, right? There's this new language called Mojo, and it's being done by this company that you co-founded called Modular. Maybe just so we understand the context a little bit, can you tell me a little bit about what is Modular? What's the basic offering? What's the business model?

Before I even get there, I'll share more of how I got here. If you oversimplify my background, I did this LLVM thing, and it's foundational compiler technology for CPUs. It helped unite a lot of CPU-era infrastructure, and it provided a platform for languages like Swift, but also Rust and Julia and many different systems that all got built on top of it. And I think it really catalyzed and enabled a lot of really cool applications of accelerated compiler technology. People use LLVM in databases and for query engine optimization, lots of cool stuff, for trading or something. I mean, there can be tons of different applications for this kind of technology. And then I did programming language stuff with Swift.

But in the meantime, AI happened. And so AI brought this entirely new generation of compute: GPUs, tensor processing units, large-scale AI training systems, FPGAs and ASICs, and all this complexity for compute. And LLVM never really worked in that system. And so one of the things that I built when I was at Google was a bunch of foundational compiler techniques for that category of systems. And there's this compiler technology called MLIR. MLIR is basically LLVM 2.0. And so take everything you learned from building LLVM and helping solve this, but then bring it forward into this next generation of compiler technology so that you can go hopefully unify the world's compute for this GPU and AI and ASIC kind of world. MLIR has been amazingly successful and I think it's used in roughly every one of these AI systems and GPUs. It's used by NVIDIA, it's used by Google.

Fragmented AI Software Ecosystem

It's used by roughly everybody in the space. But one of the challenges is that there hasn't been unification. And so you have these very large-scale AI software platforms. You have CUDA from NVIDIA. You have XLA from Google. You have ROCm from AMD. It's countless. Every company has their own software stack.

And one of the things that I discovered and encountered, and I think the entire world sees, is that there's this incredible fragmentation driven by the fact that each of these software stacks built by a hardware maker are just all completely different.

And some of them work better than others. But regardless, it's a gigantic mess. And there's these really cool high-level technologies like PyTorch that we all love and we want to use. But if PyTorch is built on completely different stacks and it's gluing together these megalithic worlds from different vendors, it's very difficult to get something that works.

Right. There are both complicated trade-offs around the performance that you get out of different tools, and then also a different set of complicated trade-offs around how hard they are to use, how complicated it is to write something in them, and then what hardware you can target from each individual one. And each of these ecosystems is churning just incredibly fast. There's always new hardware coming out and new vendors in new places. And there's also new little languages popping up into existence. And it makes the whole thing pretty hard to wrangle.

Exactly. And AI is moving so fast. There's a new model every week. It's crazy. And new applications, new research, the amount of money being dumped into this by everybody is just incredible. And so how does anybody keep up? It's a structural problem in the industry. And so the structural problem is that the people doing this kind of work, the people doing code generation for advanced GPUs and things like this, they're all at hardware companies.

And the hardware companies, every single one of them is building their own stack because they have to. There's nothing to plug into. There's nothing like LLVM but for AI. That doesn't exist. And so as they go and build their own vertical software stack, of course they're focused on their hardware.

They've got advanced roadmaps. They have a new chip coming out next year. They're plowing their energy and time into solving for their hardware. But we out in the industry, we actually want something else. We want to be able to have software that runs across multiple pieces of hardware. And so if everybody doing the work is at a hardware company, it's very natural that you get this fragmentation across vendors because nobody's incentivized to go work together. And even if they're incentivized, they don't have time to go work on somebody else's chip. AMD's not going to pay to work on NVIDIA GPUs or something like this.

That's true when you think about this kind of a split between low-level and high-level languages. NVIDIA has CUDA, and AMD has ROCm, which is mostly a clone of CUDA. And then the XLA tools from Google work incredibly well on TPUs and so on and so forth. Different vendors have different things. Then there's like the high-level tools, PyTorch and JAX and Triton, and various things like that. And those are typically actually not made by the hardware vendors. Those are made by different kinds of users. I guess Google is responsible for some of these, and they are also sometimes a hardware vendor. But a lot of the time, it's more stepped back. Although even there, the cross-platform support is complicated and messy and incomplete.

Because they're built on top of fundamentally incompatible things. So that's the fundamental nature. And so again, you go back to Chris's dysfunction and my weird career choices. I always end up back at the hardware-software boundary. And there's a lot of other folks that are really good at adding very high-level abstractions. But adding abstractions on top of two things that don't work very well can't solve performance or reliability or management or these other problems. You can only add a layer of duct tape. But as soon as something goes wrong, you end up having to debug this entire crazy stack of stuff that you really didn't want to have to know about. And so it's a leaky abstraction.

Mojo's Ambitious Goal: CUDA Successor

And so the genesis of Modular, bringing it back to this, was realizing there are structural problems in the industry. There is nobody that's incentivized to go build a unifying software platform and do that work at the bottom level. And so what we set off to do is we said, okay, let's go build.

And there's different ways of explaining this. You could say a replacement for CUDA. That's like a flamboyant way to say this. But let's go build a successor to all of this technology that is better than what the hardware makers are building and is portable. And so what this takes is doing the work that these hardware companies are doing, and setting the goal for the team of saying, let's do it better than, for example, NVIDIA is doing it for their own hardware.

Which is no easy feat. They've got a lot of very strong engineers and they understand their hardware better than anyone does. Beating them on their own hardware is tough.

That is really hard to start, because CUDA is about 20 years old. They've got all the momentum. They're a pretty big company. As you say, lots of smart people. And so that was a ridiculous goal. Why did I do that? Well...

I mean, a certain amount of confidence in understanding how the technology worked, having a bet on what I thought we could build and the approach and some insight and intuition, but also realizing that it's actually destiny. Somebody has to do this work.

If we ever want to get to an ecosystem where one vendor doesn't control everything, if we want to get the best out of the hardware, if we want to get new programming language technologies, if we want pattern matching on a GPU, then we need at some point to do this. And if nobody else is going to do it, I'll step up and do that.

And so that's where Modular came from, is saying, let's go crack this thing open. I don't know how long it will take, but sometimes it's worthwhile doing really hard things if they're valuable to the world. And the belief was it could be profoundly impactful and hopefully get more people into even just being able to use this new form of compute with GPUs and accelerators and all this stuff and just really re-democratize AI compute.

Modular's Business Model and Enterprise Value

So you pointed out that there's a real structural problem here, and I'm actually wondering how at a business model level do you want to solve the structural problem, which is that the history of computing is these days littered with the bodies of companies that tried to sell a programming language. It's a really hard business. How is Modular set up so that it's incented to build this platform in a way that can be a shared platform that isn't subject to just one other vendor's lock-in?

First answer is, don't sell a programming language. As you say, that's very difficult. So we're not doing that. Go take Mojo, go use it for free. We're not selling a programming language. What we're doing is we're investing in this foundational technology to unify hardware.

Our view is, as we've seen in many other domains, once you've fixed the foundation, now you can build high-value services for enterprises. And so our enterprise layer, often what we talk to, you end up with these groups where you have hundreds or thousands of GPUs. They're rented from a cloud on a three-year commit. You have a platform team that's carrying pagers and they need to keep all the stuff running and all the production workloads running. And then you have these product teams that are inventing new stuff all the time. And there's new research, there's a new model that comes out and they want to get it on the production infrastructure.

But none of this stuff actually works. And so the software ecosystem we have with all these brilliant but crazy open source tools that are thrashing around, all these different versions of CUDA and libraries, all this different hardware happening, it's just a gigantic mess. And so helping solve this for the platform engineering team that actually needs to have stuff work and wants to be able to reason about it and wants good observability and manageability and scalability and things like this is actually, we think, very interesting. We've gotten a lot of good response from people on that. The cost of doing this is we want to actually make it work, which is where we do fundamental language, compiler, and underlying systems technology and help bring together these accelerators so that we can get, for example, the best performance on an AMD GPU and get it so that the software comes out in the same release train as support for an NVIDIA GPU. And being able to pull that together, again, it just multiplicatively reduces complexity, which then leads to a product that actually works, which is really cool and very novel in AI.

Why a New Language for Accelerators?

So the way that Mojo plays in here is it basically lets you provide the best possible performance and it gets the best possible performance across multiple different hardware platforms. Are you primarily thinking about this as an inference platform or how does the training world fit in? So let me zoom out.

I'll explain our technology components. I have a blog post series I encourage you and any viewers or listeners to check out called Democratizing AI Compute. It goes through the history of all the systems, the problems and challenges that they've run into, and it gets to, what is Modular doing about it? And so part 11 talks about architecture, and the inside is Mojo, which is a programming language. I'll explain Mojo in a second. Next level out is, it's called MAX, and so you can think of MAX as being a PyTorch replacement or a vLLM replacement, something that you can run on a single node and then get high-performance LLM serving, that kind of use case. And then the next level out is called Mammoth, and this is the cluster management Kubernetes kind of layer.

So, Mojo. You say, from your experience, you know what programming languages are. They're incredibly difficult and expensive to build. Why would you do that in the first place? And the answer is, we had to. In fact, when we started Modular, I was like, I'm not going to invent a programming language. I know that's a bad idea. It takes too long. It's too much work.

You can't convince people to adopt a new language. I know all the reasons why creating a language is actually a really bad idea. But it turns out we were forced to do this because there is no good way to solve the problem. And the problem is, how do you write code that is portable across accelerators?

So, that problem: I want portability across, for example, to make it simple, AMD and NVIDIA GPUs. But then you layer on the fact that you're using a GPU because you want performance. And so I don't want a simplified, watered-down "I want Java that runs on a GPU." I want the full power of the GPU.

I want to be able to deliver performance that meets and beats NVIDIA on their own hardware. I want to have portability and unify this crazy compute where you have these really fancy heterogeneous systems and you have tensor cores and you have this explosion of complexity and innovation happening in this hardware platform layer. Most programming languages don't even know that an 8-bit floating point exists. And so we looked around, and I really did not want to have to do this, but it turns out that there really is no good answer. And again, we decided that, hey, the stakes are high. We want to do something impactful, we're willing to invest. I know what it takes to build a programming language. It's not rocket science, it's just a lot of really hard work and you need to set the team up to be incentivized the right way. But we decided that, yeah, let's do that.

The ML Hardware Landscape

So I want to talk more about Mojo and its design, but before we do, maybe let's talk a little bit more about the pre-existing environment. I did actually read that blog post series. I recommend it to everyone. I think it's really great. And I want to talk a little bit about what the existing ecosystem of languages looks like. But even before then, can we talk more about the hardware? What does the space of hardware look like that people want to run these ML models on?

Yeah, so the one that most people zero in on is the GPU. And so GPUs are, I think, getting better understood now. And so if you go back before that, though, you have CPUs. So modern CPUs in a data center. Often you'll have, I mean, today, you guys are probably writing. quite big iron, but you got 100 cores and a CPU, and you got a server with two to four CPUs on a motherboard.

And then you go and you scale that. And so you've got traditional threaded workloads that have to run on CPUs, and we know how to scale that for internet servers and things like this. If you get to a GPU, the architecture shifts. And so they have basically these things called SMs. And now the programming model is that you have effectively much more medium-sized compute that's now put together on much higher performance memory fabrics, and

the programming model shifts. And one of the things that really broke CUDA, for example, was when GPUs got this thing called a Tensor Core. And the way to think about a Tensor Core is it's a dedicated piece of hardware for matrix multiplication.

And so why do we get that? Well, a lot of AI is matrix multiplication. And so if you design the hardware to be good at a specific workload, you can have dedicated silicon for that and you can make things go really fast. There are really these two quite different models sitting

inside of the GPU space. Of course, the name itself is weird. GPU is graphics processing unit, which is what they were originally for. And then this SM model is really interesting. They have this notion of a warp, right? A warp is a collection of typically 32 threads that are operating together kind of in

lockstep, always doing the same thing: a slight variation on what's called the SIMD model, single instruction, multiple data. It's a little more general than that, but more or less you can think of it as the same thing.
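To make the lockstep idea concrete, here is a toy Python model of one warp. It is only a sketch of the programming model, not of how the hardware is built: the function name `warp_execute`, the even/odd branch, and the step counter are all made up for illustration. Every lane runs the same instruction stream, and a branch is handled by masking lanes off, which is one way the model is "a little more general" than plain SIMD.

```python
WARP_SIZE = 32  # typical NVIDIA warp width, as mentioned in the conversation


def warp_execute(lane_inputs):
    """Run one toy 'warp': all lanes step through the same instructions.

    A branch is handled by masking: lanes whose condition is false sit idle
    while the other lanes run, and then the roles swap.
    """
    assert len(lane_inputs) == WARP_SIZE
    results = [0] * WARP_SIZE
    steps = 0

    # Instruction 1: every lane evaluates the branch condition together.
    mask = [x % 2 == 0 for x in lane_inputs]
    steps += 1

    # Instruction 2: the "then" side, executed only by lanes with mask True.
    if any(mask):
        for lane in range(WARP_SIZE):
            if mask[lane]:
                results[lane] = lane_inputs[lane] * 2
        steps += 1

    # Instruction 3: the "else" side, executed only by the remaining lanes.
    if not all(mask):
        for lane in range(WARP_SIZE):
            if not mask[lane]:
                results[lane] = lane_inputs[lane] + 1
        steps += 1

    return results, steps


# All lanes agree on the branch: only one side is ever issued.
uniform, uniform_steps = warp_execute([2] * WARP_SIZE)

# Lanes disagree: both sides are issued, each with part of the warp idle.
divergent, divergent_steps = warp_execute(list(range(WARP_SIZE)))

print(uniform_steps, divergent_steps)
```

In this toy model a warp whose lanes disagree on a branch pays for both sides of it, which is the usual reason divergent control flow is avoided in kernels.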

And you just have to run a lot of them. And then there's a ton of hardware inside of these systems basically to make switching between threads incredibly cheap. So you pay a lot of silicon to add extra registers. So the context switch is super cheap. So you can do a ton of stuff in parallel.

Each thing you're doing is itself like 32-wide parallel. And then because you can do all this very fast context switching, you can hide a lot of latency. And that worked for a while. And then we're like, actually, we need way more of this matrix multiplication stuff. And you can sort of do reasonably efficient matrix multiplication through this warp model, but not really that good. And then there's a bunch of quite idiosyncratic hardware which changes its performance characteristics

from generation to generation just for doing these matrix multiplications, right? So that's sort of the NVIDIA GPU story. Starting with Volta, it's V100 and A100 and H100. They just keep on going and changing pretty materially from generation to generation in terms of the performance characteristics, and then also the memory model, which keeps on changing.

If you go back to intuition, CUDA was never designed for this world. CUDA was not designed for modern GPUs. It was designed for a much simpler world. And CUDA being 20 years old, it hasn't really caught up. And it's very difficult because, as you say, the hardware keeps changing. And so CUDA was designed... Now, if you get beyond GPUs, you get to Google TPU and many other dedicated AI

systems. They blow this way out and they say, okay, well, let's get rid of the threads that you have on a GPU and let's just have matrix multiplication units and have really big matrix multiplication units and build the entire chip around that. And you get much more specialization, but you get a much higher throughput for those AI workloads.

Challenges of Existing GPU Languages

Going back to why Mojo? Mojo was designed from first principles to support this kind of system. Each of these chips, as you're saying, even within NVIDIA's family, the Volta to Ampere to Hopper to Blackwell, these things are not compatible with each other.

Actually, Blackwell just broke compatibility with Hopper, so you can't always run Hopper kernels on Blackwell. Oops. Well, why are they doing that? Well, AI software is moving so fast, they decided that was the right tradeoff to make. And meanwhile, we software people all need the ability to target

this. When you look at other existing systems, with Triton, for example, their goal was, let's make it easier to program a GPU, which I love, that's awesome. But then they said, we'll just give up 20% of the performance of the silicon to do it. Wait a second. I want all the performance. So if I'm using a GPU, GPUs are quite expensive, by the way.

I want all the performance, and if it's not going to be able to deliver the same quality results you'd get by writing CUDA, well then you're always going to run into this headroom where you get going quickly, but then you run into a ceiling and then have to switch to a different system to get full performance.

Mojo is really trying to solve this problem, where we can get more usability, more portability, and full performance of the silicon, because it's designed for these wacky architectures like Tensor Cores. And if we look at the other languages that are out there, there are languages like

CUDA and OpenCL, which are low-level, typically look like variations on C++, and in that tradition are unsafe languages, which means that there are a lot of rules you have to follow. And if you don't exactly follow the rules, you're in undefined behavior land. It's very hard to read. And just let me make fun of my C++ heritage, because I've spent so many years. You just have a variable that you forget to initialize. It just shoots your foot off. It's just unnecessary violence to programmers.

That's done to make performance better, because the idea is that C++ and its related languages don't really give you enough information to know when you're making a mistake, and they want to have as much space as they can to optimize the programs they get. So the stance is just: if you do anything that's not allowed, we have no obligation to maintain any kind of reasonable semantics

or debuggability around that behavior. And we're just going to try really, really hard to optimize correct programs, which is a super weird stance to take because nobody's programs are correct. There are bugs and undefined behavior in almost any C++ program of any size. And so you're in a very strange position in terms of the guarantees that you get from the compiler system you're using.

Well, so, I mean, I can be dissatisfied. I can also be sympathetic with people that work on C++. So again, I've spent decades in this language and around this ecosystem and building compilers for it. So I know quite a lot about it. The challenge is that...

C++ is established. And so there's tons of code out there. By far, the code that's already written is the code that's the most valuable. And so if you're building a compiler or you have a new chip or you have an optimizer, your goal is to get value out of the existing software.

And so you can't invent a new programming paradigm that's a better way of doing things and defines away the problem. Instead, you have to work with what you've got. You have a SPEC benchmark you're trying to make go fast. And so you invent some crazy heroic hack that makes some important benchmark work,

because you can't go change the code. In my experience, particularly for AI, but also I'm sure within Jane Street, if something's going slow, go change the code. You have control over the architecture of the system. And so what I think the world really benefits from, unlike benchmark hacking, is languages that give control and power and expressivity to the programmer.

AI Coding and Language Adoption Shift

And this is something where I think that if you, again, take a step back, you realize history is the way it is for lots of structural and very valid reasons, but those are reasons that don't apply to this new age of compute. Nobody has a workload that they can

pull forward to next year's GPU. Doesn't exist. Nobody solved this problem. I don't know the timeframe, but once we solve that problem, once we solve portability, you can start this new era of software that can actually go forward. And so now, to me, the burden is to make sure it's actually good. And so to your point about memory safety, don't make it so forgetting to initialize a variable is just gonna shoot your foot off;

produce a good compiler error saying, hey, you forgot to initialize a variable, right? These basic things are actually really profound and important in the tooling and all this usability. And this DNA, these feelings and thoughts are what flow into Mojo. And GPU programming is just a very different world from traditional CPU programming.

Just in terms of the basic economics and how humans are involved, you end up dealing with much smaller programs. You have these very small, but very high-value programs whose performance is super critical. And in the end, a relatively small coterie of experts who end up...

And so it pushes you ever in the direction you're saying of performance engineering, right? You want to give people the control they need to make the thing behave as it should. And you want to do it in a way that allows people to be highly productive. And the idea that you have an enormous amount of legacy code that you need to bring over, it's like, actually, you kind of

don't. The entire universe of software is actually shockingly small. And it's really about how to write these small programs as well as possible. And also there's another huge change. And so this is something that I don't think that the programming language community has recognized yet, but AI coding has massively changed the game.

Because now you can take a CUDA kernel and say, hey, Claude, go make that into Mojo. And actually, how good have you guys found the experience of that, of doing translation? Well, we do hackathons, and people do amazing things, having never touched Mojo, having never done GPU programming. And within a day...

they can make things happen that are just shocking. And so now, AI coding tools are not magic. You cannot just vibe code DeepSeek-R1 or something, right? But it's amazing what that can do in terms of learning new languages, learning new tools, and getting into... And so this is one of the things where, again, you go back five or 10 years, everybody knows nobody can learn a new language and nobody's willing to adopt new things, but the entire system has changed.

Mojo's Core Design: Pythonic Metaprogramming

So let's talk a little bit more in detail about the architecture of Mojo. What kind of language is Mojo and what are the design elements that you chose in order to make it be able to address this set of problems? Yeah, again, just to relate how different the situation is back when I was

working on Swift, one of the major problems to solve was that Objective-C was very difficult for people to use. You had pointers and you had square brackets and it was very weird. And so the goal and the game of the day was to invent new syntax and bring together modern programming language features to build a new language. Fast forward to today, actually some of that is true. AI people don't like C++. C++ has pointers and it's ugly and it's a

40-plus-year-old language and has actually the same problem that Swift had to solve back in the day. But today there's something different, which is that AI people do actually love a thing. It's called Python. And so one of the really important things about Mojo is it's a member of the Python family.

And so this is polarizing to some, because yes, I get it that some people love curly braces, but it's hugely powerful because so much of the AI community is Pythonic already. And so we start out by saying, let's keep the syntax like Python and only diverge from that if there's a really good reason. But then what are the good reasons?

Well, the good reasons are we want, as we were talking about, performance, power, full control over the system. And for GPUs, there are these very important things you want to do that require metaprogramming. And so Mojo has a very fancy metaprogramming system, kind of inspired by this language called Zig, that brings runtime and compile time together to enable really powerful library designs. And the way you crack open this problem with tensor cores and things like this is you enable really powerful libraries to be built in the language as libraries instead of hard coding into the compiler.

Metaprogramming for Performance

Let's dig a little bit into the metaprogramming idea. What is metaprogramming and why does it matter for performance in particular? Yeah, it's a great question. And I think you know the answer to this too, and I know you're a fan, but...

Secretly, we are also working on metaprogramming features in our own world. Exactly. And so the observation here is when you're writing a for loop in a programming language, for example, typically that for loop executes at runtime. So you're writing code that when you execute the program, it's the instructions that the computer will follow.

to execute the algorithm within your code. But when you get into designing higher level type systems, suddenly you want to be able to run code at compile time as well. And so there are many languages out there. Some of them have macro systems. C++ has templates. What you end up getting in many languages is this duality between what happens at runtime and then almost a different language that happens at compile time.

C++ is the most egregious because of templates. You have a for loop at runtime, but then you have unrolled recursive templates or something like that at compile time. Well, so the insight is, hey, these two problems are actually the same. They just run at different times. And so what Mojo does is...

It says, let's allow the use of effectively any code that you would use at runtime to also work at compile time. And so you can have a list or string or whatever you want, and the algorithms that go do memory allocation and deallocation, and you can run those at compile time,

enabling you to build really powerful, high-level abstractions and put them into libraries. So why is this cool? Well, the reason it's cool is that on a GPU, for example, you'll have a tensor core. Tensor cores are weird. We probably don't need to deep dive into all the reasons why. But the indexing

and the layout that tensor cores use is very specific and very vendor-specific. And so the tensor cores you have on AMD or the tensor cores you have on different versions of NVIDIA GPUs are all very different. And so what you want, as a GPU programmer, is to build a set of abstractions so

you can reason about all of these things in one common ecosystem and have the layouts much higher level. And so what this enables is very powerful libraries where a lot of the logic is actually done at compile time, but you can

debug it because it's the same language that you use at runtime. And it makes the language much simpler, much more powerful, and just able to scale into these complexities in a way that's possible with C++, but in C++ you get some crazy template stack trace that is

maddening and impossible to understand. In Mojo, you can get a very simple error message. You can actually debug your code in a debugger, and things like this. So maybe an important point here is that metaprogramming is really an old solution to this performance problem. Maybe a good way of thinking about this is, imagine you have

some piece of data that you have that represents a little embedded domain-specific language that you've written that you want to execute via a program that you wrote, you can, in a nice high-level way, write a little interpreter for that language that just, you know, I have maybe a Boolean expression language or who knows what else.

Maybe it's a language for computing on tensors in a GPU. And you could write a program that just executes that mini domain-specific language and does the thing that you want. And you can do it, but it's really slow. Writing an interpreter is just inherently slow because of all this interpretation overhead where you are dynamically making decisions about what the behavior of the program is. And sometimes what you want is you just want to actually emit exactly the code that you want

and boil away the control structure and just get the direct lines of machine code that you want to do the thing that's necessary. And various forms of code generation let you get past, in a simpler way, all of this

control structure you have to execute at runtime, and instead be able to execute it at compile time and get this minified program that just does exactly the thing that you want. So that's a really old idea. It goes back to all sorts of programming languages, a lot of Lisps that did a lot of this metaprogramming stuff, but then

The problem is this stuff is super hard to think about and reason about and debug. And that's certainly true if you think about C and all this macro language: if you use the various C preprocessors to do this kind of stuff in C, it's pretty painful to reason about. And then C++ made it

richer and more expressive, but still really hard to reason about. You write a C++ template and you don't really know what it's going to do or if it's going to compile until you give it all the inputs and let it go. It feels good in the simple case, but then when you get to more advanced cases, suddenly complexity compounds and it gets out of hand. And it sounds like the thing that you're going for in Mojo is it feels like one language.
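The interpreter-versus-generated-code point can be sketched in plain Python. This is an illustration of the general idea only, not of Mojo's mechanism: the tuple format for the Boolean expression language and the names `interpret`, `emit`, and `specialize` are hypothetical, and Python's `eval` stands in for a compile-time step.

```python
# A tiny Boolean expression language as nested tuples (hypothetical format):
#   ("var", name) | ("not", e) | ("and", e1, e2) | ("or", e1, e2)


def interpret(expr, env):
    """Walk the tree on every evaluation, deciding what to do at runtime."""
    tag = expr[0]
    if tag == "var":
        return env[expr[1]]
    if tag == "not":
        return not interpret(expr[1], env)
    if tag == "and":
        return interpret(expr[1], env) and interpret(expr[2], env)
    if tag == "or":
        return interpret(expr[1], env) or interpret(expr[2], env)
    raise ValueError(f"unknown node: {tag}")


def emit(expr):
    """Translate the tree into Python source text, once, ahead of time."""
    tag = expr[0]
    if tag == "var":
        return f"env[{expr[1]!r}]"
    if tag == "not":
        return f"(not {emit(expr[1])})"
    if tag in ("and", "or"):
        return f"({emit(expr[1])} {tag} {emit(expr[2])})"
    raise ValueError(f"unknown node: {tag}")


def specialize(expr):
    """Build a function whose body is just the expression, with no tree walk."""
    source = f"lambda env: {emit(expr)}"
    return eval(source), source


expr = ("and", ("var", "a"), ("or", ("not", ("var", "b")), ("var", "c")))
fast, source = specialize(expr)
print(source)
```

The specialized function gives the same answers as the interpreter, but the dispatch on node tags has been "boiled away" into straight-line code. The pain described above also shows up here: the generator manipulates strings in a second mini-language, which is exactly what a single language for both stages is meant to avoid.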

It has one type system that covers both the stuff you're generating statically and the stuff that you're doing at runtime. Sounds like debugging works in the same way across both of these layers. But you still get the actual runtime behavior you want from a language where you could more explicitly just say, here's exactly the code that I want to generate.

Usability, Performance, and Predictability

And so I zeroed in on metaprogramming as one of the fancy features. One of the cool features is it feels and looks like Python, but with actual types. Right. Right. And let's not forget the basics.

Having something that looks and feels like Python, but it's a thousand times faster or something, is actually pretty cool. For example, if you're on a CPU, you have access to SIMD, the SIMD registers that allow you to do multiple operations at a time, and being able to get the full power of your hardware

even without using the fancy features is also really cool. And so the challenge with any of these systems is how do you make something that's powerful, but it's also easy to use? I think your team has been playing with Mojo and doing some cool stuff. I mean, what have you seen and what's your experience been?

We're all still pretty new to it, but I think it's got a lot of exciting things going for it. I mean, the first thing is, yeah, it gives you the kind of programming model you want to get the performance that you need. And actually, in many ways, the same kind of programming model that you get out of something like CUTLASS or CuTe DSL,

which are these NVIDIA-specific tools, some at the C++ level, some at the Python DSL level. And by the way, every tool you can imagine nowadays is done once in C++ and once in Python. We don't need to implement programming languages any other way anymore. They're all either skins on C++

or skins on Python. But depending on which path you go down, whether you go the C++ path or the Python path, you get all sorts of complicated trade-offs. In the C++ path in particular, you get very painful compilation times. The thing you said about template metaprogramming is absolutely true: the error messages are super bad. If you look at these more Python-embedded DSLs, the compile times tend to be better.

It still can be hard to reason about, though. One nice thing about Mojo is the overall discipline seems very explicit. When you want to understand, is this a value that's happening at execution time at the end, or is it a value that you know is going to be...

dealt with at compile time? It's just very explicit in the syntax. You can look and understand. Whereas in some of these DSLs, you have to actively go and poke the value and ask it what kind of value it is. And I think that kind of explicitness is actually really important for performance engineering, making it easy to understand just what precisely

you're doing. You actually see this a ton, not even with these very low-level things, but if you look at PyTorch, which is a much higher-level tool, PyTorch does this thing where you get to write a thing that looks like an ordinary Python program. But really, it's got a much trickier execution model. Python's an amazing and terrible ecosystem in which to do this kind of stuff, because what guarantees do you have when you're using Python? None. What can you do? Anything.

You have an enormous amount of freedom. The PyTorch people in particular have leveraged this freedom in a bunch of very clever ways where you can write a Python program that looks like it's doing something very simple and straightforward that would be really slow. But no, it's very carefully delaying and making some operations lazy so it can overlap compute on the GPU and CPU and make stuff go really fast. And that's really nice, except sometimes it just doesn't work.

Beyond 'Sufficiently Smart Compilers'

This is the trap. Again, this is my decades of battle scars now. So as a compiler guy, I can make fun of other compiler people. There's this trap and it's an attractive trap, which is called the sufficiently smart compiler. And so what you can do is you can take something and you can make it look good on a demo.

And you can say, look, I make it super easy and I'm going to make my compiler super smart and it's going to take care of all this and make it easy through magic. But magic doesn't exist. And so anytime you have one of those sufficiently smart compilers, if you go back in the day, it was like auto-parallelization. Just write C code as sequential logic.

And then we're going to automatically map it into running on 100 cores on a supercomputer or something like that. They often actually do work. They work in very simple cases and they work in the demos. But the problem is that you go and you're using them and then you change one thing and suddenly everything breaks. Maybe the compiler crashes.

It just doesn't work. Or you go and fix a bug, and now instead of 100 times speedup, you get 100 times slowdown because it foiled the compiler. A lot of AI tools, a lot of these systems, particularly these DSLs, have this design point of, let me pretend like it's easy, and then I will take care of it behind the scenes. But then when something breaks, you have to end up looking at compiler dumps, right? And this is because magic doesn't exist.

And so this is where predictability and control is really, I think, the name of the game, particularly if you want to get the most out of a piece of hardware, which is how we ended up here. It's funny, the same issue of how clever the underlying system you're using is comes up when you look at the difference between CPUs and GPUs.

CPUs themselves are trying to do a weird thing, where a chip is a fundamentally parallel substrate. It's got all of these circuits that in principle could be running in parallel, and then it is yoked to running this extremely sequential programming language, which is just trying to do one thing after another.

And then how does that actually work with any reasonable efficiency? Well, there's all sorts of clever, dirty tricks happening under the covers where it's trying to predict what you're going to do, the speculation that allows it to dispatch multiple instructions in a row by guessing what you're...

going to do in the future. There's things like memory prefetching where it has heuristics to estimate what memory you're going to ask in the future so it can dispatch multiple memory requests at the same time. And then if you look at

things like GPUs, and I think even more TPUs, and then also totally other things like FPGAs, the field-programmable gate arrays where you put basically a circuit design on it. It's a very different kind of software system, but all of them are, in some sense, simpler and more deterministic

and more explicitly parallel. When you write down your program, you have to write an explicitly parallel program. That's actually harder to write. I don't want to complain too much about CPUs. The great thing about CPUs is they're extremely flexible and incredibly easy to use. All of that dark magic actually works a pretty large fraction of the time.

Yeah, remarkably well. But your point here, I think, is really great. What you're saying is CPUs are the magic box that makes sequential code go in parallel pretty fast. And then we have new, more explicit machines, somewhat harder to program because they're not a magic box. But you get something from it: you get performance and power.

Because that magic box doesn't come without a cost. It comes with a very significant cost, often the amount of power that your machine dissipates. And so it's not efficient. And so a lot of the reason we're getting these new accelerators is because people really do care about it being

100 times faster or using way less power or things like this. I'd never thought about it, but your analogy of Triton to Mojo kind of follows a similar pattern, right? Triton is trying to be the magic box, and it doesn't give you the full performance and it burns more power and all that

kind of stuff. And so Mojo is saying, look, let's go back to being simple. Let's give the programmer more control. And that more explicit approach, I think, is a good fit for people that are building crazy advanced hardware like you're talking about, but also people that want to get the best performance out of the existing hardware we have.

Mojo's Portable Performance Strategy

So we talked about how metaprogramming lets you write faster programs by boiling away this control structure that you don't really need. So that part's good. How does it give you portable performance? How does it help you on the portability front?

Yeah, so this is another great question. So in this category of sufficiently smart compilers, and particularly for AI compilers, there have been years of work, and MLIR has catalyzed a lot of this work, building these magic AI compilers that take TensorFlow or even the new PyTorch stuff and try to generate optimal code for some chip.

So take some PyTorch model and put it through a compiler and magically get high performance. And so there's tons of these things and there's a lot of great work done here. And a lot of people have shown that you can take

kernels and accelerate them with compilers. The challenge with this is that people don't ever measure what the full performance of the chip is. And so people always measure from a somewhat unfortunate baseline and then try to climb higher, instead of saying, what is the speed of light? And so if you measure from speed of light, suddenly you say, okay, how do I achieve several different things? Even if you zero in on one piece of silicon, how do I achieve the best performance for one use case?

And then how do I make it so the software I write can generalize even within the domain? And so, for example, take a matrix multiplication. Well, you want to work on maybe float32, but then you want to generalize it to float16.

Okay, well, templates and things like this are an easy way to do this. And metaprogramming allows you to say, okay, I will tackle that. And then the next thing that happens is because you went from float32 to float16, your effective cache size has doubled, because twice as many elements fit into cache if they're 16 bits than if they're 32 bits. Well, if that's the case, now suddenly the access pattern needs to change. And so you get a whole bunch of this conditional logic

that now changes in a very parametric way as a result of one simple change that happened with float32 to float16. Now you play that forward and you say, okay, well, actually matrix multiplication is a recursive hierarchical problem. There are specializations for tall and skinny matrices, or where a dimension is one or something. There are all these special cases.
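The float32-to-float16 arithmetic is easy to check with a few lines of Python. The numbers here are assumptions chosen for illustration: a 48 KiB cache and a blocking scheme that keeps three square tiles resident (one each for the two inputs and the output). Real kernels pick tile shapes with far more constraints than this.

```python
import math


def elements_in_cache(cache_bytes, bits_per_element):
    """How many elements of a given width fit in the cache."""
    return cache_bytes * 8 // bits_per_element


def square_tile_side(cache_bytes, bits_per_element, tiles_resident=3):
    """Largest square tile side such that `tiles_resident` tiles fit at once."""
    per_tile = elements_in_cache(cache_bytes, bits_per_element) // tiles_resident
    return math.isqrt(per_tile)


CACHE_BYTES = 48 * 1024  # assumed cache size, for illustration only

for name, bits in [("float32", 32), ("float16", 16)]:
    count = elements_in_cache(CACHE_BYTES, bits)
    side = square_tile_side(CACHE_BYTES, bits)
    print(f"{name}: {count} elements, tile side {side}")
```

Halving the element width doubles the element count, but the tile side only grows by roughly a factor of the square root of two, so the best tile shape and loop structure are no longer the same. That is the kind of parametric change that has to ripple through the kernel when the dtype changes.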

Just one algorithm for one chip becomes this very complicated subsystem that you end up wanting to do a lot of transformations to so you can go specialize it for different use cases. And so Mojo with the metaprogramming allows you to tackle that. Now you bring in other hardware. Okay, and so think of matrix multiplication these days as being almost an operating system, and there are so many different subsystems and special cases and

different dtypes and crazy float4 and float6 and other stuff going on. At some point, they're going to come out with a floating point number so small that it will be a joke. But every time I think that they're just kidding, it turns out it's real. Seriously, I heard somebody talking about 1.2-bit floating point.

Exactly like you're saying. Is that a joke? You can't be serious. And so now when you bring in other hardware, other hardware brings in more complexity because suddenly the Tensor Core has a different layout on AMD than it does on NVIDIA. Or maybe to your point about warps, you have 64 threads in a warp on one and 32 threads in a warp on the other. But what you realize is, wait a second, this really has nothing to do with hardware vendors. This is actually true even within,

for example, the NVIDIA line, because across these different data types, the tensor cores are changing. The way the tensor core works for float32 is different than the way it works for float4 or something. And so you already, within one vendor, have to have this very powerful metaprogramming to be able to handle the complexity and do so in the scaffolding of a single algorithm like matrix multiplication. And so now as you bring in other vendors, well, it turns out, hey, they all have

things that look roughly like tensor cores. And so we're coming at this with a software engineering perspective. And so we're forced to build abstractions. We have this powerful metaprogramming system so we can actually achieve this. And so even for one vendor, we get this thing called layout tensor. Layout tensor is

saying, okay, well, I have the ability to reason about not just an array of numbers or a multidimensional array of numbers, but also how it's laid out in memory and how it gets accessed. And so now we can declaratively map these things onto the hardware that you have, and these abstractions stack. And so it's this really amazing triumvirate between having a type system that works well. It's a very important basis. I know you're a fan of type systems also.
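The idea of separating "which element" from "where it lives in memory" can be sketched with a small Python class. This is not Mojo's layout tensor API; the `Layout` class, its `shape` and `strides` fields, and the helper functions are hypothetical names for illustration, and the offsets here are computed at runtime, whereas the discussion is about resolving them at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Maps a logical multidimensional index to an offset in flat memory."""

    shape: tuple
    strides: tuple

    def offset(self, *index):
        assert len(index) == len(self.shape)
        for i, n in zip(index, self.shape):
            assert 0 <= i < n, "index out of bounds"
        return sum(i * s for i, s in zip(index, self.strides))


def row_major(rows, cols):
    # Consecutive elements of a row are adjacent in memory.
    return Layout((rows, cols), (cols, 1))


def col_major(rows, cols):
    # Consecutive elements of a column are adjacent in memory.
    return Layout((rows, cols), (1, rows))


def load(buffer, layout, *index):
    """Read one logical element; the algorithm never touches raw offsets."""
    return buffer[layout.offset(*index)]


buffer = list(range(6))  # flat memory holding the values 0..5
a = row_major(2, 3)
b = col_major(2, 3)
print(load(buffer, a, 1, 0), load(buffer, b, 1, 0))
```

Code written against `load` stays the same when the layout changes, which is the property that lets one algorithm be mapped onto hardware units that expect different arrangements of the same data.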

You then bring in metaprogramming. And so you can build powerful abstractions that run at compile time. So you get no runtime overhead. And then you bring in the most important part of this entire equation, which is programmers who understand the domain. I am not going to write a fast matrix multiplication. I'm sorry, that's not my experience. But there are people in that space that are just freaking brilliant. They understand exactly how the hardware works.

They understand the use cases and the latest research and the new crazy quantized format of the day, but they're not compiler people. And so the magic of Mojo is it says, hey, you have a type system, you have metaprogramming, you have effectively the full power of a compiler

when you're building libraries. And so now these people that are brilliant at unlocking the power of the hardware can actually do this. And now they can write software that scales both across the complexity of the domain, but also across hardware. And to me, that's what I find so exciting, so powerful. It's like unlocking the power of the Mojo programmer instead of trying to put it into the compiler, which is what a lot of earlier systems have tried to do.

Managing Complexity with Type Systems

So maybe the key point here is that you get to build these abstractions that allow you to represent different kinds of hardware, and then you can conditionally have your code execute based on the kind of hardware that it's on. It's not like an ifdef where you're picking between different hardware platforms. There are complicated data structures.

like these layout values that tell you how you can traverse data. Which is kind of a tree. This isn't just a simple int that you're passing around. This is like a recursive hierarchical tree that you need at compile time. The critical thing is you get to write a thing that feels like one synthetic program with one...

understandable behavior, but then parts of it are actually going to execute at compile time so that the thing that you generate is in fact specialized for the particular platform that you're going to run it on. So one concern I have over this is it sounds like the configuration space of your programs is going to be

massive. And I feel like there are two directions where this seems potentially hard to do from an engineering perspective. One is, can you really create abstractions that within the context of the program hide the relevant complexity so it's possible for people to think in a modular way about the program they're building, so their brains don't explode with the 70 different kinds of hardware that they might be running it on? And then the other question is, how do you think about...

testing, right? Because there's just so many configurations. How do you know whether it's working in all the places? Because it sounds like it has an enormous amount of freedom to do different things, including wrong things in some cases. How do you deal with those two problems, both controlling the complexity of the abstractions and then having

a testing story that works out. Okay, Ron, I'm going to blow your mind. I know you're going to be resistant to this, but let me convince you that types are cool. Okay. I know you're going to fight me on this. Well, so this is, again, you go back to the challenges and opportunities working with either Python or C++. Python doesn't have types, really. I mean, it has some stuff, but it doesn't really have a type system. C++ has a type system, but it's just incredibly painful to work with.

And so what Mojo does is it says, again, it's not rocket science. We see it all around us. Let's bring in traits. Let's bring in a reasonable way to write code so that we can build abstractions that are domain-specific and they can be checked modularly.

And so one of the big problems with C++ is that you get error messages when you instantiate layers and layers and layers and layers of templates. And so if you get some magic number wrong, it explodes spectacularly in a way that you can't reason about. And so what Mojo does, it says,

Let's bring in traits that feel very much like protocols in Swift or traits in Rust or type classes in Haskell. Like, this isn't novel. This is like a mechanism for what's called ad hoc polymorphism, meaning I want to have some operation or function that has some meaning, but actually it's going to get...

implemented in different ways for different types. And these are basically all mechanisms for, given the thing that you're doing and the types involved, looking up the right implementation that's going to do the thing that you want.

Yeah, I mean, a very simple case is an iterator. So Mojo has an iterator trait and you can say, hey, well, what is an iterator over a collection? Well, you can either check to see if there's an element or you can get the value at the current element. And then as you

keep pulling things out of an iterator, it will eventually decide to stop. And so this concept can be applied to things like a linked list or an array or a dictionary or an unbounded sequence of packets coming off a network. And so you can write code that's generic across these different, call them, backends or models that implement this trait. And what the compiler will do for you is it will check to make sure when you're writing that generic code, you're not using something that won't work.

And so what that does is it means that you can check the generic code without having to instantiate it, which is good for compile time. It's good for user experience, because if you get something wrong as a programmer, that's important. It's good for reasoning about the modularity of these different subsystems.
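As a rough illustration of the iterator trait described above, here is a sketch using Python's `typing.Protocol`. The names `Cursor`, `has_next`, `next_value`, `ListCursor`, `LinkedCursor`, and `drain` are hypothetical and are not Mojo's standard library. One important caveat: Python itself does not check these annotations. The modular checking discussed here would come from an external type checker such as mypy, whereas in Mojo, Swift, or Rust the compiler does it when the generic code is defined.

```python
from typing import Generic, List, Optional, Protocol, TypeVar

T = TypeVar("T")
T_co = TypeVar("T_co", covariant=True)


class Cursor(Protocol[T_co]):
    """Hypothetical trait: the two operations an iterator must provide."""

    def has_next(self) -> bool: ...
    def next_value(self) -> T_co: ...


class ListCursor(Generic[T]):
    """Walks an array-like list by position."""

    def __init__(self, items: List[T]) -> None:
        self._items = items
        self._pos = 0

    def has_next(self) -> bool:
        return self._pos < len(self._items)

    def next_value(self) -> T:
        value = self._items[self._pos]
        self._pos += 1
        return value


class Node(Generic[T]):
    """One cell of a singly linked list."""

    def __init__(self, value: T, rest: "Optional[Node[T]]" = None) -> None:
        self.value = value
        self.rest = rest


class LinkedCursor(Generic[T]):
    """Walks a linked list by following `rest` pointers."""

    def __init__(self, head: Optional[Node[T]]) -> None:
        self._node = head

    def has_next(self) -> bool:
        return self._node is not None

    def next_value(self) -> T:
        node = self._node
        if node is None:
            raise IndexError("cursor is exhausted")
        self._node = node.rest
        return node.value


def drain(cursor: Cursor[T]) -> List[T]:
    # Generic code: it relies only on the two operations the trait promises,
    # so it works for any type that implements them.
    out: List[T] = []
    while cursor.has_next():
        out.append(cursor.next_value())
    return out
```

`drain` never names `ListCursor` or `LinkedCursor`, so a checker can validate its body once against `Cursor` instead of re-checking it for every concrete type it is used with.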

Mojo Packages and Portability

because now you have an interface that connects the two components. I think it's an underappreciated problem with the C++ templates approach to the world, where C++ templates, they seem like a deep language feature, but really they're just a code generation feature. They're like C macros.

That's right. It both means they're hard to think about and reason about because it sort of seems at first glance not to be so bad, this property that you don't really know when your template expands if it's actually going to compile.

But as you start composing things more deeply, it gets worse and worse because something somewhere is going to fail and it's just going to be hard to reason about and understand. Whereas when you have type-level notions of genericity that are guaranteed to compose correctly and won't just blow up, you just drive that error rate down. So that's one thing that's nice about getting past templates as a language feature. And then the other thing is it's just crushingly slow. You're generating the

code, almost exactly the same code over and over and over again. And so that just means you can't save any of the compilation work. You just have to redo the whole thing from scratch. That's exactly right. And so this is where, again, we were talking about the sand in the system.

Little things that if you get wrong, they play forward and they cause huge problems. The metaprogramming approach in Mojo is cool, both for usability and compile time and correctness. Coming back to your point about portability, it's also valuable for portability because what it means is that...

the compiler parses your code and it parses it generically and has no idea what the target is. And so when Mojo generates the first level of intermediate representation, the compiler representation for the code, it's not hard-coding in that pointers are 32-bit or 64-bit, or that you're on x86 or whatever. And what this means is that you can take generic code in Mojo and you can put it on a CPU and you can put it on a GPU. Same code, same function. And again, these

crazy compilery things that Chris gets obsessed about. It means that you can slice out the chunk of code that you want to put onto your GPU in a way that it looks like a distributed system. But it's a distributed system where the GPU is actually a crazy embedded device that wants this tiny snippet of code and it wants it fully self-contained. These worlds are things that normal programming languages haven't even thought about.

So does that mean when I compile a Mojo program, I get a shippable executable that contains within it another little compiler that can take the Mojo code and specialize it to get the actual machine code for the final destination that you need? Do I bundle together all the compilers for all the

possible platforms in every Mojo executable? The answer is no, the world's not ready for that. And there are use cases for JIT compilers and things like this. And that's cool. But the default way of building, if you just run mojo build, then it will give you just an a.out executable, a normal thing. But if you build a Mojo package, the Mojo package retains portability. This is a big difference.

This is what Java does. If you think about Java in a completely different way and for different reasons in a different ecosystem universe, it parses all of your source code without knowing what the target is, and it generates Java bytecode. And so it's not 1995 anymore.

The way we do this is completely different and we're not Java, obviously, and we have a type system that's very different. But this concept is something that's been well known as something that at least the world of compiled languages like Swift and C++ and Rust have kind of forgotten. So the Mojo package

is kind of shipped with the compiler technology required to specialize to the different domains. And so, again, by default, if you're a user, you're sitting on your laptop and you say, compile a Mojo program, you just want...

an executable. But the compiler technology has all these powerful features, and they can be used in different ways. And this is similar to LLVM, where LLVM had a just-in-time compiler, and that's really important if you're Sony Pictures and you're rendering shaders for some fancy movie, but that's

not what you'd want to use if you're just running a C++ code that needs to be ahead of time compiled. I mean, there's some echoes here also of the PTX story with NVIDIA. NVIDIA has this thing that they sort of hide that it's an intermediate representation, but this thing called PTX, which is a portable bytecode, essentially.

And they, for many years, maintained compatibility across many, many different generations of GPUs. They have a thing called the assembler that's part of the driver thing for loading on. And it's really not an assembler. It's like a real compiler that takes the PTX and compiles it down to SASS, the accelerator-specific machine code, which they very carefully do not fully document because they don't want to give away all of their secrets.

Programmer Control and Unlocking Compute

And so there's a built-in portability story there where it's meant to actually be portable in the future across new generations. Although, as you were pointing out before, it in fact doesn't always succeed. And there are now some programs that will not actually make the transition to Blackwell. So that's in the category that I'd consider to be like a virtual

machine, very low level virtual machine, by the way. And so when you're looking at these systems, the thing I'd ask is, what is the type system? And so if you look at PTX, because as you're saying, you're totally right, it's an abstraction between a whole bunch of source code on the top end and then the specific SASS hardware thing on the back end. But the type system...

isn't very interesting. It's pointers and registers and memory, right? And so Java, what is the type system? Well, Java achieves portability by making the type system in its bytecode expose objects. And so it's a much higher-level abstraction. Dynamic virtual dispatch, that's all part of the Java ecosystem. With Mojo, it's not a bytecode, but the representation that's portable maintains the full generic system.

And so this is what makes it possible to say, okay, well, I'm going to take this code, compile it once to a package, and now go specialize and instantiate this for a device. And so the way that works is a little bit different, but it enables, coming back to your original question of safety and correctness, all the checking to happen the right way.

Right. There's also a huge shift in control. With PTX, the machine-specific details of how it's compiled are totally out of the programmer's control. You can generate the best PTX you can, and then it's going to get compiled. How? Somehow. Don't ask too many questions. It's going to do what it's going to do.

Whereas here you're preserving in the portable object the programmer-driven instructions about how the specialization is going to work. You've just partially executed your compilation. You've got partway down, and then there's some more that's going to be done at the end when you pick actually where you're going to run it.
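The two-stage picture above (keep the generic code, then finish specializing once the target is known) can be loosely imitated in Python with a function that returns a function. This is only an analogy: Python has no compile-time stage, and `Target`, `vector_width`, `make_sum`, `TARGETS`, and `check_all_targets` are hypothetical names with made-up parameters, not anything from Mojo or real hardware. The last function also sketches one possible answer to the earlier testing question, by checking every specialization against a single reference result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Target:
    """Hypothetical hardware description; real targets carry far more detail."""
    name: str
    vector_width: int


def make_sum(target: Target) -> Callable[[List[float]], float]:
    # Stage 1 ("specialization"): decisions that depend only on the target
    # are made once, here, before any data is seen.
    width = target.vector_width

    def specialized_sum(values: List[float]) -> float:
        # Stage 2 ("run time"): the chunk width is already fixed.
        total = 0.0
        for start in range(0, len(values), width):
            total += sum(values[start:start + width])
        return total

    return specialized_sum


# Assumed example targets, not real device parameters.
TARGETS = [Target("cpu", 4), Target("gpu", 32)]


def check_all_targets(values: List[float]) -> Dict[str, bool]:
    # Compare every specialization against one reference implementation.
    expected = sum(values)
    return {t.name: abs(make_sum(t)(values) - expected) < 1e-9 for t in TARGETS}
```

The programmer-written logic in `make_sum` decides how the target affects the generated function, which is the contrast with PTX drawn above, where that decision belongs to the driver.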

Exactly. And so these are all very nerdy pieces that go into the stack. But the thing that I like is if you bubble out of that, it's easy to use. It works. It gives good error messages, right? I don't understand the Greek letters, but I do understand a lot of the engineering that goes into this. The way this technology stack builds up, the whole purpose is to unlock compute. And we want new programmers to be able to get into the system, and if they know Python, if they understand

some of the basics of the hardware, they can be effective. And then they don't get limited to 80% of the performance. They can keep driving and keep growing in sophistication. And maybe not everybody wants to do that. They can stop at 80%. But if you do want to go all the way, then you can get there.

Maintaining Simplicity and Power

So one thing I'm curious about is how do you actually manage to keep it simple? You said that Mojo is meant to be Pythonic and you talked a bunch about the syntax. But actually one of the nice things about Python is it's simple in some ways in a deeper sense. The fact that there isn't by default a complicated type system.

with complicated type errors to think about. There's a lot of problems with that, but it's also a real source of simplicity for users who are trying to learn the system. Dynamic errors at runtime are in some ways easier to understand. I wrote a program and it tried to do a thing and it tripped over this particular thing and you can see

it tripping over, and in some ways that's easier to understand. When you're going to a language which, for both safety and performance reasons, needs much more precise type-level control, how do you do that in a way that still feels Pythonic in terms of the base simplicity that you're exposing to users?

I can't give you the perfect answer, but I can tell you my current thoughts. So again, learn from history. Swift had a lot of really cool features, but it spiraled and got a lot of complexity that got layered in over time. And also one of the challenges was that it had a team that was paid to add features to Swift. It's never a good thing. Well, you have a C++ committee. What is the C++ committee going to do? They're going to keep adding features to C++. Don't expect C++ to get smaller.

It's common sense. And so with Mojo, there's a couple of different things. One of which is, start from Python. Python being the surface-level syntax enables me, as management, to be able to push back and say, look, let's make sure we're implementing the full power of the Python ecosystem. Let's have lists and for comprehensions and like all this stuff before just inventing random stuff because it might be useful.

But there's also, for me personally, a significant back pressure on complexity. How can we factor these things? How can we get, for example, the metaprogramming system to subsume a lot of complexity that would otherwise exist? And there are fundamental things I want us to add, for example, checked generics, things like this,

because they have a better UX. They're part of the metaprogramming system. They're part of the core addition that we're adding. But I don't want Mojo to turn into an "add every language feature that every other language has just because it's useful." I was actually inspired by and learned a lot from Go.

And it's a language that people are probably surprised to hear me talk about. Go, I think, did a really good job of intentionally constraining the language with Go 1. And they took a lot of heat for that. They didn't add a generic system. And everybody, myself included, were like, ha ha ha, why doesn't this language even have a generic system? You're not even a modern language. But they held the line. They understood how far...

people could get. And then they did a really good job of adding generics to Go, too. And I thought they did a great job. There was a recent blog post I was reading talking about Go, and apparently they have an 80-20 rule. And they say they want to have 80% of the features with 20% of the complexity, something like that. And the observation is that that's a point in the space that annoys

everybody, because everybody wants 81% of the features, but 81% of the features maybe gives you 35% of the complexity. And so figuring out where to draw that line and figuring out where to say no. For example, we have people in the community that are asking for very reasonable things that exist in Rust. And Rust is a wonderful language. I love it. There's a lot of great ideas.

Pull shamelessly good ideas from everywhere, but I don't want the complexity. I often like to say that one of the most critical things about a language design is maintaining the power-to-weight ratio. You want to get an enormous amount of good...

functionality and power and good user experience while minimizing that complexity. I think it is a very challenging thing to manage. And I think it's actually a thing that we are seeing a lot as well. We are also doing a lot to extend OCaml in all sorts of ways, pulling from all sorts of languages, including Rust.

And again, doing it in a way where the language maintains its basic character and maintains its simplicity is a real challenge. And it's kind of hard to know if you're hitting the actual right point on that. And it's easier to do in a world where you can take things back.

try things out and decide that maybe they don't work and then adjust your behavior. And we're trying to iterate a lot in that mode, which is the thing you can do under certain circumstances. It gets harder as you have a big open source language that lots of people are using. That's a really great point. And so one of the other lessons I've learned with

Backwards Compatibility and AI Tooling

Swift is that with Swift, I pushed very early to have an open design process where anybody could come in, write a proposal, and then it would be evaluated by the language committee. And then if it was good, it would be implemented and put into Swift. Again, be careful what you wish for. That enabled a lot of people with

good ideas to add a bunch of features to Swift. And so with Mojo, as a counterbalance, I really want the core team to be small. I want the core team not just to be able to add a whole bunch of stuff because it might be useful someday, but to be really deliberate about how we add

things, how we evolve things. How are you thinking about maintaining backwards compatibility guarantees as you evolve it forward? We're actively debating and discussing what Mojo 1.0 looks like. I'm not going to give you a time frame, but it will hopefully not be...

very far away. And what I am fond of is this notion of semantic versioning. And so saying we're going to have a 1.0 and then we're going to have a 2.0 and we're going to have a 3.0 and we're going to have 4.0, etc. And each of these will be able to be incompatible, but they can link together. And so one of the big challenges and a lot of the damage in the Python ecosystem was from the Python 2 to 3 conversion. It took 15 years and it was a heroic

mess for many different reasons. The reason it took so long is because you have to convert the entire package ecosystem before you can be 3.0. And so if you contrast that to something like C++, let me say good things about C++. They got the ABI right. And so once the ABI was set, then you could have one package built in C++98 and one package built in C++23, and these things would interoperate and be compatible even if

you took new keywords or other things in the future language version. And so what I see from Mojo is much more similar to the, maybe the C++ ecosystem or something like this, but that allows us to be a little bit more aggressive in terms of migrating code and in terms of fixing bugs and moving language forward. But I want to make sure that...

Mojo 2.0 and Mojo 1.0 packages work together and that there's good tooling, probably AI driven, but good tooling to move from 1.0 to 2.0 and be able to manage the ecosystem that way. I think the type system also helps an enormous amount. I think that's one of the reasons the Python migration was so hard.

You couldn't just say, "let me try and build this with Python 3 and see what's broken." You could only see what's broken by actually walking all of the execution paths of your program. And if you didn't have enough testing, that would be very hard. And even if you did, it wasn't that easy. Whereas with a strong type system, you can get an enormous amount of very precise guidance. And actually the combination of a strong type system and an agentic coding system is awesome.
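The versioning idea discussed above separates two questions: whether source code carries over between releases, and whether separately built packages can still link. A minimal sketch of that distinction follows. It assumes ordinary semantic versioning rules and a hypothetical `abi` revision number per package; none of this reflects an announced Mojo packaging scheme, and `Version`, `Package`, `parse`, `source_compatible`, and `can_link` are illustrative names.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


class Package(NamedTuple):
    name: str
    language: Version
    abi: int  # hypothetical binary-interface revision


def parse(text: str) -> Version:
    # Accepts "MAJOR.MINOR" or "MAJOR.MINOR.PATCH"; a missing patch is 0.
    parts = [int(p) for p in text.split(".")]
    if len(parts) == 2:
        parts.append(0)
    if len(parts) != 3 or any(p < 0 for p in parts):
        raise ValueError(f"not a version: {text!r}")
    return Version(*parts)


def source_compatible(written_for: Version, compiled_with: Version) -> bool:
    # Semantic versioning: code keeps working on newer releases within the
    # same major version; a major bump is allowed to break it.
    return written_for.major == compiled_with.major and compiled_with >= written_for


def can_link(a: Package, b: Package) -> bool:
    # Assumption for illustration: linking depends only on the binary
    # interface, so packages from different major versions can interoperate.
    return a.abi == b.abi
```

Under these assumptions a 1.0 package and a 2.0 package fail `source_compatible` yet still pass `can_link`, which is the situation the Python 2 to 3 migration lacked.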

We actually have a bunch of experience of just trying these things out now where you make some small change to the type of something. And then you're like, hey, AI system, please run down all the type errors, fix them all. And it does surprisingly well. I absolutely agree. There's other components to it. So Rust has done a very good job with the stabilization

approach with crates and APIs. And so I think that's a really good thing. And so I think we'll take good ideas from many of these different ecosystems and hopefully do something that works well, and works well for the ecosystem, and allows us to scale without being completely constrained by never being able to fix something

The Future Evolution of Mojo

once you ship 1.0. I'm actually curious just to go to the agentic programming thing for a second, which is having AI agents that write good kernels is actually pretty hard. And I'm curious what your experience is of how things work with Mojo. Mojo is obviously not a language deeply embedded in the training set that...

these models were built on. But on the other hand, you have this very strong type structure that can guide the process of the AI agent trying to write and modify code. I'm curious how that pans out in practice as you try and use these tools. You know, so this is why Mojo being open source. And so we have hundreds of thousands of lines of Mojo code that are public with all these GPU kernels, like all this other cool stuff. And we have a community of people writing more code.

Having hundreds of thousands of lines of Mojo code is fantastic. You can point your coding tool, Cursor or whatever it is, at that repo and say, go learn about this repo and index it. So it's not that you have to train the model to know the language; just having access to it enables it to do good work. And these tools are phenomenal.

And so that's been very, very, very important. And so we have instructions on our webpage for how to set up these tools. And there's a huge difference if you set it up right so that you can index that or if you don't. And make sure to follow that markdown file that explains how to set up the AI product tool.

So I want to talk a little bit about the future of Mojo. I think that the current way that Modular and you have been talking about Mojo, these days at least, is as a replacement for CUDA, an alternate full top-to-bottom stack for building GPU kernels, for writing programs that execute on GPUs. But that's not the only way you've ever talked about Mojo. You've also,

especially earlier on, I think there was more discussion of Mojo as an extension and maybe evolution of, and maybe eventually replacement of, Python. And I'm curious, how do you think about that now? To what degree do you think of Mojo as its own new language that takes inspiration and syntax from Python?

And to what degree do you want something that's more deeply integrated over time? So today, to pull it back to what is Mojo useful for today and how do we explain it? Mojo is useful if you want code to go fast. If you have code on a CPU or a GPU and you want it to go fast, Mojo is a great thing.

One of the really cool things that is available now, but it's in preview and it will solidify in the next month or something, is it's also the best way to extend Python. And so if you have a large-scale Python code base, again, tell me if this sounds familiar. You are coding away and you're doing cool stuff in Python. And then it starts to get slow. Typically what people do is they have to either go rewrite the whole thing in Rust or C++ or they...

carve out some chunk of it and move some chunk of that package to C++ or Rust. This is what NumPy or PyTorch or like all modern large-scale Python code bases end up doing. If you look up on the mirrors and look at the percentage of programs that have C extensions in them, it's shockingly high. A really large fraction of Python stuff is actually part Python and part some other language, almost always C and C++.

A little bit of Rust. That's right. And so today, this isn't distant future, today you can take your Python package and you can create a Mojo file and you can say, OK, well, these for loops are slow. Move it over to Mojo. We have people, for example, doing bioinformatics and other crazy stuff I know nothing about saying, OK, well, I'm just taking my Python code. I move it over to Mojo.

Wow, now I get types, I get these benefits. But there's no bindings. The pip experience is beautiful. It's super simple. You don't have to have FFIs and nanobind, like all this complexity, to be able to do this. You also are not moving from

Python with its syntax to curly braces and borrow checkers and other craziness. You now get a very simple and seamless way to extend your Python package. And we have people that say, okay, well, I did that and I got it 10x and 100x and 1000x faster on CPU. But then because it was

easy. I just put it on a GPU. And so to me, this is amazing because these are people that didn't even think and would never have gotten on a GPU if they switched to Rust or something like that. Again, the way I explain it is Mojo is

good for performance. It's good if you want to go fast on a GPU, on a CPU, if you want to make Python go fast, or if you want to, I mean, some people are crazy enough to go whole hog and just write entirely from scratch Mojo programs. And that's super cool. If you fast forward six, nine months, something, I think that Mojo will be a

very credible top to bottom replacement for Rust. And so we need a few more extensions to the generic system. And there's a few things I want to bake out a little bit. Some of the dynamic features that Rust has for the existentials, the ability to make a runtime trait is missing in Mojo. And so we'll add a few of those kinds of features. And as we do that, I think that'll be really interesting as an applications level programming language for people to care about.

This kind of stuff. You fast forward. I'm not even going to project a time frame, maybe a year, 18 months from now. It depends on how we prioritize things and we'll add classes. And so as we add classes, suddenly it will look and feel to a Python programmer much more familiar. And so the classes in Mojo will be intentionally designed to be very similar.

And at that point, we'll have something that looks and feels kind of like a Python 4. It's very much cut from the same mold as Python. It integrates really well from Python. It's really easy to extend Python. And so it's very much a member of the Python family, but it's not compatible with Python.

What we'll do over the course of n years, and I can't predict exactly how long that is, is continue to run down the line of, okay, well, how much compatibility do we want to add to this thing? And then I think that at some point people will consider it to be a Python superset and effectively it will feel just like the best way to

do Python in general. And I think that that will come in time. But to bring it all the way back, I want us to be very focused on what is Mojo useful for today. And so great claims require great proof. We have no proof that we can do this. I have a vision and a future in my brain, and I've built a few languages and some scale things before, and so

I have quite high confidence that we can do this, but I want people to zero back into, okay, if you're writing performance code, if you're writing GPU kernels or AI, if you have Python code you want to go fast, a few of us have that problem, then Mojo can be very useful. And hopefully it'll be even more useful to more people in the future. Right, and I think that already the practical short-term thing is already plenty ambitious and exciting on its own.

Seems like a great thing to focus on. Yeah, let's solve a heterogeneous compute in AI. That's actually a pretty useful thing, right? All right, that seems like a great place to stop. Thank you so much for joining me. Yeah. Well, thank you for having me. I love nerding out with you and I hope it's useful and interesting to other people too. But even if not, I had a lot of fun with you. You'll find a complete transcript of the episode along with show notes and links at signalsandthreads.com.

Thanks for joining us. See you next time.

This transcript was generated by Metacast using AI and may contain inaccuracies.