Will Computers prove theorems? - podcast episode cover

Will Computers prove theorems?

May 15, 202546 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Kevin Buzzard: Will Computers prove theorems? Will computers one day replace human mathematicians? Is this just around the corner, or decades away? Can neural networks spot patterns which humans have missed? Currently language models are great for brainstorming big ideas but are very poor when it comes to details. Can integrating a language model with a theorem prover like Lean solve these problems? Is the modern mathematical literature riddled with errors, and is it feasible to hope that a machine might find and even fix them? Is it possible to teach a computer the proof of Fermat's Last Theorem? And what do mathematicians make of all this? I'll talk about how modern developments in AI and theorem provers are beginning to affect mathematics.

Transcript

[Auto-generated transcript. Edits may have been applied for clarity.] So if you haven't. Okay. Um, so now it's my great pleasure to introduce Kevin Hart. So Kevin did his PhD in Cambridge in 1995 and algebraic number theory, and he had, uh, positions at Princeton and Berkeley. Uh, but since 2004, he has been a professor of pure maths at Imperial. He's had some very distinguished prizes. Um, Kevin got the white House Prize, BLM Sydney 2002, for distinguished work in number theory.

And in 2008 he got the senior price of the LMS, which is for a really outstanding piece of mathematical work. In 2017, he kind of turned his attention partly from number theory to formalisation at a midlife crisis. Um, for this work, actually, he's given a plenary lecture at the international, um, Congress of Mathematicians in, uh, 2022. And in 2024, he started this kind of super exciting project to formalise the proof of Fermat's Last Theorem, which, um, was anticipated here.

And we're now going to hear about. So, um, he's going to be telling us about whether computers are going to prove theorems. Great to have you here. Thank you. So. So many thanks for the, uh, for the invitation. Uh, and, yeah, I'll do the same joke as Leo. Uh, it's an absolute honour to be giving these lectures, especially, uh, with this, this guy who basically has changed my life, uh, not least by writing, you know, forget Nintendo. Uh, uh, my my favourite computer game. This is, uh.

This is summer. I've always loved computer games, but the fact that I can do, you know, sort of interesting algebraic number theory in this completely novel way, uh, is something that still blows me away. Uh, so I kind of want to think big. So, you know, the questions after I finished. You know the questions maybe you want to consider, uh, you know, what conformal methods do for me? Your your. Yeah. Many of you probably experts in something.

Uh, well, what can formal methods do in that thing that you're an expert in? Uh, and then more generally, you know, what can formal methods do for mathematics? Because that's not really formal. Methods have existed for decades, but that is not something that mathematicians were asking themselves, uh, ten years ago. So these are, I think, two interesting questions. Uh, so that was the title I had to give a title. So I wrote Will Computers Prove Theorems? And I mentioned this on social media.

And immediately some wag says that, you know, any any time you see a title with a question mark, the answer is always no. Because if if it wasn't no, you wouldn't put the question mark. Uh, so in fact, the answer isn't no. The answer is yes. Like, already computers are proving theorems. Uh, and so yes, you get this example, this thing that Leo already mentioned, uh, a computer program autonomously solved four out of the six problems on the 2024 International Maths Olympiad.

Uh, and and of those four, uh, three of them, uh, it gave formal lane solutions. So, you know, uh, formally verified, uh, solutions to three of these questions. And the other one was the Euclidean geometry question, uh, where the system didn't use Lane, it just wrote down, uh, an easily human readable and human checkable proof. Um, and so of course, if you're a research mathematician, you would say, well, these aren't theorems, right? These are, you know, high school, you know, level problems.

Uh, they're not really what a mathematician would call it. You know, maybe a push their lemmas, right? They're not really theorems. But of course, if you think that it's going to stop there, then I think you're very naive. Uh, and so really, what I'm talking about, uh, is how we can, you know, how mathematicians and computer scientists can collaborate to take these things further?

Because this is one thing that really, really shocked me, uh, when I moved into this area, was that even though, you know, my background and Leo's background couldn't be more different, I can't write a line of computer code. Really. I know very little about, uh, programming languages, and I'm not expecting Leo to know anything about, uh, you know, advanced mathematics. And yet mathematicians and computer scientists are finding that they have a common language.

You know, we we we stuck with the maths problem. We run into something. We can't get the system to do it, but we can ask questions which computer scientists can understand and they can give us answers. Well, I find this quite, quite striking, uh, and that we can actually communicate with each other, uh, and, and, uh, and make progress, even though our backgrounds are very different.

Uh, this is something that, you know, constantly the fact that I can talk to computer scientists and they can solve my problems, uh, even though they don't understand the maths behind it, I understand. Yeah, it's a straight is, you know, it's a strange synergy and one that I was very surprised by. So yeah, we've got this. Will computers proof theory. It's a very valid question. Let's ask a more precise question.

And you know, the kind of question I'm interested in is how can you technology, uh, your theorem prove as early as talked about. But language models you can't really ignore in 2025. Uh, can these things actually give mathematicians new general purpose tools? Mathematics is a very conservative area.

Uh, it's really striking when you look at the kind of things that we're teaching the undergraduates in 2025, the courses which are offering final year undergraduates in 2025, they are essentially exactly the same as the courses that I was being offered, uh, when I was an undergraduate, you know, 30 plus years ago. Uh, so mathematicians are very slow to change and adapt, and yet there's all this stuff happening.

Uh, and the question is, can we actually, you know, can we actually use it to make new tools to use to mathematicians? You know, can we accelerate mathematics for me? Yeah. For me. You know, I used to think that prove new theorems was the important thing. But I think that this some this question really dwarfs that, you know, that that traditional idea of let's accelerate maths by proving new theorems. I think there might be other ways of doing things. So what kind of tools am I thinking of?

What kind of yeah, how are we going to change mathematics? How are we going to, uh, how are we going to disrupt mathematics? I like the idea of disrupting, uh, so yeah. Well, yeah, he's a tool. But wouldn't it be great if we had a tool where you just press a button and we can solve, you know, all the great unsolved problems in mathematics? That'd be good. That'd be a useful. That's accelerating mathematics. So the big problem with that tool is that tool is just, you know, doesn't exist.

Uh, and it's not just that it doesn't exist. I can't see a path to that tool yet. Right. It's it's not just that it doesn't exist, but given what we know. So we don't really see how to make that tool exist. There's not really even a route to making that tool exist. That's a shame. Uh, so what about. What about another tool? Wouldn't it be great? Now? Is this. This is some mathematician's nightmare, right? Is this a dream, or is this a nightmare? Wouldn't it be great?

Wouldn't it be great if we had a tool? You press a button and you got 10 million lines of incomprehensible code that you said. Let F1 be this function, F2 be this function. Let F3 be this. You know, theorem 398 is that this, this, you know, function 28 and function 374 are related in this way. And it just goes on and on and on and on and on. You have no idea what's going on. And then at the end, it's the Riemann hypothesis. And you run it through Lane.

And Lane says, yes, this is correct. You know that. You know, that's terrifying for some. Because my what's nice about is it about proving all the theorems or is it about getting understanding, you know, different mathematicians have different opinions about that. You know, do we just want to get the job done or do we want to understand what's happening? So again, that's science fiction, right? There's no route to that yet.

And so, you know, this that's this is you know, I can't see you know, there's lots of new tools and it's very exciting. But these lofty goals here are kind of out of reach at the minute, you know. But of course, five years ago getting a point on the IMO was also considered completely out, right? Five years ago there were no language models. Five years ago we had Lane.

We could do Lane. Uh, yeah we could, we could start proving, you know, research level theorems, but we couldn't get a point on the IMO, on the International Maths Olympiad. So things do move and they're moving pretty fast. Uh, so we don't really know where we'll be in five years time. Uh, but, you know, I don't want to just speculate. Let's talk about what's not science fiction now. So where are we? What about here we go. What about a tool that can find interesting patterns?

What about a tool that can spot patterns in data that humans can't spot? Uh, that would be. That would be an interesting tool. Uh, but actually we have that tool, right? That's exactly what the neural network does, right? We we learn a high school. We like given a 2D plot, given a, you know, given X and y variables, we can plot some dots, and then we learn how to draw the best fit line. Right. And I watched the neural network. Neural network. It can find best fit shapes. There are lines.

It can find best fit shapes and extremely high dimensional data. That's what the neural network does. Uh, and so that's great. So we can give it, you know, mathematical tables that we have already. And we can say, you know, can you spot stuff that we can't see. So there's a theorem. Uh, I just thought I mentioned this theorem because, uh, you have some that can be these are low dimensional topologies here in Oxford.

Uh, and they collaborated with computer scientists at Google DeepMind, and they proved a theorem about hyperbolic knots. Uh, and the way this theorem was discovered, uh, was that they generated a big table of, uh, you know, invariants of knots. Uh, and then they asked the neural network to see if it could find any patterns in this table. And, uh, the neural networks spotted that this Siggraph case, this is the signature of the knot.

The neural network could accurately predict the signature of a not given other invariants. And this was surprising because the other invariants have a slope, the volume in the activity radius. These are. These are embeddings coming from the three manifolds. You know topology, the three manifolds associated to them. These are these if you like. These are continuous invariance. These are real numbers whereas the signature is an integer.

So this is some some theorem relating the discrete and the continuous. So it must you know you know it must be profound. Uh, but the fact that these things were the fact that these things were related at all, this is what the neural network told them. And once the neural network said there is some pattern, then the mathematicians took over and found the pattern and proved the theorem. So the machine didn't prove the theorem, but the machine told them where to look for the theorem.

So that's kind of a surprising application. Uh, but it's really old. It's from 2021, so this ain't nothing else. And all the other things I cite in this talk will all be from 2024 or 2025. Um, because the air is just moving very, very fast. Uh, so really, I want to. Jordan Ellenberg, uh, is thinking a lot about this, uh, this kind of thing. How do we use neural networks? These things can find patterns in data.

How can we use neural networks, uh, to, you know, find new interesting structures, examples, counterexamples, this kind of thing. So again, uh, this pattern boost paper. Ellenberg is a mathematician, prestigious mathematician from Wisconsin. This is Jodie Williamson, another very, uh, FRS and a very famous mathematician. Adam Vargas, a young researcher. And uh, funds were shot on the works of meta.

So again, very surprising collaboration. So, you know, five years ago, you might not have expected, uh, and they, they made a very simple architecture, uh, that can search for examples of counterexamples. Uh, so in sort of finite combinatorial objects, uh, and what they're doing, basically, they write a naive human algorithm to try and solve a problem, to try and find extreme examples.

And then they write a and then they, you know, use neural networks to say, you know, they generate lots of extreme examples, uh, and then they encode them as strings. And then they say to a neural network, here's a list of strings that we really like. And now could you give us more examples of strings that we're going to like. And then the neural network say sure, I'll do that.

You don't explain the question. You you encode your finite graph, which is extremely in some way, you encode your finite graph as a string. You give it the neural network. Say, I love these strings. Give me more strings is a show. I'll give you strings. And then you try and decode the strings back to graphs. And half of them just don't decode that, you know, they garbage. And then the other half do decode into graphs.

And then you look at these graphs and they're not a great example, but they're not extreme. But you get a you can keep adding edge, you can do more. You know, the computer and the human are just thinking about the question in really different ways. And they find, you know, they they can find counterexamples to, uh, to theorems. This way is very weird. Some extremely simple to read paper. Uh, I quite like this work of Ellenberg. You should look at recent work of Jordan.

He's also collaborated with the physicists, uh, from Oxford. Uh, getting getting machines to write Python programs, which are solving maths questions. It's a very surprising, very surprising and innovative new use of technology. Uh, but the big problem with using neural networks to do this kind of thing is that they have limited application.

You know, you can't if you're if you're interested in an area of mathematics where there are lots of tables or where, you know, the fundamental objects can be described by finite amounts of data. And the interesting problems is finding, you know, finding new data, then that's great.

But if you, you know, if you are doing research in Sobolev spaces or, you know, whatever Banach manifolds, so, you know, infinite, intrinsically infinite dimensional objects where there aren't any tables, um, then you just might not be able to use these tools at all. So we've got limited applicability. And that's just to say we've seen that happen before, right. And then the 1960s and the 1970s computers became available to mathematicians.

And they were just very, very fast fancy new calculators. And all of a sudden we could do you can multiply two, 100 digit numbers together really quickly, right. And if you're doing research into, into Banach manifolds is no used to at all. But if you're doing research in sort of, you know, computational number theory, then suddenly you've got a great new tool. So we've seen, you know, computers come along and they might affect some bits of maths.

They might make some, some, some areas that suddenly, you know, an exciting new thing. You know, I mean, I'm interested in, uh. I'm interested in general purpose tools. Yeah. This is this is this is the kind of question. Can we can we really give mathematic. You know, give mathematicians a general purpose tool.

Uh, so now we have to talk about, you know, I'll talk about there in previous in a bit, but we've got to talk about language models, uh, because you know, that any, any mathematician who's use them will know that they right now they are not perfect by any means, uh, for doing mathematics. So I idea, uh, the mathematicians in there, are there any mathematicians in the room can actually use the language model and it's told them something useful. Are there any can anyone put their hand up to. Yeah.

Go. So some you see. There you go. Great. There's like ten hands. You see. But you know, the AI people are really, really excited about language models, but that they're in their current form. Uh, they're not perfect. Uh, so actually, I, I, I used ChatGPT last week. Uh, uh, because I so I as it's been, as it's been, uh, mentioned several times. Uh, I'm currently involved in some strange project related to Fermat's Last Theorem.

Uh, and I dare I, I if I click this, then all manner of things that happen. Right? I'm gonna try clicking this. Uh. Let me. Was. So anyway, as I was saying. There's some website on the internet that contains a whole bunch of maths, uh, involving some kind of, uh, that. It's a very beautiful website. Looks really nice, but rendered in rendered in beautiful latex on the web.

Uh, but I'm just in the middle of writing, uh, a very long, you know, explanation of a lot of mathematics behind Fermat's Last Theorem. And I needed the fact that if a was a dedicated domain with field of fractions K, and if L was a finite separable extension of k, and B was the integral closure of A in L, uh, then I claim that for an arbitrary k module uh, string over k with L is the same as ten string over I would be. So there's some random statement in graduate level algebra.

Uh, and that's the kind of thing that I thought that I should be able to prove. Uh, but the problem is, I just got a bit stuck because it ring advantages of a number filled ring of integers with finite extension. The new thing might not be free over. The whole thing is going to be projective, but it might not be free. Locally free, but maybe not free. So there was an at that. [INAUDIBLE], there was some technical issue, so I thought I should be able to do this right.

And I spent like two minutes on it and I was just like, this is stupid. It's like, this is graduate level algebra, right? This, this, you know, okay, so I, uh, I. So well about 20 years ago. Uh, I would have just, like, knocked on another number. Theory stories that had you do this. Uh, ten years ago. There's a new option I could ask for. Math overflow. Right? Or math dot stack exchange.

Two years ago, I could just kind of cheat and ask on the lean chat, uh, because there's a bunch of mathematicians hanging out there. Uh, uh, and it's likely that somebody will give me an answer quickly. But now, of course, you can actually ask ChatGPT for I don't pay for these things, right? Because I'm super sceptical that they're useful. But you just go to chatgpt.com and I type in my graduate level question. It gives you a list of algebra and the, uh, the answer is illuminating, right.

Because I asked this explicit question about ten tearing up finite extensions of, uh, the rings of integers, of number fields. And the first thing it does is it says, oh, suffices to prove this simpler problem. And that was something I'd spotted when I was thinking about it for two minutes, and I thought, this looks like progress. And then I decided, actually, is it progress? It seemed the reduction seems just as difficult as the original question.

So I tried that, and then I gave up on that. And then I asked ChatGPT, and the first thing ChatGPT does is finds the same reduction. Uh, and then actually, you know, it gives the idea which I'd missed. It gave the idea about, uh, how to, you know, how to prove the reduction. Uh, and then it justified its argument with rubbish. This is this, you know, is very, very wishy washy on exactly how this proof worked.

But the thing is, it gave me enough clues, uh, that I could actually put everything together. And so. So there we go. So the language model was good in the sense that it pushed me in the right direction, but it wasn't perfect because when it actually when it came to the details, it was, it was, it was, um. Not at all. Yeah. As I say, this is not an inaccurate summary, right. The details were vague or meaningless. Uh.

And why is that happening? It's because, of course, they know all the facts on the internet, right? That's, you know, that's how they're trained on the internet. But the thing is, they're not very good at logical reasoning, right? They're not very good at actually. They don't have understanding in the same way that we have understanding. Right. Can these things think it's just not even a meaningful question.

It's like asking if a submarine can swim. You said humans think these things are not thinking they're doing something else right? As Ken, I know this is a quote from Ken Ono, right? That is not cognition. They're doing recognition, right. I they've seen lots and lots of things before. And so they kind of think, oh yeah, I've seen this thing before. That's to that's what humans do in this situation. I'll do that.

Uh, and actually it's interesting, like what a mathematician actually doing this is a lot of what mathematicians are we just doing pattern matching. Occasionally we have ideas, but how often are we having ideas and how often are we just, you know, just going through the motions and. Yeah. Oh, I know, you know, just applying stuff that. Yeah, we just.

It's not immediately clear how many times we're having brilliant new ideas, and how many times we're just applying facts and techniques that we just know work because we've seen them in other situations. So we're we're maybe doing similar things to the machines. But the thing is, we can back it up with this logical reason. We take an idea from a related area, but then we could rigorously prove that that idea works in our situation, right?

The machine takes an idea from another area, attempts to apply it, and then and then, you know, it's really fluff when you ask them, does it actually work? Hey, this is the thing they never give. They never say, I don't know. The common failure mode if you ask a human a question. Yeah. A common failure mode is I don't know. Whereas you ask a machine a question, the common failure mode is [INAUDIBLE]. This is the if they they would rather make something up and say they don't know.

So this is the big question, right? If we want to use these things, how do we teach language models to reason accurately? And uh, and the way, the way that we're going to do that compared to scientists, they like, you know, they like challenges. They want they want a challenge, and then they want to be state of the art in that challenge. Right. And if you can get state of the art, you can get a paper.

That's how computer science seems to work. You you you you make some challenge and then you say and compute. Yeah, I can try this challenge. And at the end they get a grade. Yeah, they get a grade. They, they try this challenge and they get 15%. And if that's better than any other AI then the then they've got state of the art. And if you get state of the art then what you win is a paper. That seems to be how computer science works as far as I'm concerned.

Anyway. So how do we make challenges? The way you make a math challenge is a data set, right? So a quest, you know, a big collection of mathematical questions. That's what a challenge looks like. Uh, and the idea is you get the machine to answer the questions correctly, you get a point for each word you get.

Right. And, uh, and so now the question is what a mathematical data sets look like is that when you're a mathematician, you, you know, these things are being generated by computer scientists. And as mathematicians, we tend not to look at them at all. So what a mathematical data sets look like. Uh, they come in several forms. Uh, they're mathematical questions. But what level? Right. They can be. They can be sort of GCSE level. Right. The way up to research level. That's what it.

Yeah, that's the range of, uh, it's, uh, the range of the mathematics is really, really extreme. Uh, they might be a natural language, or they might be in a formal language, like languages. There are other formal languages as well. Uh, so. But of course, Leon seems to be winning. There's. And in many cases, uh, the formal language chosen is Latin for mathematics at least, uh, the problems might have shorter.

This is a big difference. The answers to the questions my either be expressible in about eight characters. Yeah, they might be a multiple choice question. Uh, or they might be. What is this number? Uh, or they might have really long answers, you know, prove this theorem. And the answer is like two pages long. So you get different data sets. Uh, you know, ranging. Yeah. The kind of answers you get can be very different.

And finally, they can be either public or private. Uh, that the private ones, the way it works is you have 100 secret questions, and they say we've got 100 secret questions, and here's here's five of them. Yeah, we'll make five of them public. And the other 95 is secret. Uh, so you get some kind of feeling as to what the data set looks like. Uh, but you can't see. Yeah, you can't see the question if you want to have a go.

Uh, you have to send your I. Yeah. To the company or the organisation that's holding the secret questions. And they run, you know, they run it, and then they give you the reports about how many you got, right? So let me just give you a couple of examples of data sets. This is the artificial intelligence math Olympiad. Uh, this is again this is Jurco. Uh, he was mentioned in the last, uh, you know, he's a guy who's funding more and more mathematics. Uh, the, uh, you know, X markets.

So there's a, uh, he's really pushing the idea of open source AI. He's very worried. The, you know, the the sort of closed source. Yeah. I was talking about wouldn't it be great if we had a general tool that help mathematicians? But one thing that perhaps would be a bit less great if that tool was owned by a private company. And I was like, just for $1,000 a month, you can have that. Your university can have that tool. Uh, yeah.

He would like he wants to see open source tools. Uh, so that I should say I mentioned this first item on the advisory board. Uh, for this for this project. So right now we've had two progress prizes. So we've had two competitions. Uh, and both of the competitions have involved private data sets. So secret questions. Uh, and the questions are written in English. Um, and there are an error that so challenging high school type questions.

Uh, not as hard as International Math Olympiad, but yeah. So there are other Olympiads. An easier level, right? Uh, and the key thing is the answers are short. Answers are always whole numbers between 0 and 9, nine, nine. And this is this is the kind of question that you see. So it's not the kind of thing you can do immediately. Uh, so each of the three digit numbers from 111 to 999, you colour it blue or yellow and there's some rule.

And then the question is what's the maximum number of yellow. So it's so sufficiently big you can't solve it by brute force. You can't do all two to the 889 possibilities. You have to you know, those a sort of a combination. There's some calculation involved and some reasoning involved as well. Uh, and we're at the stage now where you can type that kind of question into a language model, and the language model can get it right. Uh. So, as I say, the questions are kept.

Secrets. And if you want to play, then you have to give us your model. So you can't, uh. Yeah. We won't send the questions to, uh, to the to the private, to the private models. You have to upload your models of hugging. Yeah, it's a Kaggle and then Kaggle. Run it and tell you what you got out of 50. Uh, and there's prize money, right? The models compete to win six figure prizes. Uh, and there's been two competitions so far.

And the second one just finished, uh, and top, top of the leaderboard, uh, was a, you know, a gang of 3 or 4 people. And they scored 34 out of 50. So, so for high school Olympia, there's state of the art for these kind of high school Olympiad questions as far as open source models are concerned. Uh, if you asked OpenAI AI to do this, they would probably get a much higher score because the closed source models are always ahead.

Uh, so what advantage of, uh, of me being on this, uh, maybe on the organisation committee for this thing is that you get to see secret data. That's, uh. Uh, so this is a table. These are the 50 questions. Uh, is this is this competition's just finished. Uh, and then these orange bars indicate sort of the ten most popular answers that. So these are the top 500, uh, eyes that took part in this game.

And, uh, the light orange is the most popular answer. And then the dark purple is the least popular answer. And, uh, so you grade it from, uh, light to dark orange, uh, and then the grey is all the other answers. And then you take the most popular colour in green so that, you know, of these 50 questions, the top questions there, the easiest question. You can see like 90% of the models, uh, were guessing the right answer first time.

Uh, so and uh, then by the time it halfway down, you can see it's still the case that the most popular answer was equal to the, uh. I was equal, uh, to the correct answer. For the first 50% so that you see things are going wrong. Right? These things where there's very small green dots. So the, you know, like the green, the green stripes there indicate like the eighth most popular answer turned out to be the correct answer.

So that, you know, there's some data. Uh, but basically 500 language models are getting, you know, getting half of these questions correct. So there's some kind of idea as to where open source AI is now. Uh, but we don't care about these stupid undergraduate questions, right? We care about hard questions. So what about this frontier math data set? Here's a here's a proper question. Uh, here's a question about moduli spaces of conics.

So, you know, here's some intuitive geometry question. Well, it looks like an intuitive geometry question. It's actually a combinatorics question which is heavily in disguise. Uh, but you've got moduli space of conics tangent to a given set of five conics in the plane. Uh, and the question is and now you look at this, you look at these moduli space mod p and you ask how many how many connected components this got. Uh and as p varies this answer changes. This is not over FP bar over FP.

So this is an arithmetic question not a geometric question. Uh, and then the answer is give the answer to two decimal places. So again the answer's very short. Uh, but now this is research level mathematics. Uh, and one of OpenAI's closed source models got 25%, uh, on this dataset. So we've got questions of that nature. You know, that looks. That's technical and complicated. But the AI is getting 25% of that, including a correct solution to the question.

The. So that's where closed source is right now. And. And um, but when you actually ask, this is the problem. You know, it's great the you know, the model can solve that. The model get that number right. But then when you ask it to explain why that number's correct, it's wishy washy. Is it because it makes them get it has to like if you've got some Galois group and has to guess what the Galois group it is, it just guesses the gavel group is the full symmetric group. And it's right.

And then you ask it why is it the full symmetric group. And it's vague. It doesn't actually know why. But it's just you know it's just this is the problem with these number questions is they don't correctly test reasoning. They test how good you are at guessing the answer. And these models have got very good at guessing the answer.

Uh, so the close source, the close source models are around two years ahead, and the close source models are trying to pull ahead because the tech companies have stopped publishing. This is this is what's happened because they want to stay ahead and the open source models are trying to catch up.

Say, why are these datasets private? That's kind of obvious, because if they weren't private, if you just put them all on the internet, that people would start asking them on Math Overflow and people would start answering them, and then the answers would be on the internet, and then the models would learn the answers, and then the models would just all get 100%, so it would break the system. And now what are the RSS numbers? That's a much more pertinent question.

What are their numbers? Just because we want to be able to market really easily. Right. If you ask a question and if you ask the how do you prove this theorem? If the answer is a two page long proof. Who's going to mark that right. You can't have I marking its own homework. This is the problem right? If if you if you get if the question is prove this theorem and the questions in English and the answer is in English,

you see where this is going, right. If the questions in English and the answers in English and the answer is two pages of LaTeX, and we have to decide how many points out of ten this answer, a human has to read it right. And that's really expensive. Right. So the computer scientists want to get state of the art. But the problem is marking a proof. Right. That's what we do. We don't. We don't ask, what is this number?

We ask how to prove this theorem. But it it's too expensive to mark that in natural language. So the natural language data sets are all what is this number. Questions. Right. That's not what we do. Okay. So formal data sets right. And these are data sets now. Now it's different. Now the questions are not written in English. They're written in lead. Uh, all these other languages. But yeah, the vast majority of them are written in lean.

And now, of course, you can ask the interesting question is I prove this theorem right? And now the idea is this time. Now the model writes a model writes a solution. And now we just throw it into lean and see if it compiles. Yeah, because now a machine can grade the answer, right? If it's wishy washy, it's just not going to compile, right. That's what we do. And now, of course, the big problem, uh, unfortunately, uh, is that right now the level is terrible.

The level is terrible. The data sets are mathematically trivial, right? So we have these formal data sets where the questions are in line, but they're GCSE. They're really very, very easy stuff. So the hardest one I'm aware of currently for these formal data sets is Putnam Bench, uh, which is formalised statements of 640 uh Putnam's uh Putnam questions. So the Putnam is an undergraduate level, uh, undergraduate level problems.

Uh, so hard undergraduate level problems and state of the art is abysmal, right? State of the art is ten out of 640. Uh, so that's good when you see state of the art like that, because it means there's plenty of room for improvement. Well, by the time that the by the time the state of the art is 60%, then you really you've lost interest like people like, oh, great, I've got 65%. But like if you're solving half the problems, then you know won't be.

Give us a year and you'll have solved all the problems. But right now this is undergraduate level material formalise and lead and people can't do it right. Apparently OpenAI's most recent. Oh, no. No, this is this Chinese model, deep sea prover A claiming they've got they've got more than ten, but it's yet to be. And you know, the the Putnam people have to officially check this, but apparently, uh, there's been a big jump this week, so we'll wait and see.

But what do we really want? Like, what am I trying to sell here? I want to. I want these systems to do the kind of mathematics that I want them to do. The periodic Langlands program. Right. That's what we want. And, uh, because, you know, that's what's happening in places like this. And so data sets are private. That's a bit rubbish. Like data sets are too easy.

That's a bit rubbish. Uh, so there's a very good paper by, uh, by Simon Fraser again, uh, talking about, you know, the general problem of mathematical data sets and just flagging all the things which are somehow insufficient currently. Uh, but I do believe that when we get it right and we start making data sets that are actually correctly modelling the kind of mathematics that researchers are doing, you will get computer scientists, you know, trying to get state of the art on these data sets.

Right. So where are we? Like, language models are rubbish. Language models make stuff up. Uh, current data sets aren't asking the right questions. Uh. And so now let me finish this talk with a vision. Uh, this isn't science fiction. Uh, and is maybe a route to some kind of tool that mathematicians can use. Uh, so let me go all the way back to 2023, uh, where language models that first appeared and they couldn't compute.

Right. And it was really funny you asked them, you asked ChatGPT to multiply 210 digit numbers together. You give it two explicit two digit numbers together, and it confidently gives you a 20 digit answer. And it begins in the right number, and it ends in the right number. And in the middle, all the numbers are just made up. That was like that was what it looked like in 2023, right? But in 2024, when you ask it to multiply 210 digit numbers together, it just follows a python.

You know, it fires up Python to ask Python what the answer is, and then it reports the answer because these systems learned how to write Python code. And how did they learn how to write Python code? Well, they just train them on Python. There's billions of lines of Python code on the internet. So the language models just train on that Python code. And then I got really good at writing Python. And now when you ask some computation questions the python's good at it.

Just, you know, it just gets Python to do it. But right now, as I've said several times, they can't reason at a high level. Right. And the the harder the mathematics gets, the worst, I guess, at reasoning. Okay, so maybe we should just teach them to write lean code. That would be good because then, you know, that stops them [INAUDIBLE], right? It's you know, you ask them to write, you say prove this theorem. And then internally it proves it in lean and that it translates it back into human.

You know, it's going to get it right. But the problem is we can't train it only in code because there's only millions of lines of leading code. And that's and a lot of that is undergraduate level, right? This is the problem we want to break through on. We want to get them to do research. We want to break through undergraduate level. And most of it is undergraduate level.

So undergraduate level. That's great. Are you? DeepMind has shown us that you can train on what's already there and you can get heart. You know, you can solve half 50% of the 2024 IMO, given given only what we have. But as I say, that's lemmas, right? That's not theorems. So how are we going to get this further? And we just lied. This is a big problem. We need to think seriously about getting a lot more modern mathematics. Formalised. It seems to me this is this is the bottom.

This is the problem. We need to get a lot more modern mathematics formalised. And because that does not this is not science fiction, right? Maybe, maybe we can get machines to do that. Maybe because it's one thing coming up with your own ideas, but actually translating from a human proof into a computer proof that really does not sound. Yeah, that's much easier than coming up with your own proof. Just doing that. Like we can translate between English and Chinese.

So I Gobbi translate between English and lean. Right. That that doesn't sound crazy to me. And of course, you know, people are thinking about this. And so finally, whether this is one of the reasons why I'm currently teaching Lena proof of Fermat's Last Theorem. So this was proved in the 1990s by Andrew Wiles and, uh, ably assisted by my advisor, Richard Taylor. Uh, the proof is huge. Uh, I mean, the Wiles paper is album pages, but it's got 80 references.

Uh, like I said, the proof is over 200 pages either if you want to write down the proof. Yeah. From the axioms of mathematics, it would be more like 2000 pages. I think it would be a more reasonable estimate of how long this proof is. Uh, and and the weird thing about the proof is that the Fermat's Last Theorem was a question about positive whole numbers. But the proof goes through all manner of really complex structures.

Right. All these modern buzzwords. Uh, elliptic curves and modular forms and Galois representations of finite flat group schemes. And we've got them. All right. There's you know, there's an example. Yeah. Those are examples of things that over the last couple of years, we've now carefully formalised these things in lean. So that in I mean, you know, it's in my Fermat's Last Theorem repository and uh, it's slowly moves in, in my repository as a feed, a repository for math lib.

Uh, so elliptic curves are modular forms are math lib, uh, and the other two things, Galois representations and group schemes. They're not in Matlab yet, but in a year's time they will be. That's, uh, that's how the project is going. You know, I'm I'm building stuff. But, you know, when we make a new definition, we make sure it works. You know, we make sure it's usable. And then when we're convinced we've got it right, uh, we push it. But but it's. No, it can't just be me. Right?

It can't just be me doing that. There's more to mass than Fermat's Last theorem. It turns out. Uh, there's other there's other areas of mathematics. And here are just some dumb examples of stuff that are like undergraduate or beginning graduate level mathematics that are not formalised. Yeah, that are not formalised. And as far as I know, any theorem prover. And I just need someone to sit down. You know, it's not difficult to sit down and teach the machine.

It's not just the definition. The definition and basic facts about the definition. Right. But the thing is, we can't get machines. We can get machines to prove theorems. But it's really difficult to get machines to write down definitions because formalising definitions is an art. Right. Is it system specific? Right. Well, the best way to to teach lean, uh, what are not is, isn't necessarily the best way to teach Isabel or to teach math and math.

Right. It's system specific. Uh, and also, you've got the problem. The machine isn't holding your hand at this point, right? The machine will check that your definition is syntactically correct, but the machine won't check. It's semantically correct. The machine won't check that you've written down the right definition. The machine will just check. You've written down a definition that makes sense. So you've got to be super, super careful here. So this is something.

This is something that human experts need to do. Right to theorems. We can let the software do it itself. Right. In two years time, maybe the systems will be translating from human maths. Translate a human proof into a living proof. That is absolutely not out of the question. But for me, formalising definitions you absolutely need to take anything in machine says with a very big pinch of salt, right?

So who is going to do? Who is going to formalise all these definitions in all the graduate courses that are being run in Oxford? Right. It's not going to be computer scientists. Okay. Whose job is it? I mean, I claim. I claim that it's the mathematics research community's job, right? We've got we've got to do. We are the people who are qualified. But the problem is, we're not going to be rewarded for that, right? The reward system in maths is that you get points, you get money.

If you get points in the research assessment exercise and you get points in the research assessment exercise by papers. Do you get papers by proving therapies? New theorems builds upon old theorems which builds upon old theorems, not all of which are correct. As I found out when I was trying to teach the computer proof of Fermat's Last Theorem. We're not as smart as we think we are. So we've got this idea that everything in the lecture is fine and we just need to build more.

Right. But it's is is this really is this really how we're going to change mathematics? Right. I think that this whole, this AI and theorem prove is that together they are giving a completely new way of thinking about mathematics. But but the system isn't, you know, the system. There's something wrong. Right? That that's the problem. You know, and actually actually. Yeah. What can full moon message do for mathematics? Okay. Can they really turn it on its head? I mean, that's it. Thank you.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android