Okay. Welcome, everybody, to the Strachey Lecture. I'd like to start by expressing our gratitude to Oxford Asset Management, which has really generously supported the lecture. So I want to thank them for that. I also want to start with one announcement, which is that there will be refreshments after the lecture just outside.
So everybody is invited. Please do join. And then the most pleasant task will be introducing Les Valiant, who is going to be speaking to us about whether one can define intelligence as a computational problem. So Les is one of these people who, as many of you know, invents one research field after another. So when I thought of which things to tell you, I just picked a small selection of them. And so one is that Les did a lot of work in algebraic complexity theory.
So the complexity classes VP and VNP, which is still a very active area; the V stands for Valiant. Then he moved on and started the complexity of counting and the class #P, and there's a whole community that works on that. Then he went on and founded computational learning theory, so the book Probably Approximately Correct was really influential, and this was basically the first rigorous study of what can be learned. After that, that not being enough,
Les became interested in classically simulating quantum computation, and he discovered holographic algorithms and that whole area. Then he went on to write his book, Circuits of the Mind, which is about a computational analysis of the human brain. And I think many of these threads will be brought together today. It's traditional in these talks to mention prizes, but Les has actually won pretty much every prize.
So I picked out just four so that we would get on to the talk. So he won the Nevanlinna Prize in 1986, became an FRS in 1991, won the Knuth Prize in 1997, and the Turing Award in 2010. And I looked up Les on the Mathematics Genealogy Project and found 109 descendants. But I'd like to point out that 13 of them are right here in our department. So we all owe quite a lot to Les, not just for his intellectual stimulation, but also for his mentoring.
Oh, and I forgot: please fill out the questionnaire. And so I'm delighted to introduce Les Valiant.
Well, thank you very much, Leslie, for the very kind introduction, and thank you very much for inviting me here. Also, about 30 years ago, I spent a very happy sabbatical year in Oxford. I was treated very well. So I do have very happy memories of Oxford and I'm very glad to be back.
So what I'm talking about is kind of a theoretical approach to AI. And in brief summary, it's a way of reconciling machine learning and reasoning. So it's a topic which is close to my heart and has been for a long time. But in giving talks, often the hardest thing to get across is why someone is doing this kind of thing at all. So I'll be slightly self-indulgent and try to explain the motivation of this kind of approach.
And so first, I want to discuss this notion of a computational phenomenon, which not many people discuss. So, you know, algorithms have been around for a long time, and Euclid had a very good algorithm: given two numbers, you can find the greatest common divisor, and efficiently. And now, for example, if I give you one number and you want a divisor, which is factoring it, as far as we know that takes exponential time. But given two numbers, you can find a common factor.
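As a minimal sketch of the contrast just described (an editorial illustration, not something shown in the talk), the first function below is Euclid's algorithm and the second is naive trial division. The function names are assumptions chosen for this example, and trial division is only the simplest factoring method, not the best one known.

```python
def gcd(a: int, b: int) -> int:
    """Euclid's algorithm: repeatedly replace (a, b) by (b, a mod b)."""
    while b:
        a, b = b, a % b
    return a


def smallest_factor(n: int) -> int:
    """Naive trial division for n >= 2.

    The loop may run about sqrt(n) times, which is exponential in the
    number of digits of n, unlike Euclid's algorithm above.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor found, so n is prime


print(gcd(1071, 462))         # 21
print(smallest_factor(8051))  # 83, since 8051 = 83 * 97
```

The number of loop iterations in `gcd` grows with the number of digits of the inputs, while the loop in `smallest_factor` grows with the square root of the input itself.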
So there's something very striking already in what people knew about algorithms a couple of thousand years ago, and many of the best algorithms we know are ancient. So what is computer science contributing in general? Well, of course, the big change was Turing's paper in 1936. So I'll start by trying to spell out my view of what the big event was.
And I will discuss this notion of a computational phenomenon, where for Turing the phenomenon was computation itself, and the model of computation, which for him was a Turing machine. And the best way of explaining these ideas is by making an analogy with physics. So I'm not trying to say that computer science and physics are the same, but analogies do serve some purpose. So what do you have in physics?
You have some laws, like F = ma and the law of gravitation. So you've got some laws which are believed to hold generally, but what they are really supported by is mathematical theorems, which are consequences, deductions from the laws, and with which you can really understand the incredible breadth of what the law means. Okay, so we learned about this a long time ago.
So as a computer scientist, sometimes I have wondered what we are offering that is comparable to what the physicists have been doing. And on reflection, I think what we are doing is what Turing did, which is that what corresponds to the law in physics is a model of computation. So he defined a model, the Turing machine. And the general claim was that it captures computation in the real world in a very significant and general sense.
And this is a big statement, but it's supported by mathematical consequences, exactly as in physics. So, for example, one important consequence is that there's a universal Turing machine, and with this notion you can discuss non-computable problems. And another very important thing about models of computation is that they're robust: if you make small changes, it shouldn't change the power.
So I think this is the main thing computer science offers, and what the rest of us since have been trying to emulate in other ways. And so the idea here is that, at an even more general level, what Newton implied is that you can capture these laws of physics by equations. And I'm claiming that the general statement is that there are phenomena in computation, and you should capture them by models of computation.
So Turing's example was that the phenomenon was computation in general and the model was Turing machines. But much of what we've been doing since in the algorithms area has been on the same tracks. So, as a side comment: I'm describing an analogy with physics, and I'm just pointing out that there are other analogies other people use. So some people use the analogy that maybe unproved mathematical conjectures, like P not equal to NP, we should treat like physical laws.
Okay, that these are things people believe but can't prove, so let's believe them until someone disproves them. So that's okay. I've got no quarrel with that. But I'm really drawing a different analogy, which is kind of more general; in my book I try to expand on it in a bit more detail than here. So the physical law, which is believed true but not provable, is like a model of computation: the claim is that the model of computation is valid for a real phenomenon.
Okay. And okay. So for example, another phenomenon is search. And in fact, the best description of this in words is the phrase mental search, which Turing used already in 1948. So the idea is that you're searching, searching for oil in the ground. You're searching for something in your head, and you're like searching for factors of a number. And the definition is called NP, where you're searching for solutions which are short compared to the input size.
And also, given this solution, you can easily verify it. So this is a formalisation of search, and NP is a rigorous, formal statement of it. And we believe that NP is a real phenomenon in computation, which lots of people find useful. And the model of computation is the nondeterministic Turing machine.
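The "short solution that is easy to verify" idea can be sketched with the factor-search example mentioned above. This is an editorial illustration; `verify_factor` is a hypothetical name, and the point is only that checking a proposed solution takes one division, however hard it was to find.

```python
def verify_factor(n: int, d: int) -> bool:
    """Check a claimed nontrivial factor d of n with a single division.

    The candidate d is no longer than n itself (a short solution), and
    verifying it is cheap even when finding it by search is expensive.
    """
    return 1 < d < n and n % d == 0


print(verify_factor(8051, 83))  # True: 8051 = 83 * 97
print(verify_factor(8051, 7))   # False: 7 does not divide 8051
```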
And again, this definition by itself is not so impressive, but the interest is that there are powerful mathematical statements, and an important one is that there are hardest search problems, the NP-complete problems. So there's a model of computation and some stunning mathematical statements, which are surprising. Okay. So that's one. So as it happens, by P I mean roughly what we're sure we can compute efficiently
in this universe, in polynomial time. I should put in randomisation, so it's kind of a stand-in for that. So computable captures everything; that's Turing's thing, the most general. And NP is like a subclass. Another subclass is #P (sharp-P; some people here call it number-P), which is the counting version. Okay. So yet another one is BQP, which is quantum polynomial time. So this is our best effort at describing
what you'd get if you use quantum theory for computation. And with each of these classes, again, I think it's the same story, that there's a model of computation. This is bounded-error quantum polynomial time. And again, besides the suggestion that we should use quantum theory to compute, there are some mathematical consequences.
For example, a very powerful theorem is that there are many ways you could try to use quantum theory to compute, and it turns out they're all equivalent. So that's a strong result. So by looking at this model of computation, you do arrive at the conclusion that there's a real computational phenomenon here. And another result is that BQP is in fact reducible to #P. So counting problems are at least as powerful as the quantum class.
Okay, so we've got these various classes with different power, and there are more. So another phenomenon is PSPACE, and this captures the idea of games. So this is your kind of game theory. And again, powerful results about that are that there are complete problems, the hardest members of the game theory class. Of course, mathematically, we don't know: all these classes could collapse, for all we know.
And it may be that for all these things you can do everything in polynomial time, efficiently. But even if that happens to be the case, one can still discuss these as separate phenomena, I think, though that's a longer discussion. But it's with this background that I have approached topics. So if one looks into machine learning, I've tried to formalise the notion of supervised learning.
So we'll discuss that more. It's called PAC learning, probably approximately correct learning, which is a subclass of P, or is believed to be a proper subclass. Roughly, that means there are many things you could write a program for, where the program would fit into your computer or the universe, but the belief is that most of these you can't learn from examples.
So learning is harder than just computing. And the question of where the PAC-learnable class sits within P matters. For example, one observation is that cryptography lives off the difference. So public-key cryptography wouldn't exist if any function you could compute you could also easily learn from examples, because then you could kind of learn all the secrets. So negative results about complexity are used every day, especially by cryptographers.
Okay. So with PAC learning, again, you have a model of computation. And this model, which I'll describe in more detail, captures the notion of supervised learning, which is a well-known concept and widely practised, of course. And again, having a formalisation, obviously questions of robustness are important. If I define this class in different ways, do I get different classes?
It's important that it's robust: many variants give you the same class. And some consequences are, for example, that you can give a rigorous demonstration that a learning algorithm does really generalise. You know, generalisation used to be some philosophical issue not many decades ago; of course, now it's practised by machines. And you can also explain why some algorithms are predictive in a certain sense; there's nothing magical about them.
Okay. So I want to describe PAC learning a little bit, because I'll build on that later on. So it's a formalisation of supervised learning. So there are these terms, supervised learning and unsupervised learning, which I use in very general senses. And one reason for formalising it is that at least it defines what we're discussing. But generally it's talk about learning where there's some feedback.
So supervised learning doesn't mean that there's a supervisor necessarily. So, for example, I can look around the room and learn something about what the average audience in a computer science lecture in Oxford is like, the average age or something like that. There's no supervisor telling me things. I'm doing this because, from other knowledge, I can label people myself. I know roughly how old everyone is. Okay, so I can learn without an external label.
So supervised learning doesn't mean that there has to be a supervisor. Essentially it means that if there's any kind of feedback, it's supervised learning. So unsupervised learning is where there's truly no feedback. You know, there's some pattern, and somehow we're supposed to draw some conclusion. But certainly I think the impact of machine learning recently is all with feedback, the supervised kind. Okay, so what's this formalisation?
So the idea is that there's some space where each dot is an example, maybe of a flower, and you're trying to classify flowers by what species they come from. There are two types, A and B. There's a truth: f is a ground truth, which separates the As from the Bs. And the learner also has a hypothesis, which classifies examples. And in any rich enough world worth talking about, there will be errors. Okay. And we assume
it's a very rich world: exponentially many different kinds of examples, maybe infinitely many different examples. So we want to talk about something realistic. And so what's this supervised learning phenomenon? It seems amazing. It works. People celebrate it, even in the popular press. So this formulation has three points. One is that there's an efficiency criterion. It says that there will always be errors.
But the more examples you take and the more computation you apply, you should be able to reduce your error fairly fast. Okay. It should be rewarding to put more effort into learning. So if you double the amount of effort or the number of examples, you should see a decrease in the error. And so this is something quantified, and the important thing is that it goes down algebraically.
So if you have n examples, maybe the error goes down as one over the square root of n, maybe as one over the tenth root of n, but it shouldn't be slower than that kind of rate. Okay. And of course this is a realised gain, in that people in the last ten years have increased the budget in data and computation by a factor of maybe thousands, and this has brought really good rewards.
And for some simple learning algorithms, you can prove that the thing learns, and that it learns this fast. Okay. And so this is just in pictures. So the quantitative aspect is that the more effort you put in, the more the error goes down, as a power of the effort. So it may be, you know, one over n to the half. Okay. So if you want to reduce the error by a factor of two, maybe you have to put in a fixed factor more effort, like four or a hundred.
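To make the arithmetic behind that "fixed factor" concrete, here is a small sketch. It assumes the error follows a power law, error proportional to n to the power of minus alpha; the function name and the symbol alpha are choices made for this illustration, not notation from the talk.

```python
def effort_factor(error_reduction: float, alpha: float) -> float:
    """Factor by which n must grow to cut the error by `error_reduction`,
    assuming error is proportional to n ** (-alpha)."""
    return error_reduction ** (1.0 / alpha)


# Error ~ 1/sqrt(n): halving the error needs 4 times the examples.
print(effort_factor(2, 0.5))
# Error ~ 1/(tenth root of n): halving the error needs about 2**10 = 1024 times.
print(effort_factor(2, 0.1))
```

Under this assumption the factor depends only on the exponent, not on how much data has already been used, which is what makes the improvement predictable.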
Okay. And some people have actually verified this experimentally: for the task of predicting next words, various deep learning algorithms do have this polynomial reduction in error. So this is a log-log scale, so the curve straightens out to a straight line. And also PAC learning doesn't tell you what power law this should be. And in fact, there's evidence that different applications do have different values.
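The log-log observation can be sketched as follows: if error is roughly c times n to the minus alpha, then log(error) is a straight line in log(n) with slope minus alpha, so a least-squares line recovers the exponent. The data below are synthetic and generated only for illustration; they are not measurements from the experiments mentioned in the talk, and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(ns, errors):
    """Fit a least-squares line through (log n, log error).

    Returns (alpha, c) for the model error ~ c * n ** (-alpha).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return -slope, math.exp(intercept)


# Synthetic learning curve with a known exponent of 0.5 (illustrative only).
ns = [100, 1_000, 10_000, 100_000]
errors = [3.0 * n ** -0.5 for n in ns]
alpha, c = fit_power_law(ns, errors)
print(alpha, c)
```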
So if you take some natural language data sets or a vision data set, they have different power laws. Okay. So this one seems to be a very slow power law, but it's a fixed power, and 0.06 is good enough. Okay. So this is the efficiency criterion. So before we call a machine learning algorithm successful, we demand this efficiency criterion. Then there are two other aspects. One is that we want to be realistic.
The world is complicated. So the last thing we want to do is make an assumption about the distribution of examples. So we know that some things are As and some things are Bs, but in different worlds there may be different probabilities of each kind. So whether you're here or you go to China, maybe it's the same flowers, but with different probabilities. So the second requirement is that the learning algorithm work for arbitrary distributions.
So the secret, of course, is that you're going to learn on a distribution and you're going to have to perform on the same distribution. So the kinds of flowers common here, you'll be tested on those; in China, where different ones are common, you'll be tested on something different. So this basically says that in practice the successful learning algorithms are very broad-spectrum.
Okay. They don't just work for the uniform distribution. And then the third thing, which is a bit more sophisticated, is this. When you're learning, the learning algorithm gives you a hypothesis, and that has a computational representation. But the classification by the teacher is just a function; you don't look inside the teacher.
Okay. So it's just a behaviour. And in practice, this learning algorithm is something you have in your hand, maybe a perceptron, a deep learning network, some sort of boosting. But then the examples come, and no one guarantees where they come from. And often you use a representation for learning where you've got no chance of learning everything you can represent.
But you're still successful. And the reason usually is that the examples come from a weaker world. So there's something simple about the world you're learning from. So the mystery of why certain heuristics work well in practice is often that the tasks they're given have some simplicity in them, which is often very hard to identify. Anyway, so this is a specification of a formal model of supervised learning.
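For reference, the three requirements described above (efficiency, arbitrary distributions, and a target that is only observed through its labels) are usually written down along the following lines. This is the common textbook formulation, given here as an editorial sketch rather than the speaker's slide; the symbols ε, δ, D, h, m and the size parameter n are standard notation assumed for this illustration, and only f (the ground truth) is named in the talk.

```latex
% A concept class C over an example space X is PAC learnable if there is
% an algorithm A such that, for every target f in C, every distribution D
% over X, and every epsilon, delta in (0,1):
\[
  \Pr_{S \sim D^{m}}\bigl[\operatorname{err}_{D}(h_S) \le \varepsilon\bigr] \;\ge\; 1 - \delta ,
  \qquad
  \operatorname{err}_{D}(h) \;=\; \Pr_{x \sim D}\bigl[h(x) \ne f(x)\bigr],
\]
% where h_S = A(S) is the hypothesis output on a sample S of m examples
% drawn from D and labelled by f, and both m and the running time of A
% are bounded by a polynomial in 1/epsilon, 1/delta and the size n of
% the examples.
```

The error is measured on the same distribution D that generated the training sample, which is the "learn on a distribution, perform on the same distribution" condition.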
Okay. That was by way of introduction. So I've got this model of inductive learning, and we know that machine learning, which does roughly this kind of thing, is very successful. But the question in the title is something about intelligence. So is this all of intelligence? And everyone would agree that the answer is kind of no, or almost everyone agrees.
So what more is there? And what we want, if you follow this approach, is a model of computation. So PAC learning is a model of learning, but it's not enough, because we don't think just learning is enough. So what more can we do? What should we add? And I say add because I think inductive learning is a pretty powerful phenomenon, and we need to add to it rather than start from scratch.
What do we need to capture? And so the adage which I've been going around with for a long time, and advertising, is this line from Aristotle, who said that all belief comes from syllogism or induction. By which he meant something like: if you have a belief in your head, then you either deduced it by syllogism, some sort of logical deduction from something else, or you got it by induction.
So induction means that, on the basis of empirical evidence, you generalised it somehow. Okay. And of course he spent, you know, 99% of his effort on syllogism and didn't say much about induction. And what's happened since, of course, is that syllogism has become this big field of mathematical logic and formal reasoning, while induction became this mysterious sort of philosophical field. But I think the issues have been clarified by machine learning and machine learning theory.
So as an example. When I started, the question was: how come children have seen different examples of chairs in different parts of the world, and yet, even for a new chair, they agree on what's a chair and what's not? That was kind of a mystery. There wasn't a good answer to that. But now machines can do this routinely. So asking this question wouldn't mystify anyone living now.
And the reason is that machine learning theory gives an answer on what it means to achieve this. You only have to perform well on the distribution you've seen, and it's probabilistic anyway. So we do have a handle on this. And before I go on, just to say that there are some technological aims here. So what I'm discussing will be how you have a unified view of reasoning and of learning, because at the moment they're very different.
You know, classical reasoning, classical logic, is this very brittle kind of mathematical theory, whereas machine learning is this kind of robust thing of a different kind.
So we do want to unify them. But the grand goal, if you can do that as foundational technology, is to approach the central problem of AI, which I believe is how you put into a computer common-sense knowledge, which at the moment is very hard to acquire, and be able to use it in the computer to reason,
to make predictions, deductions, whatever. Okay. And I can't imagine how you can do that unless you take some unified view of what reasoning and learning are. It seems that if they're two disparate things, it's a bit difficult. Now, in modern terms, I suppose there's a debate. And basically I'll be saying that reasoning and learning are both important and we have to reconcile them.
Not everyone has to agree. So, for example, at the moment, there are some people who are so enthusiastic about machine learning that they think that a single black-box machine learning thing will do everything and we won't need reasoning. Okay, so that's a view. And other people may put reasoning high on a pedestal. But putting it more simply, there are people who actually deny that reasoning is real,
and other people who deny that learning is real. Certainly, I think, 30 or 40 years ago there were real learning deniers, people who thought that intelligence was all reasoning: putting in facts and reasoning efficiently with them. There were certainly learning deniers then, and now there are some reasoning deniers around. But in this talk I'll take a middle ground. Okay, so let's try this one: did Aristotle own a cell phone?
Okay. So most people can answer this question without too much effort. But the question is, did you use pure learning for this, or did you use pure reasoning, or did you use something else? So the main contrast I want to give is that at the moment one has to argue a bit against people who want to do everything by a single black-box machine learning thing.
So, for example, the idea is that if you feed this black box a billion sentences from the Web, maybe it can answer every question and the reasoning will go away. But I think common-sense introspection suggests that, to answer this question, it's not that we've been exposed to thousands of sentences about Aristotle's property. Somehow we knew some facts and we chained together facts we knew. So there's some reasoning involved. So this is introspection.
But can we ask the same question of learning versus reasoning a bit more scientifically? So we want some experiment to do which tests this kind of issue in a plausible way. And the problem which came to us, which is, I think, very natural for this, is called the word completion problem. Essentially, I take a phrase from a website or a newspaper, usually a headline, and I delete a word.
You have to guess what the missing word is. This is quite a good IQ test, because it's quite hard to do. These headlines are often quite succinct, just the minimum number of words to express what you want to say. And of course it matters that the headlines may come from a world where you have no knowledge. So actually the examples I have happen to be from an English-language Chinese newspaper.
Okay. So let's have some examples here. So this was: "Whatever the Year of the Dog holds in store, pet owners will be lavishing more attention than ever on their ..." So you have to guess what the missing word is. And the question is, could your computer program do it? Any guesses? Children? Okay. So the answer was "pooches", which is an early 20th-century word of American origin for dogs.
And so the question is, is this a hard problem, say, for a black-box machine learning algorithm? My guess is that this is easy, because if you look on Google for sentences with "pet owners" and "their pooches" in them, you get tens of thousands. So this is an easy problem for black-box machine learning. Okay. So another one: "China rises as a maritime powerhouse after snapping up profitable [blank] [blank] across the world."
Okay. So the answer is "seaport terminals". Good. So I reckon this is slightly harder, that you probably have to do some reasoning. You can't just do it by some sort of word matching, because you didn't have any examples of that. So the hardest examples are where inductive learning is impossible as well, where there's some kind of news. To understand the headline, you have to know what happened yesterday. Okay.
So, for example, one headline says retail sales are up 20.7% in the second quarter, with a word missing. Okay. So the answer here is Macau. So maybe if you're an expert on conditions in the different parts of China, you could do it. But you need lots of knowledge, and maybe recent news, that kind of stuff. But certainly, if things depend on the news this morning, then having a billion sentences in your brain doesn't help you.
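A very rough sketch of the purely statistical end of the spectrum discussed above is a predictor that fills the blank with whichever word most often appeared next to the same neighbouring words in the text it has seen. Everything here is an assumption made for illustration: the function name, the `____` blank marker, and the three-sentence toy corpus, which is invented and stands in for "sentences from the Web". It is not the system evaluated in the talk.

```python
from collections import Counter


def fill_blank(headline, corpus, blank="____"):
    """Guess the missing word from neighbour co-occurrence counts.

    A corpus word gets one vote each time it follows the same word that
    precedes the blank, or precedes the same word that follows the blank.
    Returns None when the corpus offers no matching context.
    """
    words = headline.lower().split()
    i = words.index(blank)
    left = words[i - 1] if i > 0 else None
    right = words[i + 1] if i + 1 < len(words) else None

    votes = Counter()
    for sentence in corpus:
        toks = sentence.lower().split()
        for j, tok in enumerate(toks):
            if left is not None and j > 0 and toks[j - 1] == left:
                votes[tok] += 1
            if right is not None and j + 1 < len(toks) and toks[j + 1] == right:
                votes[tok] += 1
    return votes.most_common(1)[0][0] if votes else None


# Invented toy corpus, for illustration only.
toy_corpus = [
    "pet owners love their pooches",
    "owners walk their pooches daily",
    "they sold their house",
]
print(fill_blank("pet owners lavish attention on their ____", toy_corpus))
```

A predictor like this can only succeed when the answer already co-occurs with the context in its training text, which is why headlines that depend on yesterday's news are out of its reach.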
Okay. So the interesting thing is that with this problem you can test your machine learning system on how well it solves it. And I think this problem is not bad. It's kind of a stand-in for the Turing test in certain ways. The Turing test has many aspects, but one aspect is that it measures something: how well do you perform compared to something else?
And of course the other important thing about the Turing test is that he didn't say that intelligence depends on how well you play chess or how well you know chemistry, but that it depends on general knowledge of general stuff. So this missing word test is good on that. And what the learning theory perspective adds is that it emphasises that any kind of performance in any system like this is with respect to a particular distribution.
So it's hard to be intelligent if you go somewhere where your knowledge is irrelevant. And it also emphasises feasible computation: we're interested in efficient computation, in controlling the error of your prediction, and things like that. Okay. So we'll come back to this problem.
So I'm suggesting that if you tackle this problem of common-sense knowledge and learning and reasoning, this isn't a bad problem to test your system on, because there's a ground truth, and it's about general knowledge.
Okay. So what I'm really coming to is my content, which is my suggestions for having a model of computation which can do both: inductive learning, which I think is an important phenomenon, with reasoning added on to it. And with this combined system, if you do it well, which we haven't yet, you can test it on a problem like this word completion problem, the missing word problem. Okay. So the question is, how do we add to it?
Okay. So what is intelligent thinking? What else do we do besides inductive learning? And how do we make this into a model of computation? So these models get kind of complicated. A Turing machine is already complicated; this is much more complicated. And its justification is that you're capturing something important, maybe, and that other ways of capturing it would boil down to the same thing. Anyway, this is more a list of things you need to capture.
And so the first feature is this idea, borrowed from cognitive science, of a working memory. So this is an amazing thing we have in cognition: while we have this enormous store of memories, at each instant somehow we've got this small mind's eye which directs our behaviour, this little world in front of us, what we're aware of, and we use this awareness to plan our lives, what we do next, what we do after the lecture.
And so all our behaviour is channelled through this small window. So what's going on here? The explanation here will be that we need to restrict the window for computational complexity reasons, and as a model we need to use it to get anywhere. So roughly, this is how we formulate it. You wake up in the morning and your mind's eye is blank, but it's got a few free tokens, and you can fill it up during the day with what you're thinking about.
Okay. So you fill it up with a scene. So you think of your dog, and then you want to picture your dog, you want to see what the dog is like. And then you have a rule in your head which tells you that, in fact, dogs like bones. Okay. So somehow, with your background knowledge from your big long-term memory, you can fill up your mind's eye with the missing information.
So this is roughly what goes on. But here we come to the first difference between logic and learning. The point is that this implication doesn't fit well with PAC learning or any kind of learning. When you do machine learning, you've got some target function, so say you're learning to recognise an elephant. And basically what you're learning is a criterion for a picture to contain a representation of an elephant.
Okay. So what we are really learning, if you do supervised learning, is an equivalence. So you have maybe a big neural network, a perceptron, or a decision tree, and you want to predict whether what's in front of you is a bone or not. And on the left-hand side is some very rich, incredibly complicated rule, tens of thousands of bits possibly, but it does contain a useful criterion of whether what's in front of you is a bone or not.
It's a predictor. Okay. So the first step of the proposal is that we are going to learn these things. So maybe for each word in the dictionary, you're going to learn a predictor in terms of the other words. And this predictor can be whatever your machine learning algorithm can do; it depends on your computational resources. And this is what the learned rules are. They're equivalences, but the point is that you can chain these together to make predictions.
Okay. So using something like this, if the conditions in your scene predict that this is a bone, then it's a very good thing to predict that it's a bone. And once you predict it's a bone, you can make further predictions about your scene using these equivalences. So that's basically the idea of how your mind works: your mind is full of not one black-box neural net,
but tens of thousands. And somehow the predictions of these tens of thousands are can be used together in a principled way to make predictions, because that's the kind of the rough summary. Okay. So that's okay. First step. Okay. So so this robust logic is this the system for is this model of computation. And the first aspect of it is that it we're going to learn rules with equivalences which predict maybe every concept in the in the in the dictionary. Okay. Aspect one. Aspect to.
That is that we will have quantifiers. Okay. So we have "there exists" and "for all", a bit like in logic, but now they mean something much more grounded than they do in conventional logic. So in logic you learn things like, you know, all men are mortal. But then what does that mean? Has someone checked that out throughout the universe? Probably not. So these quantifiers are a bit embarrassing, almost. So in this logic, the quantifiers only refer to this mind's eye.
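The first two aspects can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the two hand-written predictors stand in for learned black boxes, the concept names (`hard`, `chewable`, `dog_likes_it`) are invented for the example, and only the positive direction of each equivalence is used. The quantifiers range over the tokens in the mind's eye and nothing else.

```python
from typing import Callable, Dict, Set

Token = Set[str]             # concepts currently believed true of one object
MindsEye = Dict[str, Token]  # a handful of named tokens, nothing more

# Hypothetical stand-ins for learned predictors, one per target concept.
# In the talk these would be large learned rules, not hand-written tests.
PREDICTORS: Dict[str, Callable[[Token], bool]] = {
    "bone": lambda t: {"hard", "white", "chewable"} <= t,
    "dog_likes_it": lambda t: "bone" in t,
}

def chain(eye: MindsEye, predictors: Dict[str, Callable[[Token], bool]],
          max_rounds: int = 10) -> MindsEye:
    """Apply every predictor to every token until nothing new is predicted."""
    for _ in range(max_rounds):
        changed = False
        for concepts in eye.values():
            for target, predictor in predictors.items():
                if target not in concepts and predictor(concepts):
                    concepts.add(target)
                    changed = True
        if not changed:
            break
    return eye

def exists(eye: MindsEye, concept: str) -> bool:
    """'There exists' ranges only over the tokens in the mind's eye."""
    return any(concept in concepts for concepts in eye.values())

def for_all(eye: MindsEye, concept: str) -> bool:
    """'For all' also ranges only over the tokens in the mind's eye."""
    return all(concept in concepts for concepts in eye.values())

eye: MindsEye = {
    "t1": {"hard", "white", "chewable"},
    "t2": {"furry", "barks"},
}
chain(eye, PREDICTORS)
print(sorted(eye["t1"]))
print(exists(eye, "dog_likes_it"), for_all(eye, "bone"))
```

Here the prediction "bone" for token `t1` is what lets the second predictor fire, which is the chaining step; the quantified questions are answered by looking at two tokens, not at the universe.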
Okay, so you've got what you're thinking about. Does something exist there in your mind's eye, or is something true for everything in your mind's eye? So it's a very, very local thing. And I suppose I should just point out already that somehow this predicate calculus logic hasn't worked out too well for AI, and there's almost something kind of embarrassing about it. Certainly with this rule, well, obviously there are many things which your dog likes, not just the bone.
So there's something very simplistic about these logical expressions, and something brittle. But the idea of predicting what a bone is, or possible versions of it, you know, there's nothing mysterious about that. That's what machine learning technology does for you. Okay. So we have some quantifiers, etc. Okay. So the third thing is, what about consistency? As I said, we're going to learn all these rules.
Okay. So that's a rule; we learn a lot of these rules. So what if somehow, when you chain them together, you get inconsistencies? Logic is certainly hung up on inconsistency. Okay. So here we say, don't worry. You learn all these rules. We'll just live with the inconsistencies. And if the inconsistencies are important, you somehow learn your way out of them. Right. So the 1960s or seventies version of this problem, which used to be a problem and no longer is, is what's called the Nixon Triangle.
So you learn that Quakers are pacifists. That's a rule. You know, Republicans are not pacifists. So these are two rules you go round with. And then there's the example of someone called Richard Nixon, and he was both a Quaker and not a pacifist. So then what do you do? Okay, so here the answer is, don't worry about it.
Go around with your general rules. And if this counterexample is somewhat worrying, then your learning algorithm will say that all Quakers except Richard Nixon are pacifists. Okay, so you'll learn your way out of an inconsistency if it's important, but you've got no chance of maintaining consistency in a complicated world. Okay, so that's easy. Okay. So rules will be learned. So instead of learning to recognise elephants, we're going to learn rules.
And so these rules will predict inside the mind's eye, in the probably approximately correct sense. And we will look for rules which are highly reliable. Okay. So we learn rules; that's aspect four. A more subtle issue, on which there's a lot of discussion, is this distribution business. So here we do get rather strange philosophical problems. As I said, there's this question of how come we agree with each other.
Although we've seen different examples, that has some history, but in the end it's not so mysterious. We could believe there's one distribution here. But then it gets more mysterious. So we've learned that, you know, Aristotle lived a long time ago, etc. So how do we use that to conclude that he didn't have a cell phone? When we learn these general facts, it's not quite clear what the distribution is.
So it's a bit lost. But you have to take a stance on this. If you have a model of computation, you have to commit yourself. So the following is just a detailed version. But the short version is this, and maybe this is the central thing. Okay. So the central model is this. The idea is that you've got this very long-term memory of lots of rules. You've got a very big brain, and a very complicated world outside.
But what saves us, what makes cognition possible, is that there's this kind of funnel in between, your mind's eye, and somehow the examples of the world you see, you summarise as a sort of sketch or caricature. And then within the simple scene you see, you apply your rules. Okay. So if I look out, I probably see three groups of seats; I can't see every individual. So we apply these rules to simplified scenes. Okay. So what is this distribution?
So, very roughly, just to persuade you that there's a way of committing yourself, although to persuade you that this is the right way would take more time. The idea is that for each scene in your mind's eye, you think of something. There are all these features. So everything is true or false. But the whole essence is that the description of the world in this game is incomplete.
Okay. You think, I want to go home. And then somehow you fill in this scene about what's a reasonable way of going home. So in this mind's eye, there's very little which is specified. Most of it is stars. So some are definitely yes, some are definitely no, and then there's an infinite list of stars. Okay. And so this says that the world sometimes doesn't specify the value of a feature.
Most of the time it doesn't. And again, going back to the AI of a long time ago, the famous paradox was about a bird called Tweety. I tell you Tweety is a bird, and I ask you, does it fly? And you say yes. And then I tell you that Tweety is a penguin. Then you change your mind. Okay, so this is some sort of paradox, if you think of it in any kind of reasonable logic.
But again, in this formulation, almost without doing anything, there is no paradox, because the idea is that if I tell you something is a bird and I don't comment on whether it's a penguin, then in fact it's probably not a penguin. If it was a penguin, I'd probably bother to tell you. So there's a distribution of examples you've seen, and whether something is mentioned may itself be useful information.
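A minimal sketch of the Tweety example with incomplete scenes, in Python. The representation is an assumption made for illustration: each feature is `True` (yes), `False` (no) or `None` (a star, unspecified), and the rule `predicts_flies` is a hand-written stand-in for what a learner might pick up from scenes in which penguins are usually mentioned when present.

```python
from typing import Dict, Optional

# True = yes, False = no, None = unspecified (a "star" in the mind's eye).
Scene = Dict[str, Optional[bool]]

def predicts_flies(scene: Scene) -> bool:
    """Hypothetical learned rule: a bird flies unless it is stated to be a penguin.

    An unspecified 'penguin' feature is treated as evidence against penguin,
    because in the scenes seen so far a penguin would normally be mentioned.
    """
    return scene.get("bird") is True and scene.get("penguin") is not True

tweety: Scene = {"bird": True, "penguin": None}
first_answer = predicts_flies(tweety)   # only "bird" has been stated

tweety["penguin"] = True                # now the extra fact is supplied
second_answer = predicts_flies(tweety)

print(first_answer, second_answer)
```

The answer changes between the two calls only because the scene changed, so no rule had to be retracted.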
Okay. So the incomplete specifications solve some paradoxes already. That's a comment. Okay. And then the game being played is this. You've got your mind's eye. Some things are yes: you're talking about your dog. Some are no: it's not a cat. Most things are unspecified. And then there's one thing you want to predict, you know, what does your dog like? So a question mark is like a forced prediction, and there's a ground truth.
Maybe the ground truth is a probability distribution, and you have to reply. So there's a distribution out there. Okay. So now another aspect, which I think is a deep one, is this idea that once you're learning many things at the same time, there's this notion of hierarchical learning. Okay. So you're going to learn this word in the dictionary, and that word in the dictionary.
And so what happens if you only half understand this word and you understand the second word in terms of the first? Or, like, if you go to a maths course, there are different concepts. Okay. So if you only half understand a concept, is it useful to be given a new example which is labelled in terms of that half-understood concept? All the evidence is that if you half understand stuff, it's not very useful. Okay. So it's very hard to learn things.
So it's very hard to learn a concept in terms of other concepts before you really understand those other concepts very well. Okay. So this is one reason why, if you just stare around in the universe, it's hard to learn a complicated concept, like that the planets go round in ellipses. That's not so easy to spot from looking at the sky.
Okay. So, in fact, the way the system works, which at first I thought was a weakness and an embarrassment, but I think is probably inevitable, is that these examples do have to come with kind of correct labels. So, you know, the value of universities is that you go to lectures and someone meticulously gives you labelled examples which are kind of correct. Okay. You don't just skim the web and try to learn something complicated.
Okay. So someone has to label the outputs correctly and also the features correctly. And if I say that something is an abelian group, which means it has to be commutative, that's not very helpful unless you've learned what commutative means at the beginning. Okay. So that's an aspect of this model of computation. And the last aspect I want to emphasise is that this is different from making a probabilistic model of the world.
This is something which avoids some of the complications. So the idea of probably approximately correct is that you're assuming that the things you're going to predict are close to probability one of being correct. I'm not in the business of estimating probabilities, 0.3 and 0.5 and 0.7, and computing with them. There's little evidence that humans are any good at that. And in trying to understand cognition, we somehow have to avoid that.
Okay. So that's seven features. Okay, so we're going to learn the rules using whatever learning algorithm you like. That's the parameter. Okay. So I'll mention generating features in a second. Okay. Good. So I mentioned this missing word problem. A while ago, with Loizos Michael, I did an experiment.
This was ten years ago. A simple experiment: small data set, simple algorithms. The idea was that we took a natural language database from the Wall Street Journal. We used some standard stuff from machine learning and from natural language processing. We used online dictionaries, WordNet and so on. And the exercise was that we were going to learn rules about the world from single sentences. To do it properly, we should be learning from paragraphs or more.
So the idea was that we were trying to learn facts about the world, which is different from learning just syntactic features, which you can do just by applying machine learning boxes. Okay. And the issue was testing my main hypothesis, which is this: even if you can do black box learning well, is there added value in chaining these rules?
Okay. So here we are testing this hypothesis, and this is an example of the kind of rules we learned. So this is the Wall Street Journal. It's about business. In a sentence there is what we call a missing word. And the question is, is the missing word "price", a typical word you find in the Wall Street Journal? And so for each word, as I said, you have a predictor, which is some enormous big mess.
But suppose it predicts for you whether the missing word is "price". The machine learning algorithm we used was essentially close to a perceptron. So we were learning a linear inequality, but the features were these compound features. And so the idea was that if somehow you find a structure in the sentence where there's one word X, and the word was "bargain", and the sentence was telling you that this bargain lowers something,
then you should deduce that what it lowers is the price. Okay. So a bargain lowering something is good evidence for the missing word being "price", and competition also lowers price. So there's lots of independent evidence which could add up to decide whether the missing word was "price". But anyway, you throw this data set at your learning algorithm.
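To make the "linear inequality over compound features" concrete, here is a small Python sketch. It is not the experiment's code: the five toy sentences, the feature strings such as `"bargain lowers _"` and the plain perceptron update are all assumptions chosen to show the shape of the idea, namely that several compound features each contribute evidence for the missing word being "price".

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[Set[str], bool]  # (compound features of a sentence, is the missing word "price"?)

def score(weights: Dict[str, float], bias: float, features: Set[str]) -> float:
    """Left-hand side of the linear inequality: bias plus weights of active features."""
    return bias + sum(weights.get(f, 0.0) for f in features)

def train_perceptron(examples: List[Example], epochs: int = 10) -> Tuple[Dict[str, float], float]:
    """Mistake-driven perceptron: adjust weights only when a prediction is wrong."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            predicted = score(weights, bias, features) > 0
            if predicted != label:
                step = 1.0 if label else -1.0
                for f in features:
                    weights[f] = weights.get(f, 0.0) + step
                bias += step
    return weights, bias

# Toy, made-up training data; "_" marks the position of the missing word.
examples: List[Example] = [
    ({"bargain lowers _"}, True),
    ({"competition lowers _"}, True),
    ({"rain lowers _"}, False),
    ({"chairman said _"}, False),
    ({"bargain lowers _", "competition lowers _"}, True),
]

weights, bias = train_perceptron(examples)

def is_price(features: Set[str]) -> bool:
    return score(weights, bias, features) > 0

print(is_price({"bargain lowers _"}), is_price({"rain lowers _"}))
```

Because each compound feature carries its own weight, independent pieces of evidence in one sentence simply add up on the left-hand side of the inequality.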
And the aim was to learn facts about the world. Okay. So you learn facts about the world, which hopefully you could chain together, and then reach conclusions which would be beyond just simple black box learning. Okay, so we did this and, you know, it's a very small-scale experiment. We got some results. So the main thing was this.
So we had about 260 words which were targets, which were frequent enough. And this is ordered from left to right, so the words where our methods were most favoured are on the right. Okay. So blue was just machine learning, red was machine learning plus reasoning. And for some words we really did much better, and with some others everything was kind of hidden in the noise.
And the general phenomenon here is that in machine learning, big data is very powerful. And once you start adding reasoning, you can introduce all kinds of noise. So pure machine learning is quite something to compete against; it's quite hard to improve on it, but it's still possible. Okay. So I think I've got maybe two slides which are a bit more technical. So this is what the robust logic thing is. So in our mind's eye, okay, we've got some objects.
So the right-hand sides of the rules aren't just predicates like bone; they could be relations too, like above or beside or something like that. The left-hand side could be the hypothesis of any learning algorithm; we used linear inequalities. Again, they could have compound features like this: you know, this thing is true if there's an object in your mind's eye such that, for every other object in the mind's eye, various things are true.
So having complicated features makes you learn better. But you've got enormous numbers of these, and you have to generate them. So you've got a trade-off. And you can use any learning algorithm you like, because it becomes propositional. You just plug in your learning algorithm. That's what you do. And then what you are guaranteed is that, well, these rules will be learnable by definition. And the main promise is about what happens when you chain together these rules.
There are some promises. And very roughly, the main promise is that if you've chained together two rules and each rule is accurate to 95%, then when you chain them together, the conclusion will be correct with probability 90%. Okay, so you lose accuracy the deeper the chaining, but it gives you some principled way of chaining together even two things. So that's the main promise.
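The 95% and 90% figures are what a simple union bound gives. As a sketch, and assuming (this is an interpretation, not a statement of the formal theorem) that the chained conclusion can be wrong only on scenes where at least one of the two rules errs, write $R_1$ and $R_2$ for the rules and $\varepsilon_1$, $\varepsilon_2$ for their error rates on the distribution of scenes:

```latex
\[
\Pr[\text{conclusion wrong}]
  \;\le\; \Pr[R_1 \text{ errs}] + \Pr[R_2 \text{ errs}]
  \;\le\; \varepsilon_1 + \varepsilon_2
  \;=\; 0.05 + 0.05 \;=\; 0.10 ,
\]
\[
\text{and for a chain of } k \text{ rules, each with error at most } \varepsilon:
\qquad
\Pr[\text{conclusion wrong}] \;\le\; k\,\varepsilon .
\]
```

Under this reading the guarantee degrades linearly with the depth of the chain, which matches the remark that you lose accuracy the deeper the chaining.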
And the idea is that if you want to do logic on learned knowledge in a principled way, in a big machine, it seems hard to avoid such a requirement. Okay. And so everything will work in polynomial time. That's how things are defined. The only restriction you need is that the relations are of constant arity. So I can have A is above B, or A likes B; those are binary. But the costs will go up exponentially with the arity of the relations.
So we have to divide up the world into relations of constant arity. This doesn't worry people, because it's a reasonable requirement. Otherwise everything is polynomial, in fact, in the number of tokens you have in your mind's eye. Psychologists tell us it's, what, seven plus or minus two. So things are polynomial in that. So we know it's not exponential. So maybe you can have 20, maybe 30. You don't have to worry too much about that.
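The dependence on arity can be checked with a small count. The sketch below assumes the relevant cost is the number of ways of binding the arguments of one relation to tokens in the mind's eye, which is n to the power a for n tokens and arity a; the helper name `count_bindings` is made up for the illustration.

```python
from itertools import product

def count_bindings(num_tokens: int, arity: int) -> int:
    """Number of ordered argument tuples for one relation of the given arity."""
    tokens = range(num_tokens)
    return sum(1 for _ in product(tokens, repeat=arity))

# Rows: number of tokens in the mind's eye. Columns: arity 1, 2, 3.
for n in (7, 20, 30):
    print(n, [count_bindings(n, a) for a in (1, 2, 3)])
```

For a fixed arity the count grows polynomially in the number of tokens, while for a fixed number of tokens it grows exponentially in the arity, which is why the arity has to be held constant.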
Okay. So the outcome is that if you build a system on these principles, it would learn a lot; it uses lots of learning boxes, but they interact in a principled way. And what this would address, I think, is certainly acquiring knowledge which is hard to program. It has to be done by learning. And earlier attempts at building reasoning systems by programming failed because they were too brittle. Hopefully learning will get you out of the brittleness.
Okay. So I think what the Aristotle example suggests is that we reasonably often reason in cases where there are few direct examples, and that's what this solves. And maybe a general comment. There's a very general issue everyone discusses in machine learning, which is the idea of explanations. Okay, so people don't like black box machine learning because there are no explanations.
So this kind of system is kind of half a solution, because what we're saying is that we're going to have lots of black boxes, but each black box is going to predict something you understand, some word you want. You're going to choose the terms in which you want your problem understood. And once you've got a prediction, you've got lots of black boxes there, but you understand which of the features you care about are being explained.
And so this idea of explanations being only half a level, I think, is quite appropriate, because I think it's similar with human explanation. So if you ask me, you know, why did I bring my umbrella, I'll say, I thought it was going to rain. And you say, well, so what? Well, I don't want to get wet. But if you keep asking me questions, at some point I'll say, I don't know the answer.
Okay. So our explanations are also only up to a certain level; that's as far as we can explain what we think. So computers won't be able to explain everything to you either; they will stop giving explanations at some point too. So this kind of system gives explanations in terms of what you request. And maybe beyond that it's hopeless anyway. Okay. So by a machine being educated rather than trained,
I mean that when the machine learns, it doesn't know how what it has learned is going to be used. So when you train one single machine learning box, there's a lot of knowledge that goes into it. But the only thing this knowledge will be able to do is to predict exactly what you had in mind when you were training it. Okay. Now we learn all kinds of stuff in college and elsewhere, and then we can apply it to a new situation.
And this is very much like having, you know, many black boxes learning in parallel and having a principled way of using them to make a prediction or an explanation in a new situation. Okay. So very quickly, the difficulties. Well, the main difficulty is getting good training sets. In machine learning, obviously, good training sets were very important in the recent developments. This is a challenge. Okay, so where do I get training material?
I want to know the colour of an elephant. I put these different options into Google and I find this. So then I decide I want better data than this, so I go to Google Scholar. Okay, then I find this. Okay, so this is good. Okay. So it seems that getting good data sets is a problem. Okay. So let's forget about this. So I think what's needed is a really big experiment, which basically needs big new data sets which can test something like this.
So just as the big vision data sets produced six, seven years ago were very important and influential, what we need is big enough data sets with good information, which challenge this requirement of doing reasoning in a broad enough context to be interesting. Okay. So the general summary is that what we're good at is throwing computational power at something.
And I'm suggesting that we should throw computational power at something where we know that there's a real phenomenon. Supervised learning is one. But if we want to broaden it to intelligence, then we have to first decide what the real phenomenon is and then throw computational power at it so that it.
