[Auto-generated transcript. Edits may have been applied for clarity.] So. Welcome, everybody. Welcome to this lecture. The series of lectures founding doctor Christopher Strangio, who was a leading figure in the history of our Department of History of Programming languages. Before introducing today the speaker, I want to thank our sponsors for Oxford Asset Management, who've been, uh, generous to be sponsoring this series since 2015.
So it's thanks to them that we can bring such excellent speakers, um, um, students, uh, researchers, to find out about their company, um, invited to meet with them after what we have thought we were. Okay. So how it's going to work is I'm going to introduce percussion, and then it's going to enlighten us and, uh, awkward questions. Um, they're going to be people running around with microphones. Um, before you do that okay.
So I'm really, really, really delighted to introduce Percussion and Garden, um, who is a super well-known figure in the foundations of science. And about it so low that I get a reminder every single day of that. Probably you didn't know that. And that's because I was I walked down the corridor. Um, there are probably some students here from the kind of quantum office, and, um, they have a poster in their office with a picture of the cast that you got in there.
So I see this every single day. And, uh, this picture was from March Fest, which was held here in Oxford in 2014 to celebrate a birthday. And so and, uh, honour and that I thought I'd read you a little bit of the blurb from Prakash. That's not all of it. Um, so it says of God has worked on a large variety of topics, including probabilistic and concurrent computation, logics and duality, and quantum information and computation.
Despite the enormous breadth of his research, he's made significant and deep contributions. For example, we entered this book a logic and also a real valued interpretation of the logic to capture equivalence of probabilistic processes quantitatively. So we're going to have some think about that today because I say probabilistic by simulation.
And he's talked about that something new since uh, first is the because he's also been through a lot of, um, important work in learning and learning theory. So he was a founding member of the Reasoning and Learning Lab at McGill and also the Montreal Institute of Learning algorithm. So we're going to be getting some of that today as well. Okay.
So I'm not going to tell you all the accolades, but I just want to very briefly point out that there's actually two plus two time papers in this, which are to celebrate papers that can have a group of students, in fact, a one year period is a fellow of the Royal Society of Canada. I want to guess. Yeah. Um, founding chair of ACM Sigma. So, um, at this point, I will turn it over to you. That's wonderful. Thanks very much. You. Right.
So first of all, it's a thrill and an honour to be here. Uh, I spent a lot of time at Oxford. Two different one year sabbaticals, and I was trying to count up how many talks I've given on. I think it's over 20 talks that I've given here in Oxford. Never in this room. Never been greeted by a dinosaur before. So, uh, I'm very happy to be here. And I'm thrilled to see so many people turning out to hear me. Okay, so, um, let me begin.
I have to begin with Christopher Screechy because he made a fundamental contribution. He, along with Dijkstra and Hoare, with the people who really started thinking about programming mathematically. And this has had a huge influence, not just in programming languages, but in related areas like verification and so on.
And of course, Christopher Stretchy and Dana Scott founded the so called Scott stretchy approach to denotation semantics, which is one of the things that sucked me into computer science from my earlier profession, which was physics. So, uh, I decided to show you a lot more photographs of people than than Greek letters. So there's quite a gallery of pictures coming up.
So there are the whole concept of by simulation, um, arose from the work of these two giants of concurrency theory, Robin Milner and David Park. Um, so Milner, the first one who conceptualised it and came up with the first description, but Park gave a much more elegant six point approach. And that's been the fruitful, uh, way of presenting by simulation, which has readily generalised to other subjects, other domains.
I was personally inspired by a number of people. One, uh, Dexter Kozan, who was one of the people who first began thinking about probabilistic systems on continuous state spaces in a computer science context. Of course, people doing Markov process theory, thinking about things people in computer science had been thinking about discrete Markov chains. But Dexter really started thinking about measure theory. And.
And related topics. And he taught a lecture course at Cornell when I was a young assistant professor there. And I sat through this lecture course, and I don't remember a thing about it. If I had, it would have saved me several years, which I had to rediscover many of these things later. Another significant source of inspiration was Bill Ovia, who actually formulated a notion of a category of probabilistic relations back in the early 60s.
Um, tragically, he died just a couple of years ago. But he is legendary for his incredible insight into different things. Wasn't always able to prove stuff, but he was always able to connect stuff and and have, uh, other people fill in the gaps. And here's a person who most people haven't heard of. Her name is Michelle Giri, and she's the one who gave a kind of monad theoretic way of thinking about the category of relations that, uh, formulated.
So let me, uh. Suspend the pictures here for a minute and talk about behavioural equivalence. Now, behavioural equivalence simply means what it says. How do I know that two systems are behaving the same way? And in particular, we can look at two states within a system and ask when do they have exactly the same behaviour? Now of course, this depends on what aspect of the behaviour one is able to observe. So if you're observing the older each state that that means one thing.
If you just observing certain reactions or certain things being emitted outputs, then the notion is different and we will see variations on this idea. What we do we want this equivalence relation to guarantee. Well, uh, if two states are equivalent, we should not be able to see any difference in their observable behaviour. This is kind of repeating what I have already said. It's trying to capture some notion of observable aspect of the behaviour and from the temporal point of view.
If we start out with two states that are equivalent, they should stay equivalent as they evolve. Otherwise, one of the observations you can make is aha. They've evolved into different, noticeably different things. And that certainly could not have been equivalent to start with. So let me just indulge in a bit of history. So the some of the ideas go back to, uh, Kanter and the back and forth argument.
Now, all of you know, Cantor's diagonalization argument, but he actually proved many things, one of which was the, um, only countable dense linear order without end points is the rationals. And the argument that he gave to prove it is essentially, uh, what he called a back and forth argument, very similar in spirit to what we will see in simulation. So I like to mention that as part of the pre-history of Pi simulation. Yeah. Uh, sorry to all of you asking a question. Taking pictures.
Yeah. Take a picture. Okay. So, uh. The concept of lump ability was mooted in the Markov chain community in the 60s, and this word is very evocative. It means when things are equivalent you can lump them into one state, thereby reducing effectively the state space with which you are working. So they gave this very nice name lump ability, but it didn't catch on in the computer science community.
Uh, Milner and Park in the 1970s and 1980s formulated the notion that we will call by simulation and I will, uh, and they defined it in a non-deterministic setting without probabilities. So I will walk you through these definitions eventually. Um. Yes. So they formulated these concepts and there was the subject called process algebra, whereby people formulated different processes using algebraic notions and reasoning occasionally about them.
In fact, there was a time when every country in the EU seemed to have its own process algebra. Italy had half a dozen, and at some point it became clear that one needs deeper concepts than just the syntax of these algebras. Late in the 80s, um Kim Larson and Honest Co formulated probabilistic by simulation for discrete systems.
So now instead of just saying things can transition um transition non deterministically, you actually have probability distributions governing the behaviour of transition systems. And what's the relevant notion of pi simulation in that setting. Later on people added temporal concepts. So they were continuous time Markov chains Petri nets and then the uh probabilistic sorry performance evaluation process algebra. Is that right? Yeah. Which Jane Hillston sitting here formulated in the early 90s.
Carrying on with a bit more history. So this is where we entered the picture. As I said before, I was very keen on looking at by simulation of Markov processes, but on continuous state spaces, and this was done originally in collaboration with my outstanding student Jose de ogni, and I was able at the Imperial. Later on, we moved on from equivalence relations to metrics.
So rather than just saying things are equivalent or not equivalent, we started talking about how different are they or how how similar are they. And there was a very beautiful fixed point presentation of the work we developed due to fan Bruegel and James Morel, who is sitting somewhere in the audience. So many things happened at Oxford.
Uh, now in the machine learning community, people were looking at Markov decision processes and without knowing that by simulation had already been invented, they reinvented it. But then they become became aware of the earlier work and cited it appropriately. But essentially they were looking at Markov decision processes. But now one adds a notion of reward to the set up. And we were in the market for taking any equivalence relation and making it into a metric.
So we came up with simulation metrics for these Markov decision processes. Uh, many things happened over the years, but let me jump ahead to 2021, when we started applying some of these simulation metrics to the problem of representation learning. And this was primarily done by my former PhD student, uh, Pablo Castro, my master's student, uh, Tyler Kastner, who's still doing his PhD. Um, um, Mark Roland and DeepMind in London.
Uh, so that work that I mentioned in NeurIPS was partly very principled and structured and then hacked at the end. And that was a bit of a disappointment. I mean, but then we realised that the hack at the end actually had a nice interpretation in terms of kernels, which I won't talk about today, but that's something we very, very recently, uh, completed and published. Okay. Right. So let me begin with the technical development, which I'll try to do quickly.
And sometimes you'll see text in blue. That means you can doze off and ignore it. But if you really want to read it, you can. So we have labelled transition systems. That word transition system means you are somewhere and you move somewhere else. So where you are these things are called states. And. Um. You move. How do you move? By something happening. So these things are called actions or this sometimes labels and governed with associated with every action is a transition.
And what a transition does is it takes you from one state to another. And we indicate this with an arrow labelled by one of these actions. So these transitions can be indeterminate, which means you start from a given state. You perform a certain action, and you'll end up in one of possibly several different states. Sometimes, maybe it's uniquely specified, sometimes not. So there's the nondeterminism.
And it's, uh, nicer to read these label transitions with label written on top of the arrow, rather than this peculiar notation shown on the other side. So here's an example that Robert Milner invented to explain these concepts when they were first introduced. And I'm sure all the verification people have seen this far too many times. I can give you a nice brainteaser to work on while you while I'm explaining this, but maybe not. So this is the legendary vending machine example.
You have a vending machine. You put, uh, a cup in the right place, you insert money, then you choose between tea and coffee, and then the machine responds by dispensing tea or coffee. Coffee as appropriate. And so. This diagram, which I hope is self-explanatory, depicts the behaviour of this very simple machine. Note one thing about this machine is the interactive character of the machine. The agent does certain things like press buttons, insert money, and so on.
And the system also responds to these interactions. Now here's another one. And if you look at it, it kind of looks similar for a while. But then you notice a rather peculiar feature which is once you insert the money. Then the system immediately makes a non-deterministic transition. So here's an example of non-determinism in action. It goes to either this state or that one. Before you get a chance to choose. And once you're in one of those states, you have no choice.
You can either ask for T if you're there or ask for coffee if you're here. Okay, so this is a very draconian machine that decides what you are going to drink today. Now of course, if you ask for these two things Behaviourally behaviourally similar. Of course not. They're manifestly different. And you can observe the difference by attempting to choose tea or coffee in the previous machine. You will find no matter what you do, once you've inserted the money, you can choose whichever one you want.
And here you'll notice sometimes it rejects your choice. That's certainly a very observable difference, but. Um, right. So one of them gives us the choice. The other one makes the choice internally. But if you look at the sequences of actions that you can get, they are identical. You put the cop, you put the money. Then you have a choice between coffee and tea. And in the language of regular expressions, this is what the possible behaviours of both machines.
In short, these strings are not capturing the branching structure, which is essential to distinguish these two machines. And hence, you know, old fashioned automata theory and language theory is not going to do anything for you in analysing these concepts. So we need to go beyond language equivalence. So time for a formal definition.
We have. Two states S and T. This dotted line connecting them is supposed to suggest that the biosimilar is going to behave exactly the same way, and what the formal definition says. It's written down here with all the symbols, but it says if this one, whatever action you choose, this one will go to some state's prime. Then there is some state d prime such that if you do the same action on T, it will take you there. And these new states as prime themselves, continue to be related.
So there is the temporal persistence in the in the in this relationship. Also of course we want to symmetries this things which vice versa with f and t into changing their roles. So that's the formal definition, a kind of a step by step account of what it means to be behaviourally equivalent. And what are you observing in this setting, whether or not your system responds to certain actions? It might happen that the system says I'm pressing the A button.
Then it just says, forget it, I'm not doing it. So your observation is what actions does the system make? Except you don't actually see the internal structure of the states.
Now we can make all this probabilistic by introducing discrete probabilistic transition systems so that just like the old ones, except that now the final state, instead of just being told it's one of these 17 states that you might go to, you're actually given a probability distribution with numerical quantities telling you how probable a certain final state is.
So just like before, we have states actions, but now associated with every action is this transition matrix or transition function that says start to the state's do the action a and there'll be some specific distribution over the final states. Now this model is reactive, which means all this probabilistic phenomenon is internal. We are not trying to associate probabilities with the actual external actions. There is no such thing as probability actually happens with this probability.
An action B happens with this other probability, just given that A happened. Here's how it responds. So let's ask about probabilistic by simulation as defined originally by Larsen in school. This was a very fundamental and seminal paper on the subject. So here are some systems. I hope the notation is clear. These S's and T's are states. I haven't bothered to label these ones because there are so-called dead states where nothing can happen.
These arrows are labelled by both an action as well as a number. And the way you should read this is if you are in state S0 and you do an action A with one third probability, you wind up in each one of these states. Uh. And once you're in the state S1 with probability one, you can do a B action and come to this dead state.
You can't do any of the other actions. You can't do A, you can't do C, you can only do B similarly, here and here you can only do C, and it'll happen with probability one that you will end up in this dead state. Now this one. Looks very similar. And the question is is S0 biosimilar behaviourally equivalent to t0? And you look at it and you see what I can't find a perfect matching between transitions like I was doing before.
And then you realise, oh, what you need to do is view this S2 and S3 kind of as a lump in equivalence class and add up the probabilities. Then this particular combination. Is exactly behaving like T2, and you can see from T2 it's not just the input but the outputs that are behaving the same. Either one of those states does a transaction a transition with probability one to a dead state. That's exactly what happens here. The be. Uh, sorry. The this transition is perfectly matched by this pair.
So, in fact, once you realised that you have to add these things. You do have a notion of by simulation, but you can't just use the arrow by arrow matching definition that you had before. So you need to add up the probabilities. And here's blue text formalising this. But I will skip over it. Now. Markov decision processes are probabilistic versions of label transition systems, but the, um, the uh final state is governed by a probability distribution.
No other indeterminacy. And now there's a reward associated with each transition. And what we observe are the interactions. Does it accept an action A and what is the reward. So we don't again we don't see the internal states. So again the formal definition looks like this a set of states a set of actions associated with every action. A probability function that says if you do action A on state's, you'll get this distribution over the final states.
And here's the reward function that says if you do action A and state's, you get a certain numerical reward. Sometimes they make the rewards live in the range zero one, sometimes on the positive, whatever doesn't make fundamental difference. This, uh, reward. I've made it a deterministic function, but without too much difficulty or doing violence to the theory, you can make the reward itself a probabilistic thing. There's something very different that happens in this point of view.
That isn't the case with how we dealt with these systems in the world of verification and verification. We wanted to say things like, no matter what this system will do, such and such, this plane won't crash, for example, um, but here you're doing something else. You're controlling the system by choosing the actions. And so we have this concept called a policy. So we have these, uh, Markov decision processes. And our policy says given a state's I'm going to choose the action based on that state.
But that choice may be. Probabilistic. So there's a distribution over actions. That happens once you choose a particular policy pile. And once you've chosen a policy, your MDP becomes just like a probabilistic transition system and rich with the reward concept. So the goal is to choose the best policy. And a lot of machine learning, especially reinforcement learning, is about the numerous algorithms that exist to find or approximate an optimal policy.
So a lot of the technology of optimisation theory comes into play here. Now what's the equivalence concept appropriate to this setting. So let be an equivalence relation. So you notice the use of the indefinite article and equivalence relation. It's just a uh something I'm going to put as a property that relations might or might not have. So given a relation ah, it might or might not be a by simulation relation.
So if R is a by simulation relation, if you've got a pair of related states s and t. And then for every action and every equivalence class of this are first the rewards match, and second the transition probabilities. Into the equivalence classes match. So the fact that we're using. Equivalence classes amounts to lumping things together so that you're adding up the probabilities appropriately. And this is the basic pattern. You have some kind of immediate matching that must happen.
Otherwise you'll see right away they're different. So the rewards must match. And then you want temporal persistence of this behaviour. And that comes from the second condition which is typically called the induction condition. Now this is in blue, but I'm going to read it anyway. So, uh, this definition is kind of intuitive and clear, but it can also be defined as the greatest fixed point of an appropriate relation transformer.
Now, as I said before, I want to move to the world of continuous state spaces. And one of the questions you can ask is why? And you can't answer. All because I'm having fun, which is usually the real reason. Uh, but the setting is real. You have software controllers attached to physical devices on sensors like robots or controllers.
Around the time I began working in this area, one of my colleagues got a contract to consult with the avionics industry, and they were very interested and it didn't really get off the ground, but at least we began that way. And I remember the first meeting that I asked them. How do you verify the correctness of your software? And they say, we have meetings. And I said, what if it's really safety critical software? We have lots of meetings for those. So did not really, uh, enhance my confidence.
But planes have not crashed for software failures yet. But nevertheless, it's an important thing to reason about. So the setting we're going to look at is continuous state space but discrete time. And you might question if you're going to make the state space continuous. Why not the temporal part. Well the main reason was it's much harder. But we're doing that now. But at the time we decided to stick with discrete time.
And it's not such an unreasonable thing because many of these controllers sample, you know, 50 times a second, maybe assign an NDA not to not say that, but anyway, sample some number of times per second. And based on this discrete temporal sample, it makes some decisions. So it does look even though of course dynamical systems are, you know, in the continuous time the software that you're writing thinks it's in discrete time.
So there were applications to control systems, to probabilistic programming languages, because once you put an iteration or recursion into a language with even with discrete probabilistic choice, you will effectively get, uh, continuous probability spaces by looking at the space of trajectories.
So let me make some more remarks about this. So of course these continuous state spaces can be used for reasoning, but maybe it would be much better if we could have a finite state version, whatever that means. Exactly. And in fact, we spent a lot of effort, which I'm not going to talk about today, about approximating continuous spaces with finite state ones, uh, with some appropriate notion of approximation.
But you could say, why not just discrete discretized right away and never worry about the continuous case? Well, if you don't have the continuous case, the original thing to compare with, how can you tell that your approximation is a good one? And secondly, if you've thrown away the original continuous model and you've just got a particular discrete one, you can refine it, you can coarse in it, but you can refine it further.
Okay. So, uh, let me go from that, those considerations to the mathematical ones and that the first one is the need for measure theory. Why do we have to do all this measure theory nonsense? Um, and the basic fact is, there are subsets of the real line for which no sensible notion of size can be defined. So size, of course, is the heart of measure theory. So you look at the real line. You have a notion of interval. You have a notion of length of interval.
You start getting more sophisticated sets and ask what is the analogue of the notion of length. You might ask, what is the length of the rational numbers? Turns out to be zero, but you can ask questions like that. What is the length of the Cantor set to zero even though it's uncomfortable? So it tells you that accountability is absolutely the wrong way of measuring size in this setting, where you do want some kind of geometric intuition underpinning all this.
So the fact is, until you invoke measure theory and sigma algebras and all the apparatus that goes with it, you can't just willy nilly talk about sizes and probabilities in a continuous setting. The more precise statement due to Vitale, is that there is no translation invariant measure defined on all subsets of the reals.
There are, of course, things that are not translation invariant, like the measure that says any set containing the zero has complete one and anything else has zero, but that's certainly not translation invariant. All right. So we went to this this continuous world. We use the same essentially the same definition of probabilistic by simulation. And we. Found some curious things happening. So first of all, even before us, right.
Going back to, uh, Robin Milner's work, he and Matthew Hennessey and independently, um, uh, God, who's the Dutch guy? Samson. Uh. So Christian, who is the one who did logical characterisation of Van Bentham? Thank you for Van Bentham always gets forgotten. Anyway. Um. And what they did was they showed that two systems are by similar if and only if. So that's very strong. They agree on all formulas of a specific logic that you can write down.
Now, this is very important because it tells you it's, uh, how to show things are not dissimilar. If you want to show things are by similar. Yes. You try and construct one of those relations and see if you can get the two states in question. Related. But if you tell me they're not dissimilar, how do I show that? I can't say. Well, I tried very hard to find a relation and failed. What you need is some result like this. And you can exhibit a formula and say, see, they disagree on this formula.
Now the formula formulas that appeared in the original logic had all kinds of things disjunction, negation specifically. And in the case you would and of course it was restricted to finite branching systems. And when we began, we were told by all the wise men in the field that we had no hope of doing an analogous thing in this kind of continuous setting. We lacked the kind of discreet structure that you would expect in logic. And so our logic had these formulas.
Just the atom true binary conjunction. And this peculiar modal formula. So this action is inside a little diamond and sub scripted by a rational number Q. And it says a state satisfies this formula if it can perform an action with probability at least q and wind up in a state that then satisfies phi. So this is an inductive definition of formulas.
And the result we were able to prove was that two systems are dissimilar if and only if they obeyed the same formulas of this logic L. So this is a striking because no negation, no disjunction, even in the continuous setting, which is certainly more than just the infinite discrete setting, uh, binary conjunction suffices and. I remember vividly that, uh, when my student first presented this. People jumped up very kindly because, you know, they were kind to.
It was a ten minute presentation at a summer school where the students were allowed to present. And, um, they very politely told her, oh, uh, this can't possibly be right. And she didn't know English very well. She was she's a French speaker from Montreal. And she said, be quiet at the end of my talk you will be convinced. So, uh, I was I wished I had the confidence to say stuff like that. Uh, so the important thing is, as I said before, no finite branching assumption, no negation.
This was what people were shocked by. They said, how can this possibly be? Uh, the proof is entirely different from the combinatorial proofs that existed before. Which is why they weren't subject to those, um, to that dependence on discreteness. So it uses tools from, of all things, descriptive set theory as well as measure theory.
Uh, right. So as I said before, such a theorem was originally proved for non probabilistic systems with finite branching restrictions and later by uh, a lot of time in school in the discrete probabilistic setting with a lot of restrictions on how much the branching was controlled and so on. And because of the power of the techniques we used, we were able to get. So you see, these results are even true for discrete systems after the fact.
But, uh, this does logical characterisation of discrete probabilistic systems. But it was only by going through this, uh, um, excursion that what one was able to prove it at first. Later on, of course, people did provide purely discrete proofs of it. And I must at this point pay homage to my former student. She's now a full professor. Jose, they are nay. Basically, she proved everything difficult. Uh, I proved many things incorrectly, and she fixed them.
So I call her my proof engine because she made. She, uh. I think we were a very good partnership. Because I would say, yes, let's do this. And much work like this. And she would say, no, no, no, it doesn't work like that. And sure enough, she would be right, but wasn't far from what I said. So it was always very, uh, nice working with her. Okay, so that's the story of probabilistic pi simulation. But we'd like to ask, is this notion of exact equivalence reasonable?
We are demanding exact numerical equality of these probability numbers. And that really is not reasonable, because a small perturbation in these numbers is suddenly going to lead you to say, uh, these two states are no longer bi similar. They had been. But then I changed this half to a half plus epsilon and the other half to a half minus epsilon. Now they're not dissimilar. Yes. No. So. We should have some metrical notion that tells you. How close are they?
And this is where we went to next. So, and Smolka proposed this idea, but they weren't able to come up with a metric that matched with by simulation. So we want a quantitative measurement of the distinction between processes. And I can usually I give a one hour talk at this point. So I'm going to skip that in lieu of all these Greek symbols and letters I'm just going to uh, tell you very heuristically what happened.
So remember this logical characterisation result. I'm going to lie a little bit, but please bear with me. So the logical characterisation results as if two things are similar. They will agree on all formulas. So if they're not similar there will be some formula they disagree on. So look for the shortest formula on which they disagree. If it's a huge formula.
If the if a huge formula is the shortest formula on which they disagree, then they should be close because it takes a long and elaborate experiment before you discern any difference. If they disagree on a short formula, um, they're very visibly different and they should be given a large distance. So, um. They may disagree on the rewards, or they may disagree on the probability distribution that results from a transaction. So we need to measure distances between probability distributions.
And there is a well known metric. Well known, but the name is always given incorrectly. Uh, it's called the Cantor. Which metric between probability distributions. So the reason I stroke through the Wasserstein is because that's what everyone calls it. But, uh, it's very weird because people who've never read the paper cited a 1969 paper by Varsha Stein, uh, in which they say Washington introduced this metric in 1969. And Cantor, it proved this thing about it in 1958. And you go, hmm.
Little bit of temporal anomaly there. Anyway. So it is actually introduced by Cantor, which in the 40s. This is such a nice and natural metric that has been discovered many times, and I'm sure Roger Stein discovered it without, you know, uh, looking at Cantor, which is work. But the fact is Cantor, which was 20 years earlier. So if the difference shows up only after a long and elaborate observation, then we should make the states nearby in the by simulation metric.
So one should think of these formulas not just as logical mumbo jumbo. They're actually descriptions of experiments that you can do to see how things behave. Right? Because they say, do this action. Do that action. See what the probabilities are. So all this can be formalised. It's rather more complicated than this nice intuition that I've been giving you. And it was originally done by us, by Josie and the and her band of followers.
Uh, but later, with a beautiful fixed point construction by fan Bruegel and world, who, uh, actually gave a much more fruitful definition. And all the work we've done subsequently follows their presentation of this metric. A master student of mine. Later a PhD student added rewards. So this is where we came up with by simulation metrics for Markov Decision processes.
And the student whose name is Norm funds. Actually proved an important property that was completely different from anything we had shown before, which was that the optimal if you look at the optimal value function in an MDP and you compare it at two different points, that difference in the value of the optimal value function is bounded by some constant times the by simulation metric. So the by simulation metric is not just some random thing.
It's actually closely related to the notion of optimal value function in reinforcement learning. So let me say a bit about reinforcement learning. Um, so we are often dealing with large or infinite transition systems whose behaviour is probabilistic. Again, the system responds to stimuli and moves to a new state probabilistically and outputs a reward, and we seek optimal policies. As I trained before, a plethora of algorithms and techniques exist.
But the cost of these things depends on the size of the state space. Uh, in a polynomial way worse than linear. And if you can shrink these things and get something that's effectively the same but smaller, it will help you a lot. Um, and so there's this other thing you can do called representations, where you say instead of actually using the exact state space, let's try and use something called features that capture aspect of the state space.
And can we use these representations to accelerate the learning process by shrinking the your search space? Um, um. So basically, for large state spaces, learning a value function like that is simply not feasible. So instead we defined a new space of features and try to come up with an embedding. Of your state space into the vector space indexed by the features.
Then we can try to use this space, which is hopefully much lower dimensional than S, to predict values associated with state action pairs. And what this phrase representation learning means. Learning such a high. Okay, so you can. Of course. Try to cook up with cook up fires, but it would be nice if your system not only used such a representation, but learned a good one. Now, these elements of M are basically features that are chosen by the person developing the system.
They can be based on any kind of prior knowledge or experience about the task at hand. Right. So, uh, it was actually Pablo Castro and Mark Roland at the Find and Google Brain, respectively. Actually the other way around. Sorry. Um, who started working on on this? And say they wanted to use the Cantor which metric at first to compute. But they found it expensive and difficult to estimate. And in particular, you can't find an unbiased estimator for the Cantor with metric.
So they came up with something else. They said. Let's use a slimmed down, easy version that that can in fact be estimated. And it's around this time that we joined them. Um, so we, as my student Pilar and I joined Pablo Castro and Mark Roland. So we had this kind of simplified metric. Closely related to the by simulation metric. But it is a very crude approximation. And in fact it's not even a metric.
So there are many distances out there that you know, you'd like them to be metrics, and they turn out not to be. KL divergence is a famous one, for example. So it's a new type of distance and we called it a diffuse metric. Maybe not the best terminology, but anyway that's what I called it. So here are the axioms for a diffuse metric. So the distances it assigns to two pairs of points have to be non-negative.
It's symmetric. It satisfies the triangle inequality, which gives you its geometric flavour. But we do not require the distance of a point to itself is zero. This is perhaps the most drastic violation of the metric axioms that you can imagine. Lots of people have imagined violating triangle symmetry. You get quasi metrics. Lots of people have imagined having things x, y at zero distance.
Those are called pseudo metrics. You can combine both those things and call them pseudo quasi metrics and so on. But this, uh, I thought nobody else had thought of. But it is not true. A couple of Polish people had had a metric like this. Um. And it's a bit shocking. It's like the Red Queen having to run to stay in one place. So, um, what can one say about such a thing? If you ask what's really going on? Perhaps I can give an example.
Supposing you are looking for crude ways of measuring the distance of probability distributions for the all kinds of things you can do, like control of it, etc. but one thing you can just do is you say, I have a so I have an underlying metric space. So you can say I'll sample two points. And measured their average distance. Good. Uh, maybe this will give me a measure of how far apart two probability distributions are.
And so, um. You can try this, but then you'll find if you have any distribution other than the direct distribution, you might be sampling points that are far apart. And it'll tell you the distance from this to itself is not zero. Uh, so it's a very it actually arises very naturally. And that's why you have to do clever things to get real metrics. So we call this metric the Meeko distance.
So as I said, it's a distance. It's not a metric. Now, nearly all the machine learning algorithms I've seen are optimisation algorithms. And one often introduces extra terms into the objective function that push the solution in a desired direction. So what we did was we said, okay, here's this proxy for the basic relation metric spiritually inspired by by simulation, but not exactly. And we defined a loss term based on the micro distance.
So Pablo Castro was the main experimentalist I have the last time I know the program was probably a five line program in ML for a functional programming class, so I certainly didn't do any of the implementation here. And he has a beautiful write up on his blog, where he describes all the experiments and the details and everything. I will, uh, say a bit, but it's the kind of thing I can't really answer questions about.
But at least I can say something about the setup. So we have basically two copies of the neural net. It computes the representation psi is the function approximator. And here is some kind of loss function. Usually this one is based on temporal difference. But then what we do is we will say we penalise you if your, uh, representations are too far apart from the Meeko distance. Okay. So the normal loss term is just this.
In fact normally you just have one of these. But we take a pair of these do the same thing. But then we say an additional loss term coupling these two together which is. Try and keep them within Meeko distance of each other. And you see, this is a kind of a generic thing that you can do. It's not just tuned for one other particular algorithm. And so we added this Mike Meeko last term to a variety of existing agents. All of those are available in the dopamine library.
We ran each game five times with new seeds, so 300 runs for each agent, and we ran each game for 200 million environment interactions. And what we tested this on was the Atari suite. So we looked at the final scores and the learning curves.
Uh, and so what we found was that on two thirds of the games, putting the Meeko loss term through a variety of gain, variety of agents, about two thirds of the time, it quite significantly improved the performance in terms of time taken to learn and, uh, performance thereafter. So, yeah, every agent performed better. Yes, that's true. Every single agent seemed to be improved by this. And those agents are quite different from each other.
So here are some results. But you know, just like some people are scared of Greek letters, I'm scared of graphs. So I'll skip these. Uh, and. So let me wrap up by mentioning a few other developments. Um. We did a lot of work on approximation theory, which I didn't talk about today because it's another 1.5 hour talk. But essentially what we were able to do is approximate continuous state systems by finite state ones.
And in what sense are they close precisely in terms of the by simulation metric, you can actually define a sequence of finite state systems that converges in that metric to the infinite state system. Now you might say, yes, metrical reasoning is all very well, but you know where algebra is, or many of us are. And we'd like to capture the notion of distance in an equation setting. So we introduced this notion of approximate equality.
So you take the equal symbol and annotate it with a small real number epsilon. And you say s equals epsilon. T is the same as s is within epsilon of t rather than s is exactly the same as t. Now of course you would like to realise this is not an equivalence relation. But just because it's not an equivalence relation doesn't mean it's rubbish. It's actually something called a uniformity. Uh, and from which one connects, one can basically produce an enhanced kind of notion of quantitative algebra.
And you can develop free algebra as you can develop the concept of equations with theories. You can develop completeness theorems. And you can develop ask. So what is the free algebra if you do this in that and the other. So there are simple equation theories you can write down. And if you ask what are the free algebras out of the blue they turn out to be. Probability spaces with the Cantor. Which metric? Even though your activities, doesn't say anything about probability.
So that was striking for us that it's really taking this metric reasoning and putting it in an equation setting. One has to be a bit careful. One has to take care of the what I call the ergonomics of reasoning, because now it costs it costs epsilons to chain these equalities together. So there is a lot of categorical mumbo jumbo. So I'm not going to tell you that. But, uh, for those who are interested, but I've given talks here before in which I've talked about this precisely.
One of the other recent things we've done is exploit symmetry. So that's where there's a group action on the state space. And you want to look at the orbit space, learn on the orbit space and pull it back. Pull back the policy you've learned to the original state space. This is a. Sounds simple, but wasn't so easy. Fortunately, I had yet another brilliant student who, uh, her name is, uh, Ruthie, and she's doing her PhD at Harvard now.
And so she took my vague and confused ideas and made them work. So that was a very nice, uh, line of work. I have yet another brilliant student. Uh, Florence Clark is her name, and so she is the one who started working on continuous space and time. So things like diffusion processes, the general class, are called attention processes, turns out to be much more difficult. It's not a cheap hack of what we did before because the notion of step is gone.
Things are flowing. They're not jumping. So, um, you have to do something else. You have to look at the space of trajectories. The theory gets much more painful. Fortunately, she likes pain. So, um, um, she actually managed to to, uh, hack her way through this. And as I said, uh, this Meeko metric, I mean, it was a nice thing and so on, but at the end, we had to do some hackery to get rid of the non-zero cells, distances and so on.
And at the time we were dissatisfied. It seemed to be a beautiful subject until the end point you had to say, well, we have to cut off the self distances in order to get something that looks more like a distance. But it turns out one can look at this from the viewpoint of kernels, and then it comes out in a quite a principled way. So this is, uh, work that we published in HTML. Ah, ah, yes. Last year.
Other work that I've done, uh, is with my former student Joey both who was a postdoc here, or is a postdoc here with the, uh, generative, um, learning groups. So we worked on normalising flows on hyperbolic spaces. I was really just a geometry consultant. They they did it. Uh, and we also did some work on generative algorithms based on solving stochastic differential equations on Romanian manifolds. So there's a lot of nice mathematics that one can do in this setting.
So let me conclude quickly, uh, by simulation, has a rich and venerable history. In fact, many people discover it all the time. I would say once every 5 to 10 years, somebody invents pi simulation. Mhm. But that just shows you the naturalness of the concept. The metric analogue holds promise for quantitative reasoning and approximation, and I think looking forward to pushing this line in future work.
And basically research is alive and well. And there are new areas where by simulation is being discovered all the time. So I'll just show you some some more photographs. These are collaborators. This is Gordon Plotkin, who's best kept in the cage. Here are some other collaborators. This is Norm funds. Uh, and this is Pablo Tyler. Vineet Gupta and Robert worked closely with me. I didn't put Josie's picture back here, but. Uh, these two boys and I work together with Josie.
We often call her Jose and her Indian gang. So, uh, all the approximation work was done with them. And I'd like to give special thanks to very, uh, to close and dear friends, Diana prick up and Doyle Pino, who I got at one time. I was their mentor. Now they're superstars. Uh, um, and, uh, they're the ones who inspired me to move from the area in which I was in, which was programming languages which I haven't abandoned, but which I have, uh, moved to also think about machine learning.
Okay. Thank you.
