Strachey Lecture: From classical to non-classical stochastic shortest path problems - podcast episode cover

Strachey Lecture: From classical to non-classical stochastic shortest path problems

Feb 06, 2024•57 min
--:--
--:--
Download Metacast podcast app
Listen to this episode in Metacast mobile app
Don't just listen to podcasts. Learn from them with transcripts, summaries, and chapters for every episode. Skim, search, and bookmark insights. Learn more

Episode description

Professor Christel Baier delivers the Hillary Term 2024 Strachey Lecture Abstract: The classical stochastic shortest path (SSP) problems asks to find a policy for traversing a weighted stochastic graph until reaching a distinguished goal state that minimizes the expected accumulated weight. SSP problems have numerous applications in, e.g., operations research, artificial intelligence, robotics and verification of probabilistic systems. The underlying graph model is a finite-state Markov decision process (MDP) with integer weights for its state-action pairs. Prominent results are the existence of optimal memoryless deterministic policies together with linear programming techniques and value and policy iteration to compute such policies and their values. These results rely on the assumption that the minimum under all proper policies that reach the goal state almost surely exists. Early work on the SSP problems goes back to the 1960s-1990s and makes additional assumptions. Complete algorithms that only require the existence of proper policies combine these techniques with a pre-analysis of end components, an elegant graph-theoretic concept for MDPs that has been introduced by de Alfaro in the late 1990s. The talk will start with a summary of these results. The second part of the talk presents more recent results for variants of the classical SSP. The conditional and partial SSP drop the assumption that the goal state must be reached almost surely and ask to minimize the expected accumulated weight under the condition that the goal will be reached (conditional SSP) resp. assign value 0 to all paths that do not reach the goal state (partial SSP). Other variants take into account aspects of risk-awareness, e.g., by studying the conditional value-at-risk or the variance-penalized expected accumulated weight. While the classical SSP problem is solvable in polynomial time, such non-classical SSP problems are computationally much harder. For the general case, the decidability status of such non-classical SSP problems is unknown, but they have been shown to be at least as hard as the Skolem problem (and even as the positivity problem). However, for non-positive weights, the conditional, partial and variance-penalized SSP problem are solvable in exponential time with a PSPACE lower bounds for acyclic MDPs Speaker bio: Christel Baier is a full professor and head of the chair for Algebraic and Logic Foundations of Computer Science at the Faculty of Computer Science of the Technische Universität Dresden since 2006. Since 2022 she holds an honorary doctorate (Dr. rer. nat. h.c.) from RWTH Aachen. From the University of Mannheim she received her Diploma in Mathematics in 1990, her Ph.D. in Computer Science in 1994 and her Habilitation in 1999. She was an associate professor for Theoretical Computer Science at the University of Bonn from 1999 to 2006. She was member of the DFG (German Research Foundation) review board for computer science from 2012 to 2019, and is a member of the DFG senate commission on Research Training Groups since 2020. Since 2011 she is a member of Academia Europaea. Her research focuses on modeling, specification and verification of reactive systems, quantitative analysis of stochastic systems, probabilistic model checking, temporal and modal logics and automata theory.

Transcript

They. Okay, Welcome, everybody to the strategy lecture. So this is our distinguished only lecture, which is named for Christopher Straight G, who founded the programming research Group here and was one of the leading figures in the early history of computing. Before I introduce our amazing speaker, I want to thank our sponsors, Oxford Asset Management, to spend generously supporting this hearing since 2015, and they enabled us to bring a great future here.

I think some of you have done some things that researchers and speakers want to meet them afterwards, and that's okay. So it's a great pleasure to introduce Chris Buyer and Chris on super well known for her work in verification. And actually, Crystal is also very modest because the biography she gave me omitted all sorts of things. But luckily for me I used Google and discovered Discover for myself.

So I discovered, for example, that for her textbook, the principles of model checking has been cited more than 8000 times and is essentially the book in model checking. And she also has some very, very influential papers, including a work called Model checking algorithms for continuous time Markov chains, which again has really very, very many citations and introduces both the new logic and new model checking algorithms.

Crystal is the author of the verification column in ACM SIGGRAPH News and has a bunch of honours, including Academy Award Europa and two honorary doctorate from Orkin, and she's really active in supporting German computer science. So I found Crystal on just about every German computer science committee that exists. She's here today talking to us. The title of the talk is From Classical to Non-Classical is to cast the shortest path and welcome Crystal. But this letter. Yes, like that. Yeah.

Thank you. That's the Mike Rook. It does. It does. Okay. Yes. Thank you very much for inviting me. Thank you very much for organising this event. Thank you very much for sponsoring here. It's a great honour for me to give this presentation. Well I choose to a topic which has a long history. So. So indeed, first research was done 60 years ago, but in the end, for this non-classical versions of the shortest path problem then than it, it's more more recent work.

And before starting with the technical things, I would also like to say it is almost 30 years ago that I visited. We had Kostka in Birmingham. We just were thinking which year it was 95 or 96 and this was for me, it was time to get in touch with with questions concerning the verification of probabilistic programs or probabilistic systems. And at this time, I also read these old papers about the classical stochastic sort of path problem, and I would have never thought that.

So two years later I will give you a talk about this topic. So I would have expected that all the problems have been solved. Uh, yeah. So what is the question of oops of the classical tortoise past problem? So more or less the task is to construct a strategy or policy for traversing a graph, the probabilities attached such that the expected accumulated costs until reaching a target is minimal, so that the main problem is to minimise the expected costs or waits. A motivation. So this is a map.

Here is Grayson, where I live here in Oxford. And going from Grayson to Oxford. Well, there are many possibilities. Taking a car, taking a train, taking a flight. Then there are no direct flights. This is an extremely small airport that we have in Grayson. Then I have to change in Frankfurt or Munich or in Berlin or whatever. And there is a lot of decisions that have to be made and there's a lot of uncertainty.

So the flight might be delayed or the flight might be cancelled, which is indeed the case with my back. But flight have been cancelled. And so it's all finding one way shortest in the sense of shortest time for me in expectation. So this is one instance of the shortest path problem.

Now, indeed, in different communities, a I planning robotics, operations research, this is a very classical and fundamental problem, but also for the verification of randomised systems or distributed randomised algorithms. Then for instance, one might ask what is the expectation, the time to termination, the maximal or minimal time to fall for, for termination or expected costs, the maximal cost that can happen in the worst case.

But now that the formalisation of oops, the formalisation of these problems is using Markov decision processes, Now, I do not want to bother you with complicated definitions, which is not that complicated in that case, and the way the simplest way to explain them, at least I find, is to to do the following. We start with the classical transition system here shown on the left, where in every state there is a choice between different actions.

And in this context, we consider the choice of different actions as a non deterministic choice, so different than for finite automaton where we expect to have the same label for a non deterministic choice. Yet the choice between alpha and beta is considered to be non deterministic. The other extreme is a Markov train where we do not have non determinism at all. Instead, for every state we have a distribution for the successful states. And Markov decision process is somehow the mixture of both.

So for every state, we have a set of action, and for each we have a distribution for the successful states. I thought, Ah, this is missing because of this one. Maybe. Uh, yeah. So it was auto fault for Alfa. We have the distribution one for three force. And likewise we have, we have distribution for the beta and now in state f we, that choice between the different actions is considered to be a non deterministic one.

And having chosen the action in a state, then we have a probabilistic choice, which is you to be an internal decision which which is indeed the successor. Now to talk about probabilities or expectations in Markov decision processes, we need some inference which resolves the non determinism, and this is what we call calculus or policies or strategies.

So they take the history, what has happened. So the sequence of states and actions that have been taken in the past and then for the current state, they choose one of their actions, possibly in a randomised way. And doing this then you can, given such a catalogue, you can unfold the behaviour of that scapula into a tree like Markov chain, and that we have the full machinery to reason about probabilities and expectations. If you are not familiar with that trust, use your intuition.

I think this would be enough for this talk. And what is this? The shortest path problem we deal with waits, integer waits that are attached to the state action path. So now here is the definition. The only important thing is that throughout this talk I've been saying MGP, I always mean a finite state. MGP only have finally many states and finite many actions. No. And as schedule can be formalised as a partial or it's a function from finite passes to distributions over elections.

This would be the randomised case, and the deterministic case is where these distributions of direct distributions, which means that a single action is chosen as probability, one is chosen deterministically unique electron. And another important notion is that of a memory less controller. They do not look for the full history, they just look for the current state. And whenever I visit the same state multiple times, I make the same decision.

This is meant by memory loss and on the following slides I often use the short form notations named for memory less deterministic and often more or less randomised. And another important concept that has been introduced by Luca and Alcohol more than 20 years ago is the notion of an end component, which, to say an easy word, is a strongly connected MVP. And by saying Sup MGP, what we do is the following For every state we have a set of actions that can be taken.

And now we take just a subset of these actions. And this gives you MGP. The probabilistic choices for these selected actions remain the same. They remain unchanged. So if you look for the underlying draft, then it's indeed a sup graph. And for this being an end component, then this graph needs to be strongly connected.

And there is one theory that has been established by a look at how he sees this 97 and also in multiple papers, what I call the fundamental property, which is somehow the MVP honour. Look to the fact that in a finite state, Markoff trains. We know that almost surely so with probability one a bottom strongly connected component will be reached and the system will stay there forever.

And the MVP on a loop uses these end components state state that no matter which catalogue you choose, then the limit of almost all infinite passes will constitute an end component and tier by the limit. The meaning. If we look for the state action pass that occurs infinitely often. Now, here's an example. If you have an M.D. P Uh, and we have. It stopped working. Does it work here? Yeah. We have two end components.

Does this still work? Yeah, we have to end components and we have this state here. Oh, that's great. Uh, but. But this one for what? This one here is a trap, states a terminal state where no action can be taken. And then lemme apply to such MVP's that my train of thought trap states means that under each schedule, almost all maximal passes for either finite passes that end in a trap or the infinite ones. They indeed end either in a trap or they eventually realise an end component.

And by realising an end component I mean that the state should pass the infinite loop of constitute an end component. And for the special case where we do not have to. Okay, but we do not have any components at all. So an easy free MGP then allows Lemma simply states that with probability one, no matter which catalogue we choose, the system will end in a in a trap state.

So now the typical task of synthesis problems for MEPs or formalise the synthesis problems would be given such a finite state MEP Find a scheduler that's maximises or minimises the expectation of some random variable. So and the input of this random variable are simply maximal parcels. So one very popular measure is the discounted rate. Remind that you have these individual rates for the for the state can pass and now this discounted rate of a maximal pass.

We take the sum of all positions in that path. We take the rate which is collected for this with the corresponding state action alpha. And we multiply it with the ice power for discount factor or gamma, and it comes from value strictly between zero and one. And with government being strictly between zero and one, then we do not have convergence problems.

Even if the sum is infinite. Total waits is more or less the same, but you can think of it with discount factor one without any discount tractor. Now if indeed convergence is an issue for infinite passes. And I would say it is not that interesting with the total wait for infinite passes. Without any further outside constraint.

What has been done in the literature is either to consider only finite passes, either in a finite horizon setting, which means that we have some fixed bound for the length of the passes that we consider. And then we look for the maximal expectation of exactly and steps or the infinite horizon case where we have a distinguished gold state and we only accumulate the weight until we reached this goal state.

And with the site assumption that we only look for those catalogues that indeed reached this goal, state with probability one. These are the so-called proper catalogues. And there is another very prominent measure it random variable which is used in the literature. This is the mean payoff or the long run average. So what we do is we take the partial sums of these total weights.

So we look for all the prefixes, the finite prefixes of a given path, and we take the average for these finite prefixes. This is this value here. And then we either take the limits or the limits of. For these values Now, given a single pass, then the limit and the limits up, they can be different values. However, what is known is that in finite Markov trains that makes no difference whether we deal with the limits or the limit. So in expectation, these are the same.

And this result carries over to two Markov decision processes where the maximum does not change. Whether we take the modelling soup and say for the minimum overall calculus. Well, and moreover, and I will not zoom into the details, there are efficient algorithms to compute a memory less deterministic scheduler where these the max or the minimum achieved, and these can be computed using linear programming techniques.

So now let's go, let's look or let's go more in the direction of the shortest path problem. And let's go look first for the discounted case. And here the key concept on the so-called bellman equation. So this is it was somewhere in the fifties. So it's quite a long time ago when he introduced this this bellman equations. Okay. So now we are in the disguises. Okay. So the task is to find a schedule that maximises the expected discounted rate.

And you go, You are the bellman equations. So the idea is to characterise the global maximum by local maxima in every state. So. S is a state. An X is a variable for the maximal expected discounted rate that can be can achieve from state as so if we treat state as an initial state. And then this value x equals the maximum of all the actions that can be taken instead.

And for each such action, we look for the weight of the state action per alpha, plus the weighted some of these values for the success of. And the rates are given by the discount factor gamma multiplied with the transition probability going from s when taking alpha to state t. And. This is a recursive equation, so X equals something. And T on the right hand side, we have two values for the successors.

So you can think of it as a function that maps C or vectors this one component for each state to two vectors. And what has been shown is that this function, thanks to this discount factor, gamma is a contracting map on a banner space, so it has a unique fixed point. And this indeed also is the key property to establish algorithms. So one of them is derived linear programming techniques.

Namely, instead of saying X is equals, the maximum, we say x is greater or equal, then all these values for each action that can be taken in in status. And to ensure that we have access equal to the maximum, we minimise all these values. So this is one thing. And the other thing is what can also be derived from this is the value integration.

So we trust. So this function, if we start with an arbitrary value or zero in every component, and then we simply apply successively implied this function F and convergence is guaranteed. And that is also a technique I will not speak about, which is called policy iterations of successively improved memory, less deterministic scheduler, and eventually destabilises it in an optimal solution. So now this was the case, this one here for discounting. And remind for totally what I said.

This is essentially the same, but without this discount factor and indeed the bellman equation itself, they are for this infinite horizon case. They still look like this unless it is already one of these distinguished gold states in this case that is simply zero, but otherwise did have the same equation. And he was that this contract of gamma, which is one or can be dropped. But now things are more difficult. This before I said we have a contracting that. But this is not clear here. Not.

Not in all the cases. We don't have to discount any more. So this needs. Additional things to make sure that we have a unique fixed point and so on. And all this was. Let's hear we talk about the stochastic sort of path problem. And all this research is about these extra things that we need.

So just I skipped the outline at the beginning, but now here comes the what I just did was to give you some basics about the model we talk about and end these important measures, and now we talk about the stochastic sort of path problem. First the classical one and then later on on on variants. So this notion stochastic sort of past problem and the main results, they go back to Betsy and Alex from the year this is very small.

It is 9 to 1 in a paper published in Mathematics of Operations Research, and in this paper they have a very nice section on the related work or summary of the work that has been done at this Times was summary of the state of the art. So here is this paper from 91 and it goes back to 62.

This problem has been investigated first, and they also nicely describe that even Apollo tries to introduce a new name for this because this was used under different names and the three different three other different names before. But they also explain that this is in the style of classical short of path problems which are known from graph theory. So now. Let's look for this picture here. We have a way to find out State Markov decision process.

And we have a certain gold state, which is the type of state that we want to reach, and we accumulate the rates until we reach this gold state. And then we want to reason about the mean or maximal expectation for this random variable. Now for the rest of the talk, I will consider the longest pass problem we have here to the maximum. The techniques for minimum are completely analogous.

So the first result, and this goes back to the late sixties was that in the special case where under every sketch you scapula or the minimal probability to reach the gold state is one. In this case, what have been proven is that the bellman equations, they still work. And they get a contracting rep. So who empty piece of this form out called contracting MIPS. And these are exactly the you see free notes that do not have any components.

And it was shown that there is a reduction to the discounted case in this case. So all this machinery for discounted rate works also for this case. Now the contribution of that take us to thecase was this requiring that under every catalogue we reach gold state. It's a very strong requirement. Now let's relax it by considering only those catalogues, what they called the proper catalogues, namely those that reach the goal of the probability one. But there might be others. So. And.

They also achieved that under certain conditions. We can still rely on these Belmont equations and techniques that have been developed for the discounted case, but they needed a second condition here, which is that that the total expected total weight under each improper scheduler. So each scheduler that. Does not reach the gold state with probability one. Must be minus infinity. And they showed that if these two conditions hold it, then the maximum is indeed finite.

And all this machinery that has been developed for the discounted case still works. So. So now the question is. How to track this condition. The second condition, the first one, this is this is simple. They agree to do it in polynomial time. But how to track this condition? The second one. And the next thing is a. So if these two conditions hold, then the maximal expectation is finite. But there can be other cases where the maximal expectation is finite. So it's not a complete criterion.

So what to do in this case is wait to find out how to check this and how to compute in these cases an optimal catalogue and the optimal value. And the first answer was given to these questions was given by look at Alpha Howell in his Conquer 99 paper, where he considered Markov decision processes where the weights have all the same polarity that they are all known positive or they are all negative.

And what he did was he provided a transformation to the case of peptic ulcer and to clear case so that these two conditions hold for a transformed and repeat. And the idea here for the case this is now for the case with all weights are non positive, so either negative or zero. Then you observe the following whenever you have an end component where every state action pairs meet zero, then you can collapse them to a single state. Then it's no longer an end component.

And. In that way, you can remove all these end components consisting of what I call zero end components. And what remains is an MDP, where each end component contains at least one state action pair. This negative weight and the other rates are non-negative, non positive anyway, which gives you that expectation. The expected value of the total weight is indeed minus infinity for each improper scatterplot. And in the case where all the rates are non-negative, then Alpha proved the following.

This makes the maximal expectation is finite. If and only if the weights in all end components are zero. So no end component contains the stated comparative positive weight. And in that case one can do the same thing. One collapses all the zero and components and the result is an MDP that has no improper scapula at all. The all that did the remaining to the end. If all these and components enjoy this, then after this collapsing the zero and components, there are no end components anymore.

And these these concepts can also be generalised for the case of arbitrary interrogates. So we can have a mixture between positive and negative rates. And let's have quickly a closer look in this criterion pool for finite nodes and and for this transformation. There is some cleaning up to make sure that from every stage we can reach the goal with state of this with probability one and that none of the states are removed and so on.

I will skip this. Then, after such a cleaning up, the result or characterisation of the finite notes looks like this. The maximal expectation is finite. If and only if the MDP has no great divergent end component and component is said to be very divergent if it has a schedule such that for almost all the passes along this catalogue, the limbs of these prefixes of the weights. Hence to infinity. Now, this observation then yields an algorithm for checking finite nodes.

Compute the maximal and components. This can be done with an iterative, strongly connected component computation in patriotic time, and then one can track this with divergence separately. To give you an idea for that, let's have a closer look. So. This was again, the definition of great divergent and components. They exist as catalogues such that almost surely the same look tends to plus infinity. One special case also called Pumping and Components.

The same thing holds, but now in a stronger form. Namely, even the live in goes to plus infinity. This, for instance, applies to the end component, which is your trust mark of train, which is completely uniform. So if interest is uniform, probability will stay in that state or go to the other state. And one state for one state, we have to wait plus two and four out of one minus one.

Then if you look for the accumulated weight on most trawler, it goes up and up and up and higher and higher and higher such that indeed the limits will be plotted infinity. Another extreme situation is what we called what has been called gambling and component. The limit goes to minus infinity and super goes to plus infinity. So we still have this divergence condition here on the right hand side. So this would be the case where this accumulated rate oscillates.

So it goes down and down. So up and down and up and down and so on. So now here comes an algorithm to track rate divergence. Assuming that we have a strongly connected Markov decision process and this assumption is justified because, as I said before, to take this finite ness of the maximal expectation, it is sufficient to analyse the maximal end components separately.

So we are in this strongly connected case. So we first compute the maximal mean pair off and a corresponding memory deterministic scheduler. If this mean payoff is positive. Then we can safely return. Yes, because just remember, the mean payoff was defined as the average of the rates that we collect in the first steps, and then the end goes to infinity. So if this is the case, if even this mean payoff is positive, then we are in a pumping case.

If the mean of of this catalogue or the maximum of is negative, then we can safely return. No. This is not the case. It cannot be divergent because the maximal expected mean pair of in all and components must be negative. So when we return, when we arrive here, I thought then need a tool, not three of flights. So between that expected mean pay of what a maximum of is zero. Now this catalogue sequence is a memory list. Deterministic scheduler induces a Markov trade.

Now this mark of trade, you can consider it as a graph. It has certain strongly connected components. And now look for those that are at the bottom that cannot be left. Pick one of them. So if this one is compelling and reminder, this means that the name goes to plus infinity to minus infinity, then we can safely return. Yes, because gambling implies rate divergence. The question is how to track this being gambling or not. This is an NPR problem in general. But here we are in a similar case.

We have a Markoff train. So this is indeed a Markoff train. And we know that the mean payoff is zero. But then being gambling is equivalent to the existence of cycles where the accumulated weight is positive, which again is equivalent to the statement that there are cycles where the accumulated weight is negative. So and this is a simple graph condition. So this can be also tracked by graph. Now, what remains to consider is the case that this is not complete.

But then. But then the only thing what can still apply is that all the state action pass in the end component, or at least PSC must have made zero. All the cycles must have zero before effort. The criterion is to check the existence of positive cycles, which is equivalent to the existence of negative cycles. And what if this does not apply? Then this means that the accumulated rate along all the cycles must be zero.

And in that case. What we can do is we can somehow flatten and such and components. Now, in the case of non positive and non-negative weights, I said we simply collapse them to a single state. Yet the situation is a bit more complicated. But if I look at the time, then I think I will keep this. If you like to see it, I'm happy to show it after to talk. No. So. Well in the end with this technique to remove.

And components are indeed back to exactly this picture of what I have shown you for the case for Fallujah as I was case. Not positive or negative weight. There is a transformation which is polynomial time to a new MGP that satisfies the constraints that are considered by Betsy us and is such that we can rely on their machinery and rely on all these techniques that have been developed for the discounted case. So let's look for variance and let's start. Yeah. This. A few questions.

Why do we consider variants? The first thing is. I said to you, we consider the maximum overall the proper schedule of low dose schedules that reach the goal stage with probability one. Is it really justified to ignore the improper ones? And what is the case if there are no proper schools at all? If all this catalogues fail the Golden State with some positive probability. And another question.

Just looking for the expectation, then this might not be the best solution if the maximal optimal schedule is a good expectation. Okay. But the variance is very, very high. Then there is a very high risk that concrete simple runs have the accumulated weight much less than the expectation. So in these cases it might be better to have suboptimal sub maximal expectation but a better, smaller variance. Looking for the variance. I find it a bit surprising.

This is surprisingly difficult. So a natural thing would be to have the lowest threshold values. I want to have at least this and that expectation. And I have a threshold and an upper bound for the variance. No such threshold problems, at least to the best of my knowledge. It is not known to be solvable. What is known is to achieve such thresholds or to satisfy such threshold conditions. One might need schedules that use both memory and randomisation.

But I'm not aware that there is any solution that is known. There is a partial answer, but for a much simpler setting, namely, if there might be difference catalysts that achieve the maximal value. And if you consider only those and you seek for very minimal schedule on the dose that achieve the maximum, then the situation is simpler. More or less one can use bellman like equations for the variants. And what makes it so simple is. In the in this with this question here, the expectation is fixed.

It is a fixed value that can be computed before and then it can be treated as a constant in this linear program. So all this catalogues under consideration here. They have all the same expectation and this makes it simpler. This problem. So another thing is to look at what have people done in risk theory. So there is a notion called variance penalised expectation.

So what they do in this context here we still want to have good high expectation is to take the supreme of all its catalogues that can be achieved for this goal function. Here we take the expectation, the expected accumulated weight minus some factor lumped on some putative value times the variance. So there is some incentive not only to maximise the expectation but also to look for a small variance value. For solution methods for these. I come to this in a second.

There is a partial answer. Another measurement also taken from risk theory is the so-called conditional value at risk, which has been investigated by Kaczynski and to be a second offer. So this value at risk would fix some some probability value or some percentage. Pete And you are interested what happens in the worst cases. So in the 10% worst case for this, you can define the so-called value of risk here. So where it swaps? What are the worst cases?

Let's let's say it in that way. And then you look here in this conditional case, what is the the the expected, uh, accumulated rate. In exactly these cases, these p worst cases. But that's what they do. Solution method. I will talk in a second. Uh, or so. This was already a conditional expectation. But also, you can look for this in a clear form. The conditional expected accumulated weight under the assumption that we indeed reach the goal state.

This makes it possible to relax this assumption that we only look for the schedules that achieve the goal of this probability one. Instead, it is sufficient to look for those that reach the goal state at all. And another variant of this one, which is conceptually close to this one, is what we what has been called partial expectations. So for this random variable, we say take the accumulated weight in case that we did reach the gold state and otherwise zero.

Now these problems look quite different, but they have some similarities. And I will briefly zoom into conditional expected accumulated rewards and then give you an idea of what are the similarities to these other problems. So, uh, the first thing, why should we interested in these conditional expectations?

Well, it's a natural measure. For instance, if you have probabilistic non deterministic programs, then you might be interested in the conditional expectation time under the assumption that the program indeed terminates. Or you might look for certain failure scenarios. And then what are the expected costs of repair protocols in these particular failure scenarios which can suit your condition?

And it has also been used for some kind of multi objective reasonings with cost utility to analyse the cost utility trade of its captive. It is more difficult than the standard, uh, stochastic sort of past problem because as you will see in a second, optimal memory, less credulous might not exist anymore. We might need history. And also this using linear programming techniques in this naive form, this one variable state won't work anymore. So let's look at an example, namely this one here.

So this is an empty pig where it is just a single state where we have a non deterministic choice as to where we can go with an alpha directly to the gold state. It's probability one. Or we can take action beta with the self loop here or going to a failed state. So let's look first for the two deterministic schedulers memory less deterministic schedules that always do the same thing when they are in state. So the first option is always take action undefined state as to then.

We have to be have trust. A single pass going to to pass is going to go with oneness from aethereal via one to goal. Then we collect the reward for this action, which is all. And the other one goes from zero to S2 and then to gold and into the reward zero. So the result is a half. But this catalogue, the minimalist, deterministic catalogue which always chooses beta in state as to then achieve a better conditional expectation namely are.

And this is because they stressed a single pass going to goal. While as one we achieve the value, the reward order to the weight are for gamma and the other possibility for this conditional expectation. We never reach goals. We simply cannot. So it was a beta is better than alpha. But we can do better. Let's look for the following schedule in state as to for the first and visits we take beta and for the end plus first visit.

If we have taken the self loop. If it happens that we have taken the self loop and times. Then we take action. So there is still this possibility. Going directly from zero to S1 to goal. This stands for this one half times are. Plus now one half times the probability to take this self loop and times. This is this value. And whenever we take beta, we own this great one. So all together we get to the factory in. Now you can calculate.

Then the result would be this. And this is better than or remind. This was the conditional expectation that we achieve with this memory less deterministic scheduler which always chooses beta. So this value is larger than our if and only if this number n is larger than R, and one can compute that the maximum is achieved when equals plus two. Now, what this example shows is. We can eat memory. So in this case, we need a counter for the number of we have taken this self loop.

And what is meant by your biting local reasoning is not sufficient. I trust that the optimal value is achieved for this cat. After our plus two visits, as to we take extra alpha. Now, this depends on all which is the rate of this action gamma, which is sitting here. So this condition is even not reachable from S2. Just looking for the MDP that is reachable from is too. We cannot see anything about the optimal behaviour in state as to.

This is what I mean here is with local reasoning is not sufficient. And there is also another phenomenon which does not occur in the in the unconditional case. For the initial state. The maximal expectation is finite. But. Oops. Prostate is too. If I consider this as the initial state, then indeed this maximal conditional expectation is infinite because I have the possibility to pump at infinity. And go then to two to the gold state.

So, uh, now, although in the interest of time, I will just mention the results and skip these, these details here. Uh, if there is a method, though, this was on the previous slide, there is. There are graph techniques to check finite. So assuming that we have tracked this and made sure that the conditional expectation is finite and assume that all rates are non-negative, either positive or zero, but no negative values. Then what can be shown is that the threshold problem.

So to check whether this maximum conditional expectation is beyond a certain threshold. This is a piece based hard problem and piece based complete in ethically case. Now for algorithms in the in the non to click case the way we have cycles. The key observation is there is what we called a saturation point.

So if there is some natural number such that whenever we reached a configuration state and look for the accumulated weight that has been accumulated so far, if this one is larger than this operation point, then from then on the best what can do is to maximise the probability of reaching a gold state. And the intuition is. The accumulated weight is already so high that that we want to save.

We do not want to go. The risk that that we fail to visit the ghost state, in which case we lose everything. That's that's the intuition behind this. Well, from this one, one can derive an exponential time threshold algorithm and an exponential time algorithm to compute this optimal value. Now, this was this was saturation point algorithm for the case of conditional expectation.

And now remember that I mentioned before other variants, the same observation with these separation points have been made for partial expectation. This is the result by my test group. The antidote has also been made for conditional values for risk. This is the result by gift making of them. And. It also applies to the setting of variance penalised expectations reminder that the goal was to maximise the objective expectation minus lumped times variants slumped up being from positive constant.

And he is also said to raise a point, but here comes some strange phenomenon behind that. Saturation point. The best thing one can do is to minimise the probability to reach the gold state. And the reason it's more or less intuitively, this variance is a quadratic function. And from some point on in this objective, the variance becomes the dominant thing. And so what one wants to do is to to minimise that from some point on.

One wants to avoid that the value gets too high. That's, that's the intuition behind that. Which at the same time makes this very polite expectation in this context a bit questionable. Earth and other taking other things flipped on the owner of a click empty piece with interest rates. Then it's not a problem because in the case we only have finite many pasts. They can be explored using purely space algorithms.

And for many of these problems, I think even for all AP space, lower bound have been established for this case. So now so this word restrictions, non-negative weights or etc. Now what is in the training case? And one indication that this is more difficult is that these values, the maximal expectation defined in one of these variants can be an irrational value. So one should not expect linear programming techniques as they are known for the unconditional case.

And indeed, what the result of Jakob Bauer in his Ph.D. thesis. What he has shown is that quite a lot of these problems are Coulomb or even positivity of heart. Now, this column problem is a very prominent and very old problem in mathematics to decide whether a recurrent sequence eventually hits zero or whether all these values are positive and the desired ability status is unknown for quite a long time. Ben, Do you know I forgot when this column problem of positivity has been considered first?

Or anyone. In the. That okay. In the 17th. So quite old problem. Um. So trust the site remark. Uh, well, I say I was surprised to see these results, that these problems are so hot. And Jaco introduced to my opinion, a very elegant proof technique. So he used. The same MVP get hit for for encoding linear recurrent sequences to prove school m or positivity hardness for a wide range of problems.

Although for other problems which have nothing to do with weights like the model tracking problem for frequency l t l in Markov decision processes or threshold conditions for termination probabilities in one counter MVP. And these reductions, they only differ in the truth of the in the choice of the initial condition. So let's come to the conclusions. So for the classical stochastic sort of past problem, the key development equations from which we have to.

This full machinery tool that can be exploited algorithmically. So linear programming techniques, value optimisation, policy, iteration, and let's say. To get there were first. These are kind of semantic site conditions that take us a tricky approach. But after some graph transformations, polynomial time transformations, one can reduce to this case.

Now for these variants that that I proposed here or that have been considered in the literature, I found it surprising that they are so difficult in the case of introverts, this positivity, hardness and quite a few of them enjoy this property that if all debates are non-negative, then one can exploit the non-monetary necessity of the accumulated weights along the prefixes, which gives you a saturation point characterisation and then X time or X based algorithm.

And one. Oops. One other remark. So this was for Markov decision process, which are also considered as one player stochastic games. Now in the case where you have to play us. Then let's say the objective of one player is to maximise this expectation and for the other player to minimise the awesome results by putting in better outcomes from 99. But at least to the best of my knowledge, this problem has not been completely solved.

And there is also this. This I mentioned this, this partial of expectations. This also amount of paper has been considered also in the case of of the game setting. But I'm not real for algorithms, for the conditional expectation or the other variance of the to have the shortest pass problem. Now, I almost made it in time. Thank you very much.

Transcript source: Provided by creator in RSS feed: download file
For the best experience, listen in Metacast app for iOS or Android