Assessing Personalization in Digital Health

Jun 23, 2021, 58 min

Episode description

Distinguished Speaker Seminar - Friday 18th June 2021, with Susan Murphy, Professor of Statistics and Computer Science, Harvard John A. Paulson School of Engineering and Applied Sciences. Reinforcement learning provides an attractive suite of online learning methods for personalizing interventions in digital health. However, after a reinforcement learning algorithm has been run in a clinical study, how do we assess whether personalization occurred? We might find users for whom it appears that the algorithm has indeed learned in which contexts the user is more responsive to a particular intervention. But could this have happened completely by chance? We discuss some first approaches to addressing these questions.

Transcript

So good afternoon. I'm very pleased to welcome our distinguished speaker, Susan Murphy from Harvard University. Susan Murphy obtained her PhD in statistics from the University of North Carolina at Chapel Hill in 1989 and afterwards held positions at several universities, including Penn State and the University of Michigan, where she was a professor of statistics.

She joined Harvard in 2017, where she is a professor of statistics and of computer science. She is a world expert in experimental design and sequential decision making, with a particular interest in digital health. Her work has been extremely influential, and she has received numerous honours, including, among many others, the Fisher Lectureship in 2018 and a Guy Medal from the Royal Statistical Society in 2019.

She is also a member of the US National Academy of Sciences and of the US National Academy of Medicine. On top of being a fantastic scientist, Susan has done a lot of work for the statistics community and is a past president of the IMS and of the Bernoulli Society. So, without further ado, our distinguished speaker for this seminar, Sue, will be talking about assessing personalisation in digital health.

Thanks, thanks for that introduction. Thanks for the invitation to speak with all of you. Can I share my screen? Yes, you should be able to. Yes, you should see it now. Just let me fix it so I can see. OK, great. So this is work that we're engaged in right now.

And these are our first efforts in this direction. It was motivated by our concern that when you run an online algorithm, in this case in digital health trials, and you look at the results, sometimes for some individuals the results look just totally fantastic. It's like you personalised, and now everyone should use your algorithm. And the question, of course, is: is this spurious? So that got us going down this particular path.

And I'll share with you today what our very first steps in this direction are. This will be focussed on HeartSteps; I'll describe that shortly. OK. So I just wanted to mention that this type of research involves large collaborative teams, because you're developing an algorithm, then you're implementing the algorithm in a trial, and then you're analysing the data. And there are usually software engineers involved as well.

And I just wanted to shout out three individuals who really made a big impact. That's Peng Liao, who is a postdoc in my lab; Kelly Zhang, a computer science Ph.D. student, also in the lab; and then Xie Yang Ji, who's an incoming first-year Harvard Ph.D. student. So what I'll do is, first of all, describe HeartSteps, and then we'll go on to the issue of personalisation. OK, so HeartSteps was funded to construct this activity coach.

It's on your phone, and individuals wear a wristband tracker. It's for individuals who are at high risk of coronary artery disease. There are three studies that are part of this. The first study was only six weeks, and the next two, which ran into each other, were three-month and nine-month studies.

And I say these studies are micro-randomised, and I say that, in particular, the last two studies are personalised, and I hope as I go through you'll see what I mean by that. If not, just put a question in the chat and I can make that clear. OK, so in all digital interventions there are many intervention components. We're going to focus on only one intervention component, and that's whether or not to send a notification.

It would appear on the lock screen of the individual's smartphone, and the content of this notification is tailored to where the individual is at the moment, the day of the week, what the weather is like and so on. You can see an example on the right-hand side, and this appeared in the morning. It was actually a very cold morning, so you can see that it's trying to get me to reframe my view of cold mornings and think about walking to work today.

So all the little suggestions are intended to help you be more active wherever you are at that moment in time. And we want to decide: should the algorithm send one, or should it not? There are five times a day at which these notifications might be sent, and those five times are user specific. They have to do with the way that individual organises their life. And the reward, what's called a reward, or in our world an outcome, a near-time outcome, is the 30-minute step count after each of these time points.

And the reason why it's only 30 minutes is because the content of the notification is all about being active in that moment. So when you think about data from one of these three trials that I mentioned two slides ago, what the data looks like, on each user, is a time series of tuples. I'll call them state S, action A, reward R; it's a whole time series.

And the number of time points depends on whether the user was in the three-month or the nine-month study. At each time point, the sensors on the wearable, as well as on the phone, pick up the person's current context, or state, and then an action, that is, send the notification versus not, is chosen by an algorithm, and we'll discuss that shortly. In our case, we're only focussing on send versus not send.

And then after that, the sensors on the tracker record the 30-minute step count, and we're going to focus on the log of the 30-minute step count, mainly because step counts are highly right-skewed. In particular, the notation I'm going to use throughout is that the mean of that log 30-minute step count, given current state and current action, where the action is either one or zero, send a notification versus not, is denoted by this lowercase r(s, a).

You should be able to see my pointer here, and I'll use this notation repeatedly throughout. OK, so now about the algorithm that was used online, as the trial went on, to determine whether or not a notification was sent at each of the five times per day. What I'm going to do is give you just some small aspects of this algorithm, because I really want to get to the latter part of the talk, where I talk about how to assess how well the algorithm personalised.

So I'll only talk a little bit about the algorithm itself, and I can speak about it more if people have questions. These are online decision-making algorithms, and the idea is to select these actions, send a notification versus not, in order to maximise some sort of outcome. In this case, it's the average of the log step counts. And in this world it's always subject to a number of constraints, which are often expressed in a very qualitative way, and you have to figure out how to quantify them.

So one constraint is to permit what's called off-policy learning after the data collection is over. In the field of reinforcement learning, which is what this belongs to, there is a lot of interest in understanding: well, if I had used some other way of selecting the actions, how might the sum of the rewards behave? So you want to permit those kinds of analyses after the data collection ceases.

There's also, because of the domain, burden. User burden is a big issue, and habituation, that is, when people no longer even notice the notifications, is also a big issue. These impose enormous constraints on any algorithm you're going to run. And you can't rely on the length of the trial; the length of the trials just has to do with how much funding there is. So you want an algorithm that doesn't need to know when the trial is going to end.

So what we did in V2 and V3, that's the three-month and the nine-month studies, was we started off with a bandit algorithm, a Bayesian type of algorithm called Thompson sampling, and we altered it in a variety of ways. I'll just point out some of the ways in which we altered it. This algorithm ran on the cloud and communicated with the phone and the tracker in real time.

It's supposed to be personalising the decision as to whether or not to send a notification at each of those five times a day. As a statistician, when I think of an online decision-making algorithm, I always think of it as being composed of two sub-algorithms, two elements. One is what people call a learning algorithm. This is just an incremental statistical method, and the goal is to learn some characteristic of the multivariate distribution of the data.

In this case, it's to learn the mean of the log 30-minute step count given state and action, and in our case we used a Bayesian linear regression model. It can be viewed as a simple Gaussian process model with a very simple kernel, and that actually opens doors to us in a variety of ways. So that was one element. It's essentially an incremental statistical method. In our case, it's Bayesian, and you can think of it as linear regression.
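As a rough illustration of the "incremental statistical method" described here, the sketch below performs one conjugate update of a Bayesian linear regression each time a new (features, reward) pair arrives. It assumes a Gaussian prior and a known noise variance; the function name `bayes_linreg_update`, the two-feature example and all numbers are hypothetical and are not taken from HeartSteps.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def bayes_linreg_update(mean: Vector, cov: Matrix, x: Vector, y: float,
                        noise_var: float) -> Tuple[Vector, Matrix]:
    """One incremental posterior update for y = x . theta + Gaussian noise.

    The current belief about theta is N(mean, cov); noise_var is assumed known.
    Returns the updated (mean, cov) after observing the pair (x, y).
    """
    d = len(mean)
    # cov @ x
    cov_x = [sum(cov[i][j] * x[j] for j in range(d)) for i in range(d)]
    # Predictive variance of y: x' cov x + noise_var
    pred_var = noise_var + sum(x[i] * cov_x[i] for i in range(d))
    # Prediction error under the current posterior mean
    resid = y - sum(x[i] * mean[i] for i in range(d))
    new_mean = [mean[i] + cov_x[i] * resid / pred_var for i in range(d)]
    new_cov = [[cov[i][j] - cov_x[i] * cov_x[j] / pred_var for j in range(d)]
               for i in range(d)]
    return new_mean, new_cov


# Hypothetical usage: an informative prior, then two observations in sequence.
mean, cov = [0.5, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for x, y in [([1.0, 0.0], 0.2), ([1.0, 1.0], -0.1)]:
    mean, cov = bayes_linreg_update(mean, cov, x, y, noise_var=1.0)
```

Because each update only needs the current mean and covariance, the posterior can be refreshed as data arrive without refitting on the full history.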

The second element of this online decision-making algorithm is an action selection strategy, and that strategy is all about how you are going to use the outputs of the learning algorithm to select the actions, five times a day, as the individual experiences the mobile app. What we're doing here is called posterior sampling; at least nowadays it's called posterior sampling.

I don't think that when Thompson first invented this he was thinking in this way, but nowadays it's called posterior sampling. The idea is this: your learning algorithm is Bayesian, so you can calculate the posterior probability that the treatment effect in the current state s is greater than zero.

So the treatment effect is r(s, 1) minus r(s, 0), and you calculate the posterior probability that it's greater than zero. Then you take that posterior probability and you randomise. So you're randomising, it is a sequential experimentation setting, but the randomisation probabilities are tied to how the Bayesian algorithm anticipates the effect of sending a notification in the particular state the individual is in right now.
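The randomisation step just described can be sketched as follows, assuming the treatment effect is linear in a feature vector and the posterior over its coefficients is Gaussian. The function names and inputs are hypothetical, and this is the plain, unmodified form of posterior sampling, not the altered version actually deployed in the trial.

```python
import math
import random
from typing import List


def prob_effect_positive(features: List[float], post_mean: List[float],
                         post_cov: List[List[float]]) -> float:
    """Posterior probability that the treatment effect (features . theta)
    exceeds zero, when theta has the Gaussian posterior N(post_mean, post_cov)."""
    d = len(features)
    m = sum(features[i] * post_mean[i] for i in range(d))
    v = sum(features[i] * post_cov[i][j] * features[j]
            for i in range(d) for j in range(d))
    if v <= 0.0:
        # Degenerate posterior: all mass sits at m.
        return 1.0 if m > 0 else 0.0
    # Normal CDF evaluated at m / sqrt(v), written with the error function.
    return 0.5 * (1.0 + math.erf(m / math.sqrt(2.0 * v)))


def choose_action(send_prob: float, rng: random.Random) -> int:
    """Randomise: return 1 (send) with probability send_prob, otherwise 0."""
    return 1 if rng.random() < send_prob else 0
```

With a posterior centred at zero the function returns 0.5, which is the behaviour the talk later identifies as a problem when there is no evidence of an effect.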

This is a greedy personalisation. So what do I mean by greedy? If you take a bandit algorithm and you work around that area, then you're not paying attention to the effect of the notifications on future rewards. And clearly, that's not a good idea here. But there's a bias-variance trade-off, and in our case we decided to just focus greedily on this time: would it be useful to send a notification or not?

OK, so I just wanted to talk a little bit about the learning algorithm, the first element of this online decision-making procedure, just to provide a little bit more context. In this particular setting, because of the high noise that one incurs in these kinds of real-life experimental and sequential decision-making problems, we use a very low-dimensional treatment effect model.

And you see it here. It's a linear model in features, and all of the features of state were handcrafted by the scientific team. There were five features, so it's five-dimensional. And in fact, we always use informative priors. I was never a Bayesian before. I have now become a complete Bayesian with informative priors.

Forget about this noninformative business. The way we form our informative prior in this particular case, and in general in this kind of setting, is that you have a prior study, and we did. We had HeartSteps V1, and we could use that study to form the prior for V2 and V3. And I want to mention some things about this prior because it's going to be important when we go on. So this prior on theta is five-dimensional, and I'll show you the features on a later slide.

But the first feature is just the overall effect of sending a notification versus not. And that was the only feature that had a positive theta; that is, the prior had a positive mean for that feature. The prior mean for all the other features, the remaining four, was zero. OK, so this is important to remember for later on. So we're starting off the trial with a prior that says we anticipate there to be a positive effect of sending a notification overall, across all states.

We also had a baseline week for each user, a week where we just collected data on the user. We randomised whether to send a notification with probability 2.5 each of the five times a day. We did a lot of things to this algorithm, but I just wanted to mention the parts on the prior slide. Now I want to talk a little bit about the action selection strategy, which was posterior sampling.

And this was a sad case, because I was very naive when I started down this path. For the first set of people that came into V2, we ended up not being able to use this part of the study, and I'll indicate why that happened. So what does posterior sampling do? You calculate that posterior probability that the treatment effect is greater than zero, and then you randomise with that probability.

So look at number one. In some states, the posterior distribution of the treatment effect is going to be highly peaked and centred around zero. What that means is your randomisation probability will average around 0.5. So with probability 0.5 on average, when there's no evidence of an effect, you're sending a notification.

This makes no sense whatsoever from a domain science perspective, particularly if you're worried about bothering people and having them habituate to your messages. This is definitely not desirable. We didn't even think about this at first. And then in other states, where you're getting a large amount of information, your Gaussian posterior will be highly peaked around a positive value for that state.

And you're going to think, whoa, you really should send a notification in that state. But again, we've got to remember this is a setting in which people get overburdened by having their phone pinging all the time. Do you really want to send that notification every time you're in that state? No, you don't really want to do that. So I'll tell you what we did. We're trying to improve this, but we became engineers; we have to put this into the field.

And we also needed to permit off-policy learning after the data collection ceases. That was the third issue. Now, to learn, unless you're willing to make a lot of assumptions, in any given state you must sometimes choose action zero and sometimes choose action one. You can't just always choose one of the actions. OK, so what was our solution?

Our solution was to take that posterior probability and clip it. I drew that little graph in blue on the right-hand side, and this is how we clipped it. If the posterior probability that it was really a good idea to send that message in that state is above 0.8, it becomes 0.8. If the posterior probability is around 0.5, indicating there's probably not much of a treatment effect, we send the notification with probability 0.2.

Now, why are we doing this with 0.2? There is a lot of evidence in this world that variability is therapeutic, so every now and then we want to send a message just to shake things up. So the lower bound is 0.2 and the upper bound is 0.8, and that's what the two sentences at the bottom of the slide are about. So p_U, that's the upper value, 0.8 in our case. This is determined by our need to do off-policy learning.

We can't have one, because then we won't be able to learn off-policy. It's a disaster. And from a domain science perspective, we don't want to overburden our users. Then p_L, which is 0.2 in our setting: again, we don't want it to be zero, because we have to be able to do off-policy learning. But here also, that's where the health benefit of having some variability comes in.
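The role of the two bounds can be shown with a minimal sketch that restricts a send probability to the interval from p_L = 0.2 to p_U = 0.8. This is a plain clip and is a simplification: the talk says a posterior probability near 0.5 was mapped to 0.2, so the curve used in the trial was evidently not a plain clip, and its exact shape is not given here. The function name is hypothetical.

```python
def clip_probability(p: float, lower: float = 0.2, upper: float = 0.8) -> float:
    """Restrict a send probability to the interval [lower, upper].

    Keeping the probability away from 0 and 1 means both actions are taken
    sometimes in every state, which is what off-policy analysis needs.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("need 0 <= lower <= upper <= 1")
    return min(upper, max(lower, p))
```

For example, `clip_probability(0.95)` gives 0.8 and `clip_probability(0.05)` gives 0.2, so neither action is ever chosen with certainty.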

And we are also concerned about non-stationarity. Even though I'm not dealing with it today, it is a big issue in this world, and we want to allow our after-study analyses to investigate that. OK. So we ran this algorithm. I just gave you a very high-level view; we actually did a number of other things to make the algorithm suitable for this type of setting. The studies are over. Did we achieve anything like personalised digital health?

OK, so I'm going to have a whole series, I think there are about seven questions, that I want to address as we go through. This bandit algorithm, or rather a generalisation of a bandit algorithm, was run separately on each of the 91 individuals. When I say it was run, it was used to choose whether or not to send a notification in each state, at each time, five times a day, over the duration of that individual's study.

So some first questions I'd like to ask. In HeartSteps V1, there was an overall treatment effect, and our prior pointed us in that direction. Is there evidence from this data that that's the case? That's not about personalisation; it's just on average. And the next is more closely related to personalisation: is there evidence of heterogeneous effects, and how do you think about that in this problem?

So here we had no idea what to do. So we go back to the literature, the old, very mature literature in clinical trials. This is classical meta-analysis. If you're into machine learning, you know about meta-learning. This is not meta-learning in machine learning, OK? This is classical meta-analysis.

In fact, at the bottom of the slide I have a reference to a really lovely tutorial that came about when this area had really matured, 21 or 22 years ago; it's a very old area of science. So the way we're going to think about it is that each user, and we have 91 users, is a clinical trial. This is how we're going to think in our head. Each user has their own unknown vector of true treatment effect coefficients.

So I subscript by zero because that's their true treatment effect coefficient, and i indicates the user. And then each user has to have an estimate of that user's treatment effect coefficient. What I use is the vector of posterior means. And what I'm thinking in my mind is that this was just a Bayesian linear regression. This is just ridge regression here. My theta-hat_i is just a weight vector from a ridge regression.

That's all it is. You can see what theta is. Remember, theta_i is a five-dimensional vector; it's the treatment effect model. In classical meta-analysis, there are two ways that people think. The first way is you say all I care about are the users, or in their case the trials, in front of me. So all I care about is these 91; you don't care about anything else.

I only want to make inference about these 91 users, and what one does is approximate the distribution of the estimates we derived from the ridge regression by a normal. It should have as its mean the true underlying regression coefficient for that user i, and then there is some variance. The variance has to do with the fact that we didn't have an infinite number of examples on this user. So there's variance.

The second way that one thinks in classical meta-analysis is population inference. Here you think of my n users, 91 in our case, as a subset of a population of users, and we want to make statements about that whole population. And in this study, it made a lot of sense for us to think that way as well.

And the reason is that all of these individuals are patients in the Kaiser health care system in Seattle, and they had all just been diagnosed with stage one hypertension. So if the health care system was thinking about whether to roll out an app for patients who have just been diagnosed with stage one hypertension, this type of inference would be relevant.

In this case, you make an additional assumption, and this additional assumption is that as you go from one user to the other across the population, the true coefficient varies normally. The mean five-dimensional vector of treatment effects is theta_pop, and then there is some variation amongst these five-dimensional vectors as you go from one user to the other in the whole population. OK. So I'll start answering the two questions that I posed.
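Written out, the two modelling assumptions described above look roughly as follows. The symbols V_i (within-user covariance of the estimate) and Sigma (between-user covariance) are notation introduced here for illustration; the talk names only the estimate, the true coefficient and theta_pop.

```latex
% Inference about the 91 observed users only:
\hat{\theta}_i \mid \theta_{0,i} \;\approx\; N\!\left(\theta_{0,i},\; V_i\right),
\qquad i = 1, \dots, 91

% Additional assumption for population inference:
\theta_{0,i} \;\sim\; N\!\left(\theta_{\mathrm{pop}},\; \Sigma\right)
```

Under both assumptions together, each estimate is approximately normal around theta_pop with covariance V_i + Sigma, which is why within-user and between-user variance both enter the weights for population inference.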

I'm going to repeat the questions as I answer them. The first question is in this vein of population inference: is there some evidence of an overall average treatment effect on our 30-minute step count? In classical meta-analysis, the way one forms a statistic is you take a weighted average of your user-specific estimates. I'm not doing anything special here. This is classical meta-analysis.

So this little e, that's a vector of all zeros except for a one in one place, and it's just being used to pick out one of the members of the five-dimensional vector theta. Then we take a weighted average, and the weights are based on the within-user variance plus the variance from user to user. So this is classical statistics, and you weight your person-specific estimates by these weights.
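A minimal sketch of the weighted average just described, for a single coefficient, is given below. It assumes the standard random-effects form in which each user's weight is the inverse of (within-user variance + between-user variance), and it takes the between-user variance as an input because the talk does not say how it was estimated. The function name and the 1.96 normal quantile for an approximate 95% interval are choices made here.

```python
import math
from typing import List, Tuple


def pooled_effect(estimates: List[float], within_vars: List[float],
                  between_var: float) -> Tuple[float, float, Tuple[float, float]]:
    """Inverse-variance weighted average of per-user estimates of one coefficient.

    Returns (pooled estimate, standard error, approximate 95% interval).
    """
    weights = [1.0 / (v + between_var) for v in within_vars]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)
```

Users observed for longer have smaller within-user variance and so count for more, while a large between-user variance pulls the weights towards equality.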

Now here's this little green table. It gives you the names of the features. The first is the overall send-notification-versus-not binary. Then we had a dose feature, an exponentially discounted measure of how many times recently we've been sending you notifications.

The third feature, engagement, was whether or not people were going to the app to track their physical activity more often than usual. Location was whether or not they were in a structured environment. And step variation was how variable their step count was in that same time period over the last week. OK, so you do this, you do the statistics, you get your confidence interval, and you think, oh, this is great.

We have a confidence interval. It doesn't contain zero. That's lovely. So there seems to be some overall effect of sending a notification versus not, on average across individuals; this is not talking about personalisation. And then you go to that second row and you realise, and in fact we anticipated this, that the more someone has been notified, the less responsive they tend to be.

And in fact, this is a very large negative coefficient, and the confidence intervals are very wide, indicating there's a lot of uncertainty. This pretty much kills the average treatment effect, except when the recent dose is very, very close to zero. To hit that point a little harder, I'm going to look at a particular state here. The state is that the person has recently experienced an average dose, and they're currently engaged with the app.

They've been tracking their behaviours, they're at home or work, in a structured environment, and there is recent variability in their activity. So I'm just going to focus on that state, and I ask: what's the confidence interval for the treatment effect in that state? This is a confidence interval for the average, across the population, treatment effect in that state. And you see, indeed, there's just not much going on. There's not a lot of evidence there, right?

It's depressing. So then we asked: what about heterogeneity between users? Is the better action user specific? Now all of a sudden we switch our hats and we start focussing just on these 91 users. The test statistic here is based on the variation between users in their estimated regression coefficients in the treatment effect. Again, e is a vector of all zeros except for a one in one of the entries, depending on which coefficient you want to pick out.

You take the variation of each individual's estimated treatment effect around the average, and the average is, of course, weighted by how variable that treatment effect estimate is. It's not explicit here, but this depends on how long that individual was in the study. So here's us using it. The test is of the hypothesis that all 91 users have the same true treatment effect coefficients.

That's what this null hypothesis means. Theta is five-dimensional, and here we are going from one coefficient to the next, all five. And you see, there's enormous evidence that users differ a great deal, one from the other, in terms of their own treatment effect coefficients. Lots of heterogeneity. Very interesting.
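The heterogeneity test described above can be sketched, for one coefficient at a time, with Cochran's Q statistic from classical meta-analysis. The talk does not name the exact statistic, so treating it as Cochran's Q is an assumption, and the function name is hypothetical.

```python
from typing import List, Tuple


def heterogeneity_q(estimates: List[float],
                    within_vars: List[float]) -> Tuple[float, int]:
    """Cochran-style Q statistic for the null 'all users share one true coefficient'.

    Returns (Q, degrees of freedom). Under the null, Q is approximately
    chi-squared with len(estimates) - 1 degrees of freedom.
    """
    weights = [1.0 / v for v in within_vars]
    # Precision-weighted average of the per-user estimates
    centre = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    # Weighted squared deviations of each user's estimate from that average
    q = sum(w * (e - centre) ** 2 for w, e in zip(weights, estimates))
    return q, len(estimates) - 1
```

A p-value then comes from the upper tail of the chi-squared distribution, for example `scipy.stats.chi2.sf(q, df)` if SciPy is available; a very large Q relative to its degrees of freedom is evidence that users differ.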

OK, so if you're familiar with reinforcement learning or bandits, you know one of the things we always want to do is estimate the average reward and compare that average reward across different policies. So that's what I'm going to do here as well. On average, does the bandit algorithm select more effective actions, i.e. send a notification versus not, than the prior? Because remember, the prior was an informative prior.

We built it off prior data on similar individuals with the exact same interventions. OK, so I just want to make clear how I am quantifying more effective actions. By that, I mean you get a higher average reward. So here you have the value function V_i(pi) for the i-th user, where pi is a particular policy for choosing actions. This is just the expectation of the i-th user's reward function, which is a function of state and action.

This expectation is averaging over the states that that user experiences, as well as any stochasticity in the policy pi, and our policies are always stochastic. So it averages over both the states that that user finds themselves in and the stochasticity in the policy. And we want to know: does this bandit algorithm produce a higher value, a higher average reward, than if we had just built our policy from the prior data and run with it?

So what we're going to do is estimate that average reward under our bandit algorithm, or our generalised bandit algorithm. It's posterior sampling, right? So the policy is changing with time. That's the reason why there's a pi_1 through pi_{T_i}: the probability of selecting the action, send a message, changes with time.

And the estimate of that value is just the average that you see in front of you for that individual. Now, for the estimate of the average reward under a different policy, for example the policy built off the prior, there is a way; we used importance weighting. There are more sophisticated estimates now in the literature, OK, I just want to warn you, but this is a first-round kind of thing. So we use importance weights.

And you can see them there on the right-hand side; they weight those observed rewards in order to estimate the average reward under the prior policy, that is, if we just built the policy from HeartSteps V1 and ran with it, and didn't try to do any learning. Already, we should be thinking in our mind that the prior policy, our subjective prior, said there was an effect of sending a suggestion. But then when we went and did these analyses, it didn't look so great, at least not on average.

OK, so how are we going to do this? We're still in that classical, lovely meta-analysis world. So what do our thetas become? They're just one-dimensional now. Theta_{0,i}, the i-th user's true theta, is the difference between that user's value under the bandit and that user's value under the policy formed from the prior, and theta-hat_i is just the estimate of that difference.

We don't, of course, observe theta_{0,i}. This is unknown. So what we're going to do is the same as in classical meta-analysis. It is generally true, under certain conditions on the data structure, that theta-hat, that is, this estimator of the difference in values, is approximately normal. It has some variance because you're estimating it.
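For one user, the estimated difference in values described above can be sketched as follows. The value under the bandit is the plain average of the observed rewards, and the value under the prior-based policy is estimated with simple (unnormalised) importance weights. The exact weighting used in the study is not spelled out here, so this form is an assumption, as are the function names and the input `prior_p_send`, which stands for the send probabilities the prior-based policy would have used at the same decision points.

```python
from typing import List


def observed_action_prob(action: int, p_send: float) -> float:
    """Probability that a policy with send probability p_send takes `action`."""
    return p_send if action == 1 else 1.0 - p_send


def value_difference(actions: List[int], rewards: List[float],
                     bandit_p_send: List[float],
                     prior_p_send: List[float]) -> float:
    """Estimated value under the bandit minus estimated value under the prior policy.

    bandit_p_send[t] is the send probability the bandit actually used at time t;
    it must lie strictly between 0 and 1 so the weights are finite.
    """
    n = len(rewards)
    value_bandit = sum(rewards) / n
    value_prior = sum(
        observed_action_prob(a, q) / observed_action_prob(a, b) * r
        for a, r, b, q in zip(actions, rewards, bandit_p_send, prior_p_send)
    ) / n
    return value_bandit - value_prior
```

If the bandit's send probabilities are kept between 0.2 and 0.8, every denominator is at least 0.2, so no single observation can dominate the estimate.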

And so we make only this first assumption if we are interested in only these 91 users, i equal 1 to 91. And then we make an additional assumption, that's the second one right here, I have my pointer right below it, that the true difference in values varies from one individual to another in the population according to a normal distribution. Sort of like a random effects thing. OK. So the statistics are identical.

No difference. They're identical test statistics. So does the bandit algorithm result in higher average rewards than the policy based on the prior? In fact, it does, and the confidence interval doesn't contain zero. Now, this is not a big effect. Remember, this is averaged over the population.

Why do we think this might have happened? Well, the policy built off the prior wanted us to send the notification a lot, because the prior said there was a positive effect of sending that notification. The prior mean was positive. But the bandit learns that, on average across people, that's not true. So I interpret this as the effect of not bothering people too much, and every now and then it helps on average.

OK, so now let's focus on just our 91 users and ask about that difference in values, the bandit versus the prior. In other words, does the benefit of running a bandit vary from one user to the other? And here we use the chi-squared statistic. This is the second type of hypothesis test and the second type of statistic, and under the null hypothesis it has a chi-squared distribution. It's incredibly significant.

A lot of evidence that for some people the bandit gives you very different results than the prior, as compared to on average. So, OK, you know, this is all fine and good, and then we wanted to start looking at some exploratory work. And we're getting closer now to the question. Question seven was the question that motivated this entire project, and it's still motivating the project, but first I'm going to go to Question six.

Our prior, this informative prior built off HeartSteps V1, said there was a treatment effect. In fact, it said the treatment effect was pretty strong, because it was very strong in HeartSteps V1. On average across users, does it appear that the bandit algorithm learns over time? Well, what should the bandit algorithm learn on average across users?

It should learn that, on average across users, there's not much going on. OK, so the blue curve here is the actual one from the real trial, the actual average posterior mean of the treatment effect in this particular state. The second bullet point gives you the state. It's the same state we looked at before. So here you see that the x axis is decision points as the trial progresses, and then the y axis is that posterior mean.

And if you look closely, I don't know, it's hard to see, but the blue curve starts at time zero around 0.47. 0.47 was the prior mean of the overall effect of sending a suggestion versus not in that state. And we suspect from the prior analysis I've talked about today that, on average, that didn't bear out. And in fact, OK, so you see the blue curve. It starts to drift down. It's the average posterior mean of the treatment effect.

As the study progresses, it drifts down towards zero. OK, so I wanted to understand, was that drift really important? Was it significant, you know, in some way? So here's what we did. We did bootstrap studies. So there are two thousand black curves. Each black curve is a bootstrap, a bandit trial: a bandit algorithm applied to each of 91 bootstrap users.

I'm going to talk to you about how we made a bootstrap user. So what we did was we took our original 91 users' data. For every user, at each time point, we subtracted from the reward the posterior mean of the treatment effect at that time point. We call that difference a residual. It's not mean zero, because we just took away the treatment effect.

So now, after we did that subtraction on each of the 91 individuals, we have a time series of state, residual, state, residual, state, residual. Then what we do is we get a bootstrap sample of these 91 trajectories of state, residual, state, residual. And on each of the 91, we run a bandit. If the bandit says choose action one, we add back in a treatment effect according to the prior.

If the bandit says choose action zero, we don't do anything; we leave the residual as it is. So, OK, we do that for every person, so we have one bootstrap study of ninety-one users, with a bandit run on each of the ninety-one. And now what we do is we get this average posterior mean across those 91 bootstrapped users. And that's the black line. A black line is the posterior mean evolving over time. Now, the ground truth here is that the prior is correct.
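The bootstrap study described here can be sketched as follows. This is a toy under stated assumptions, not the study's algorithm: a scalar Gaussian Thompson-style bandit stands in for the real one, the state is carried along but ignored, and the residual series are simulated. Only the prior mean of 0.47 comes from the talk; the variances, function names, and the reduced number of bootstrap studies are placeholders.

```python
import math
import random

# 0.47 is the prior mean quoted in the talk; the variances are placeholders.
PRIOR_MEAN, PRIOR_VAR, NOISE_VAR = 0.47, 1.0, 1.0

def prob_send(mean, var):
    """Toy randomisation rule: posterior probability that the effect is positive."""
    return 0.5 * (1.0 + math.erf(mean / math.sqrt(2.0 * var)))

def replay_bandit(trajectory, rng):
    """Replay the toy bandit over one user's (state, residual) series.

    Returns the path of posterior means of the treatment effect.
    """
    mean, var = PRIOR_MEAN, PRIOR_VAR
    path = []
    for _state, residual in trajectory:
        if rng.random() < prob_send(mean, var):
            # Action 1: add the treatment effect back in according to the prior,
            # because the ground truth being tested is "the prior is correct".
            reward = residual + PRIOR_MEAN
            # Conjugate normal update of the treatment-effect posterior.
            precision = 1.0 / var + 1.0 / NOISE_VAR
            mean = (mean / var + reward / NOISE_VAR) / precision
            var = 1.0 / precision
        # Action 0: the residual is left alone; this toy learns nothing from it.
        path.append(mean)
    return path

def bootstrap_study(trajectories, rng):
    """Resample users with replacement, replay the bandit on each,
    and average the posterior mean across users at every decision point."""
    sampled = [rng.choice(trajectories) for _ in trajectories]
    paths = [replay_bandit(t, rng) for t in sampled]
    return [sum(step) / len(step) for step in zip(*paths)]

rng = random.Random(0)
# Simulated stand-in for the 91 users' residual series, 30 decision points each.
users = [[(0, rng.gauss(0.0, 1.0)) for _ in range(30)] for _ in range(91)]
# Each element is one "black curve"; the talk used 2000 of them.
curves = [bootstrap_study(users, rng) for _ in range(20)]
```

The observed curve from the real trial would then be compared against the spread of these curves.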

And in fact, you see, if the prior were correct, you get all these lines. It's a big mass because this is a stochastic algorithm. It's a big mass as it goes through time. But the blue line, the truth, rapidly deviates from that mass and goes below, so the actual study very quickly learns that the prior was incorrect. OK. So I'm getting close to the end. So this again, as I said earlier, we had a number of situations like this.

And this is what motivated all this work, and to me this is really important, because right now we have so many examples of AI producing horrendous false results, and we cannot have that. I mean, as a statistician, I want to make sure that, whatever we say, I try to give you some measure of confidence with it. OK, so I'm going to show you data from a user who exhibited very interesting personalisation.

This was not the only user like this one. So remember, this is going on in Seattle. The x axis is the date in the study. This individual joined the study on November 15, 2019, pre-COVID. This individual exited the study at the beginning of August 2020, in the middle of the pandemic. Seattle closed down in March. OK, so let's talk about the y axis.

Each dot is a ratio. It's the ratio of the posterior mean of the treatment effect in that user's current state divided by the posterior standard deviation of the treatment effect in that user's current state. And the reason why we're graphing this on the y axis is that this directly leads to the posterior probability. The higher it is, the higher the posterior probability of sending a notification; the lower it is, the lower.
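To see why this ratio drives the randomisation, here is a small sketch. It assumes a Gaussian posterior and that the probability of sending is the posterior probability that the effect is positive, which is the standard normal CDF evaluated at the ratio. The clipping bounds of 0.2 and 0.8 are the ones mentioned in the Q&A; the function name and the exact rule are assumptions, not the study's specification.

```python
import math

def send_probability(post_mean, post_sd, p_min=0.2, p_max=0.8):
    """Clipped posterior probability that the treatment effect is positive.

    Under a Normal(post_mean, post_sd**2) posterior, P(effect > 0) equals the
    standard normal CDF at post_mean / post_sd, so it is monotone in the ratio.
    """
    ratio = post_mean / post_sd
    p = 0.5 * (1.0 + math.erf(ratio / math.sqrt(2.0)))
    return min(p_max, max(p_min, p))

for ratio in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(ratio, round(send_probability(ratio, 1.0), 3))
```

Large positive ratios saturate at the upper clip and large negative ratios at the lower clip, so no action is ever chosen with probability near zero or one.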

So OK, there's a dot for every one of these five decision times. You know, a lot of dots are on top of each other. Now the dots have two colours. The blue dots are when the user's current state indicates the person is engaged, and the red dots are when the user's current state indicates the person is not engaged. Engaged here means you're just watching your app, because the app had all kinds of things you could do on it.

And the red state means you're just not watching the app quite as much. It's fascinating because in general, if in that state you haven't been watching the app, then the algorithm says your treatment effect is much lower than if this individual had been watching, was engaged. That's the blue. This is glorious. I mean, I show this to domain scientists, and they love it, right? Because this is what it means to personalise.

With higher probability, when someone's engaged, they're sent a message, and with lower probability they're sent a message when they're less engaged. It sounds wonderful, but is this even real? I mean, you know, this is a stochastic algorithm. Things happen by chance. OK, so here's our exploratory data analysis to think about this, and this is only a first effort. There are other ways to pose this problem. OK, so here again, the x axis is the date in this study.

This individual's date in the study. And the blue curve is their estimated effect of engagement at that time in the study, and it starts off at zero because the prior mean for engagement was zero. All of these curves start off at zero for that reason. And so the blue curve is the real data. There are 2000 black curves. Now what we're doing is getting 2000 bootstrap versions of this user.

So let's just think of one bootstrap version of this user. We formed that state, residual, state, residual, state, residual series, like I talked about in the prior slide, just for that user. And now we bootstrap those state-residual pairs. We resample those. And we run the bandit algorithm under the ground truth that there's no effect of engagement, but everything else stays the same.

So we just set the weight for that, the posterior mean for that theta, equal to zero. So now each black line is what the bandit thinks that individual's posterior mean for the treatment effect is as the study progresses. And indeed, you see, at the very beginning, the real data is highly consistent with no effect. I mean, the blue curve is well within the mass of the black curves.
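The two ingredients of this within-user bootstrap, resampling one user's state-residual pairs and regenerating rewards with the engagement weight fixed at zero, can be sketched as below. The state is represented as a dict with an "engaged" flag and the effect as a base term plus an engagement term; both are simplifying assumptions made for illustration, as are the function names and numbers.

```python
import random

def resample_pairs(pairs, rng):
    """Draw len(pairs) (state, residual) pairs with replacement from one user."""
    return [rng.choice(pairs) for _ in pairs]

def reward_under_null(state, residual, action, base_effect):
    """Regenerate a reward when the engagement weight is set to zero.

    Everything else stays the same: the base effect is still added back
    whenever the bandit chooses action 1.
    """
    engagement_weight = 0.0  # the ground truth being tested against
    effect = base_effect + engagement_weight * state["engaged"]
    return residual + action * effect

rng = random.Random(1)
# Simulated stand-in for one user's series: alternating engagement, random residuals.
pairs = [({"engaged": t % 2}, rng.gauss(0.0, 1.0)) for t in range(10)]
boot = resample_pairs(pairs, rng)
```

Replaying the bandit on many such resampled series would give the black curves against which the user's real curve is compared.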

Well within that mass. But as time goes on, the blue curve drifts to the top. And in fact, I have a statistic here: about eight percent of the black curves have a positive posterior mean ninety-five percent of the time, whereas the blue curve has a positive posterior mean treatment effect of engagement ninety-five percent of the time. Only eight percent of the black curves have that.
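One reading of this statistic is: compute, for each curve, the share of time points at which the posterior mean is positive, then ask what fraction of bootstrap curves reach at least the share seen in the real curve. The sketch below implements that reading; the function names and the tiny curves are made up, and the talk does not spell out the exact definition.

```python
def share_positive(curve):
    """Fraction of time points at which the posterior mean is positive."""
    return sum(1 for value in curve if value > 0) / len(curve)

def bootstrap_tail_share(observed_curve, bootstrap_curves):
    """Return (observed share, fraction of bootstrap curves at least as positive)."""
    observed = share_positive(observed_curve)
    as_extreme = sum(1 for c in bootstrap_curves if share_positive(c) >= observed)
    return observed, as_extreme / len(bootstrap_curves)

# Tiny made-up example: one observed curve and four bootstrap curves.
observed_curve = [-1.0, 1.0, 1.0, 1.0]
bootstrap_curves = [
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, -1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
]
observed, tail = bootstrap_tail_share(observed_curve, bootstrap_curves)
```

In the talk's example the observed share is about 0.95 and the tail fraction about 0.08.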

It's very interesting if you compare this to the prior graph, because it looks like at the very beginning, you know, into March and April, being engaged means it's better to send a message. But the black lines indicate that the blue curve is well within the variance you might expect even if there's no effect of engagement.

It's only when you get to June of 2020 that you start to see some indication that there's enough evidence that engagement really should be taken into account. This is my last slide. So here, what we did was we used a sequential online decision-making algorithm, our personalisation algorithm. But did we achieve personalised digital health, personalised sequential decision making?

So in this whole analysis, what I did was I assumed that each user's state and reward followed a classical bandit, that is, that prior actions don't influence future rewards. Even in that setting, even if you're willing to make that assumption, how could you do a better job assessing personalisation? This is completely open.

And if the bandit environment assumption is violated, which it definitely is, because if I send too many notifications, in future you're probably going to be less responsive, how do you assess this? How do you assess this in that kind of setting? As far as I know, there's just nothing there. And these are uniquely statistical questions, and they're critical for using AI in sequential decision making. Thanks. Thank you. So is there any question from folks?

If you want to talk, you can use your speaker or leave a question in the chat. So I have one, actually. When you work in this kind of context, I would be a bit paranoid, you know, about missing some confounders or, you know, non-stationarity. I mean, a lot of people are looking at that. Are you being extremely careful in the design of the state space to really consider these kinds of things?

All right. Yeah, so these are designed experiments, right? So an enormous amount of work goes into deciding what will be sensed, and it is related to what the scientific domain says should be important. That said, this is a very immature area of science, so it's implausible that we collected the entire state, and that is the reason why I think we're always going to see some element.

We're going to get the appearance of non-stationarity, not because it might be true, but rather because people are moving to another state and we don't know it; they look to be in the same state to us. So the issue of non-stationarity is one of the reasons why you don't really ever want to let those probabilities go to one or zero. You want to be able to do intermittent analysis where you do off-policy analysis and you look,

you know, are we getting some evidence of non-stationarity here? Yeah. Is there any other question? I mean, related to your policy evaluation. I've done a bit of that recently. So I you you use the one you penalty was basically violence only, but I guess it was before a long time. So if it doesn't work, though, what kind of things developed? Well, it's a high-variance estimate here. That's number one. But right now, there is a lot of research.

This is a very active area of research. How do you do off-policy estimation when you only have one trajectory? And I think there's a new paper on arXiv that came out, maybe in the last couple of months, by Hadad; Susan Athey, I think she may be on that, and she has a way now. A lot of the problem. We did a lot of simulations for this and we didn't see any evidence.

But simulations are not proofs, right? And we also ourselves have work on how you can do estimation after you've adaptively sampled, and there are different ways to weight to try and adjust for it. The big problem is really that under certain scenarios you get aberrant behaviour. In our simulations, we didn't see this, but when we're writing all of this up, we'll probably take whatever's most recent in the literature and use it.

Yeah. Yeah. But it's great you asked that question, because this is an area of very active research. How do you do off-policy learning on one trajectory, not on independent trajectories in which you have an independent bandit, but one trajectory? Yeah. I'm actually kind of, you know, real vocal learning, and I know they are like, not the economy that other work in that way, but he can be quite positive on that.

Yeah, no question about that. You have no idea, but your estimate that you get of it when you do it right. It's interesting, though. You have to work hard to get it to misbehave. It depends, of course, on what you're estimating. And also notice we were clipping at 0.2 and 0.8. This changes things enormously. Exactly. And that's why the simulations probably worked out OK for us.
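Why clipping matters for off-policy estimates can be illustrated with a plain inverse-probability-weighted value estimate. The talk does not say which estimator was used, so this is a standard textbook choice substituted for illustration; it also ignores the dependence within a single adaptively collected trajectory that the discussion raises. The function name and the logged data are made up.

```python
def ipw_value(log, target_prob):
    """Inverse-probability-weighted estimate of a target policy's mean reward.

    log: list of (state, action, reward, behaviour_prob), where behaviour_prob
    is the probability with which the logged action was actually chosen.
    Returns (estimate, largest importance weight).
    """
    weights = [target_prob(s, a) / p for s, a, _r, p in log]
    value = sum(w * r for w, (_s, _a, r, _p) in zip(weights, log)) / len(log)
    return value, max(weights)

# Made-up log; every behaviour probability lies in the clipped range [0.2, 0.8].
log = [(0, 1, 1.0, 0.8), (0, 0, 0.2, 0.2), (1, 1, 0.5, 0.5), (1, 0, 0.1, 0.5)]

def always_send(state, action):
    """Hypothetical target policy that always chooses action 1."""
    return 1.0 if action == 1 else 0.0

value, max_weight = ipw_value(log, always_send)
```

With probabilities clipped to [0.2, 0.8], no importance weight can exceed 1 / 0.2 = 5, which limits the variance; probabilities close to zero would allow arbitrarily large weights.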

If we had allowed the probabilities to get close to zero or one, that's where you really get the problems. So, is there any other question for Susan? Oh. So I've got a question in the chat, asking whether you could explain a bit more on the bootstrap sampling within each individual. Yeah. So there were two types of bootstrap samples, one in which we bootstrapped individuals and one in which we had one individual and we just bootstrapped within that individual.

In both cases, the first step was the same. For each individual, we have a whole time series of state, action, reward, state, action, reward, state, action, reward. So what we would do is calculate the posterior mean, at the end of the study, of that individual's reward function, their mean reward in that state at that time, and we subtracted that from the reward.

So now we had state, action, reward minus posterior mean in that state; state, action, reward minus posterior mean in that state. I'll call those differences residuals. We throw away the action, so now we have state, residual, state, residual, state, residual for each of the 91 individuals. We have a whole series, and then we do the bootstrap sampling under a ground truth that we're trying to test against.
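The shared first step, turning a state, action, reward series into a state, residual series, can be sketched as follows. The helper passed in as `posterior_mean` is hypothetical and stands for whatever end-of-study posterior quantity is subtracted. The example follows the description earlier in the talk, in which only the treatment effect is removed for decisions where action 1 was taken; the effect values are made up.

```python
def to_residuals(series, posterior_mean):
    """Convert [(state, action, reward), ...] into [(state, residual), ...].

    posterior_mean(state, action) is the posterior quantity subtracted from
    the reward. The action is dropped from the output.
    """
    return [(s, r - posterior_mean(s, a)) for s, a, r in series]

# Made-up end-of-study posterior mean treatment effect for two states.
effect_by_state = {0: 0.1, 1: 0.4}
series = [(0, 1, 1.0), (1, 0, 0.5), (1, 1, 0.9)]

# Remove the treatment effect only where action 1 was actually taken.
residuals = to_residuals(series, lambda s, a: a * effect_by_state[s])
```

Subtracting a different posterior quantity, such as the full mean reward in that state, would only change the function passed as `posterior_mean`.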

It's like a null. And so in the first case, the ground truth was that the prior means were correct, the ones we built off of HeartSteps V1. And so the way that happens is you take one bootstrapped individual, which is state, residual, state, residual, state, residual, and you run the bandit on that one individual. And so when the bandit sees the state, it chooses an action.

If the action is one, you add back in the mean from the prior. If the action the bandit chose is zero, you leave the residual alone; that's just the reward now. And you just move through time like that for that one person, and you do it for all 91 people that you bootstrapped. So in the first case, we bootstrap individuals. We bootstrap trajectories.

And in the second case, we actually just had one individual, state, residual, state, residual, state, residual, and we bootstrapped those little pairs. Thank you. Is there any other question for Susan? No? Well, let's thank Susan again. Thank you. Yes, thank you very much. You couldn't. You couldn't go to Expo. You will be able to welcome you soon. And also thank you again for OK. Yeah, have a great weekend. Thank you. Thank you very much. Bye.

Transcript source: Provided by creator in RSS feed: download file