Hello! Welcome to Casual Inference. I'm Lucy D'Agostino McGowan from Wake Forest University. And I'm Ellie Murray. I'm an epidemiologist and I write about epidemiology and public health on Substack at epiellie.substack.com. We are a partnership podcast with the American Journal of Epidemiology. And we have... funding from Wake Forest University? Yes, the Andrews Church Faculty Fund and the Department of Statistical Sciences.
Excellent. Yeah, I think that's it. Well, I'm excited today. We have a very exciting guest that we're going to be speaking to about some causal and casual inference type topics. So should we go ahead and introduce them, Ellie? Do you have any follow-up before we do that? No, let's dive right in. Okay, great. Well, I am pleased to introduce Noah Greifer, who is a statistical consultant and programmer at Harvard University and the Institute for Quantitative Social Science. I know him mostly across the R universe; he's got some excellent packages for handling some complex causal inference type techniques. And so I'm really excited to talk with Noah today about some of that. And I think we'll also get into some different ways to estimate causal effects that I think we maybe haven't covered yet on the podcast. We've covered a lot, so I think this should be a fun conversation. So, Noah, welcome. Thank you. I'm very happy to be here, very honored to be asked to be a guest. Well, we are happy to have you. It's exciting to feel like, even six seasons in, we have so many people we still haven't spoken with. So I'm so glad to finally have you on.
Do you want to maybe tell us a little bit about the work that you do, and then we can dive into some of our questions? Sure. So maybe unlike a lot of your guests, I'm actually not a researcher. I'm a statistical consultant, which means I help researchers with their research. I work at Harvard; my office is at IQSS, and the Data Science Services team supports research at Harvard. So any statistical question that anybody has, we help them with an answer. That is a big part of my job. Another part of my job is writing R packages. Probably your listeners know R is statistical programming software for analyzing data, and an R package is kind of like a collection of functions that makes it easier to implement some method. And I have written maybe 10 or 15 R packages in my time, many of which are related to causal inference, others that are just related to inference broadly, and some for my job that are related to topics I don't even really understand. But my job is to allow researchers not to have to spend time writing an R package; instead, I do that and they get to continue doing their research. My area of expertise is causal inference in observational studies,
in particular propensity score analysis, like propensity score matching and propensity score weighting. These are methods that you can use under a specific assumption, that you've collected all relevant confounders in an observational study, which is maybe a tenuous assumption, but if you're willing to make it, there are methods available, and I kind of work on those methods, in particular their implementation. So I haven't done a lot of statistical research, and I don't really have a substantive area of expertise, but I consider myself really focused on statistical computing and programming. I feel like the other place where I've seen your name: just last week I was trying to figure something out, and I can't remember what it was.
But I came across one of your posts on Stack Overflow. I feel like this is one of the areas where you've been very prolific. You and Frank Harrell are two of my go-tos. If I see that you've answered the question or Frank has answered the question, I'm like, all right, this is the good stuff. Oh, thank you. Yeah, I love Stack Overflow. Somehow they've managed to gamify statistical consulting, and so I love getting points for answering a question or when someone upvotes my answer. I just really enjoy that. A lot of people have given me positive feedback on my answers, so it feels really nice to contribute to the community in that way.
Yeah, I wish I could remember what it was. It was definitely something propensity score related. Maybe it was missing data, or maybe it was a balance question. I can't remember. Anyways, whenever I find something, I'm like, oh good, Noah's thought about this carefully; I can just impute his thoughts into my brain. Well, good. So the packages that I am most familiar with that you have written are the MatchIt and WeightIt packages, although I know you have others; cobalt, right, for balancing.
So maybe we could talk a little bit about the recent changes you've made to the WeightIt package, maybe your thoughts behind that. And then, just broadly, you mentioned that your area of expertise tends to be when you're trying to answer some kind of causal question with observational data, maybe using something like a propensity score. And I know recently, just from seeing some of what you've been putting out, it seems like you're moving towards some different methods for weighting that are not ones that I've used before. So I'm just curious to hear about that as well. Yeah, love that. So for those that don't know, WeightIt is an R package that allows you to implement propensity score weighting and other related weighting methods for estimating causal effects.
Weighting methods work by estimating weights for each individual in your sample. And the idea is that in the weighted sample, if you've collected all of your confounders and performed the weighting correctly, it's like you have a randomized experiment. Obviously that's only under certain assumptions, so it's not like the weighting gives you that for free. But in your weighted sample, if you have two groups, let's say treated and control, the treated and control groups will look similar to each other on all of the background variables that you're trying to adjust for. And then you can estimate the causal effect, the treatment effect, in a pretty simple way, just as a kind of weighted difference in means, or a weighted regression of the outcome on the treatment and possibly the covariates.
And there are a lot of ways of estimating these weights. The classic, theoretically justified method is propensity score weighting, where you estimate a propensity score, which is the probability of receiving treatment, and then you compute the weights as a function of the propensity score. And there's a formula depending on which target population you are trying to generalize your estimate to. And with those weights, you assess the degree to which balance was achieved, which is to say that the distributions of the confounding variables are similar between the treatment and control groups. If they're well balanced, then you can estimate your treatment effect using those weights. So that's what WeightIt does. And WeightIt offers a nice, simple interface that allows the user not to have to program too much, which I think makes it accessible to non-programmers and those who are maybe not as methodologically advanced, but it gives them access to these advanced methods that have been described in the literature. There are many ways to estimate propensity scores. The classic way is with logistic regression, which is just a regression of the treatment, if it's binary, on the covariates, and then you generate the weights using the predicted probabilities as the propensity scores.
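As a minimal sketch of that workflow in base R, using a simulated data frame with hypothetical variable names (a binary treatment A, confounders x1 and x2, and an outcome y):

```r
# Simulate a small observational dataset (hypothetical variable names)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
A  <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))  # treatment depends on confounders
y  <- 1 + 2 * A + x1 + 0.5 * x2 + rnorm(n)               # true treatment effect of 2
d  <- data.frame(A, x1, x2, y)

# 1. Estimate the propensity score with logistic regression
ps_mod <- glm(A ~ x1 + x2, data = d, family = binomial)
e      <- predict(ps_mod, type = "response")

# 2. Convert propensity scores to ATE (inverse probability) weights
w_ate <- ifelse(d$A == 1, 1 / e, 1 / (1 - e))

# 3. Estimate the treatment effect as a weighted regression of y on A
#    (point estimate only; a robust/sandwich SE would be needed for inference)
coef(lm(y ~ A, data = d, weights = w_ate))["A"]
```

In recent versions of WeightIt, the same weights come out of something like weightit(A ~ x1 + x2, data = d, method = "glm", estimand = "ATE"), which also handles the other estimands and weighting methods discussed below.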
And logistic regression is one of many, many ways to estimate those weights, and WeightIt just implements some of these other methods so that the user doesn't have to manually program their own estimator. Maybe, and I just want to go back, actually, because you mentioned something about the target population, which I know you have that paper on with Liz that is so great. I send it to all of my students, and when we do workshops we show it; we have a table. We're writing this intro causal inference text where we have basically cited your paper and used a replication of that table. It's so good, because I think it really helps people, like practitioners, understand what's going on behind the scenes when we're targeting different groups. So maybe we can take a quick detour and briefly talk about that, and then we can come back to my question about the different types of ways to estimate these weights.
Yeah, I think you described it so well. So the propensity score, all that is, and I think we have talked a little bit on the podcast about this, is, exactly as you said, just the probability of getting the treatment given some known covariates. And typically those would be the confounders; if you have them all, that should be the confounders. And then you would take that and you can basically up- or down-weight individuals in your population so that they sort of look like whatever the target is. So the most basic way is with something like an average treatment effect weight, which would try to generalize to the whole population. Sometimes what I like to do, and this is hard to do verbally, so maybe you can both help me, but when I'm trying to help students understand different target populations, sometimes I like to make these weighted tables. First I'll make a table in our study population where I have a column for all of my confounders for the people in the treated group, a column for all my confounders for the people in the control group, and a column for all my confounders overall. By that I mean basically a summary table, so it'll show maybe the mean or the median of each of the continuous variables with some kind of measure of spread, and then it will show a percent for my binary variables. So I have this split by treatment and control, and then one overall column. And what weighting is doing, in some ways, when you're getting an average treatment effect weight, this overall average, is trying to redistribute the confounders so that everybody's distribution looks like that overall column. So if overall the average age was 54, but in the treatment group it was 75 and in the control group it was 32, then after applying the average treatment effect weights, you would expect the weighted average in each group to be 54; it would look like that overall column. So you're up-weighting the control folks that have older ages and down-weighting the treatment folks that have older ages to get them to match this overall distribution. And likewise, you can instead weight everybody to look like the treatment population. This would be the average treatment effect among the treated, and if you do that, then everyone's going to look like that treated column in terms of the distribution of their confounders, so the weighted average age would be like that 75 in the treatment arm. Or you could do the average treatment effect among the controls, which would make everybody look like that control column, so the weighted average age would be like 32, or whatever I said; I'm losing track of my numbers.
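A small sketch of those weighted tables, assuming the simulated data frame d from the earlier sketch and using WeightIt with cobalt; the exact display arguments (disp = "means", un = TRUE) are my recollection of cobalt's interface, so treat them as an assumption:

```r
library(WeightIt)
library(cobalt)

# Weights targeting different populations (same hypothetical data d as above)
W_ate <- weightit(A ~ x1 + x2, data = d, method = "glm", estimand = "ATE")
W_att <- weightit(A ~ x1 + x2, data = d, method = "glm", estimand = "ATT")
W_atc <- weightit(A ~ x1 + x2, data = d, method = "glm", estimand = "ATC")

# Weighted "table 1": after weighting, the group means should resemble the
# overall column (ATE), the treated column (ATT), or the control column (ATC)
bal.tab(W_ate, disp = "means", un = TRUE)
bal.tab(W_att, disp = "means", un = TRUE)
bal.tab(W_atc, disp = "means", un = TRUE)
```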
And then there are these middle zones, and maybe this is where I want to ask you a question. Sorry, I've been talking for a long time, but there are some of these middle, kind of equipoise-type weights, which, rather than giving you exactly the treated column, exactly the control column, or exactly the overall column, might give you some average treatment effect among this kind of overlap, or, you know, some that think about it in terms of a matchable, evenly matchable type population. And that would be some fourth option that's maybe different from anything we've seen in the individual columns of the data. And maybe that's the one: can you tell us about when that might be useful, or what that might mean, or how you think about that type of weight? Because I think some of these other weights that you're using are also estimating that kind of middle zone. Am I right? Or maybe I'm... Sure. Yeah, I love talking about this, and thank you so much for reading this paper and telling people about it. I wrote it with Liz Stuart, who was my postdoc advisor at Johns Hopkins,
a wonderful person, really inspirational to me. Yeah, so as you said very clearly, there are these different estimands, or different target populations, and there is this kind of odd one: the average treatment effect in the overlap, or matched sample, or equipoise group. Different people have different ideas about when you should use this. Some people think it's a worthwhile estimand in itself. When I say the word estimand, I just mean the quantity that you are estimating, the thing you're estimating, usually with reference to the target population. So I'll say estimand and mean, kind of, the overlap group; or if we were looking at the average treatment effect in the treated, that estimand is the estimand for the treated group. So, yes, this overlap estimand. Some people think that it's an estimand unto itself that is worth pursuing because it represents this population about whom you have the least certainty, which in some contexts would be called the equipoise population. I tend to think of it more as almost like a fallback. It's for when your data does not allow you to estimate one of the other, more interpretable estimands because your groups are too dissimilar from each other, or there isn't enough overlap between your two groups. You kind of focus your inference on just the areas where there is overlap. It's a limitation in the sense that you can no longer make a claim like, this is what would occur on average if everybody were to be treated versus if everybody were to be untreated. It's a much narrower claim, which is just, for the overlap group, what the effect would be were these people to be treated versus untreated. But I kind of respect the honesty of the limitedness of that claim, because you are acknowledging that you can't make inferences about everyone; your sample just might not allow you to make a claim about the whole population. Let's say your treated group is so vastly different from your control group
that you have no idea what some of your treated units would look like had they not received treatment. In that case, you can't target the ATE, the average treatment effect in the population, and the ATO, the average treatment effect in the overlap, is almost like a consolation prize. But rather than extrapolating your inference to the overall population, which some statistical methods do, you are more honest about targeting your inference to a more limited population, the average treatment effect in the overlap. This effect tends to be estimated with more precision because it requires less extrapolation, so that's a reason to pursue it. And there are some theoretical proofs that show that this has the optimal precision. So it's not just, oh, it's a little better; it's that of all ways to estimate weights, these weights yield the best precision, the most precise effect estimate. So there are good theoretical reasons to pursue this estimand, but I usually see it as a last resort for when the other estimands just aren't supported in the data, mostly because I find it less interpretable, and it's not really an estimand you can describe prospectively. It's kind of a thing that arises from the sample rather than, you know, a natural policy target.
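For reference, writing the propensity score as $e(X_i) = \Pr(A_i = 1 \mid X_i)$, the weights for these estimands can be written as:

$$
w_i^{\text{ATE}} = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}, \qquad
w_i^{\text{ATT}} = A_i + (1 - A_i)\,\frac{e(X_i)}{1 - e(X_i)}, \qquad
w_i^{\text{ATO}} = A_i\,\{1 - e(X_i)\} + (1 - A_i)\,e(X_i).
$$

The ATE and ATT weights can blow up when $e(X_i)$ gets close to 0 or 1, while the overlap (ATO) weights are always between 0 and 1, which is an informal way to see the precision result Noah mentions.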
Yeah, I think that's a great point. And actually, it brings up what Ellie and I talked about on the last episode, which has not yet been released but will be by the time this comes out. We talked about different assumptions of causal inference; we were trying to tie them back to large language models. But one of the ones that we covered was positivity, this positivity assumption, which is basically what you were just describing about when there's lack of overlap, either due to structural positivity violations or maybe just some kind of stochastic positivity violation. Essentially it means that there's some space in this confounder space, or you could think of the propensity score as a summary of this multi-dimensional space, where there really aren't any individuals in one of the groups, which would imply that their probability of getting one of the treatments was zero or close to zero. So that kind of gives you this violation: if they really have no chance of getting the treatment, there's no counterfactual to estimate, and so you're stuck with this overlap. But I really like your point that in some ways it's a fallback for when you can't estimate things well.
And as you said, it's something that has to do with the sample; it's something you can only kind of estimate retrospectively. Although I do think it can be used prospectively, in the sense that if you provide an overlap estimate and also a table describing who's in that population, then, say I'm a clinician and I see that there is this positive causal effect in this overlap population and I'm not sure what to do with it with my patients: I can look at the weighted table and see whether the population that I serve falls within the bounds of all the different confounders that are in that table. And that can give me some sense for whether this will apply to me, or whether my patient
is more likely in this kind of non-positivity range where I really don't know what to do in terms of their treatment. Yeah, I think that all sounds grand to me. So, one thing I like about the overlap weights, I will say, is that sometimes they're analogous to a match with a caliper, which just means you only allow individuals to be matched to other individuals as long as they fall within some threshold of difference between each other. And this threshold is known as the caliper, and there are rules of thumb for how to pick a caliper, but in theory a researcher can pick whatever caliper they want, and you can get a different population and also a different estimate depending on what that caliper is. And with matching, depending on how you're doing it, in the most basic ways you kind of just throw out people that don't have matches, so they basically get a weight of zero, and then everybody who's in your population gets a weight of one. And I think these overlap-type weights sort of serve as a more continuous way to deal with the same type of problem, which as a statistician is very satisfying: keeping things continuous and having me make fewer choices about what my caliper will be, letting the data do that for me, so I'm not dictating how the analysis goes based on some choices that I'm making.
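As a minimal sketch of the caliper idea in R, assuming the hypothetical data frame d from the earlier sketch (the 0.2-standard-deviation caliper here is just the common rule of thumb, not a recommendation):

```r
library(MatchIt)

# 1:1 nearest-neighbor matching on the propensity score, with a caliper of
# 0.2 standard deviations of the propensity score
m <- matchit(A ~ x1 + x2, data = d,
             method   = "nearest",
             distance = "glm",
             caliper  = 0.2)

summary(m)           # balance and how many units were discarded
md <- match.data(m)  # matched sample; unmatched units are dropped (weight of zero)
```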
But I will say, from a practical standpoint, and this is something I've thought a lot about, and you probably think a lot about too as a consultant: I've worked on papers where a matched estimate is what the clinician understands better, because everyone is a whole number; they're either in the study or they're not. It's a little easier to draw the connection to a randomized trial where we have exclusion criteria, so they can wrap their head around that idea. We don't have a table with partial people; with weights, you can have 0.3 of a person in a table. And so if in practice I'm getting basically the same estimate with maybe a little bit less precision, but my clinician is going to trust the result and actually do something with it, then maybe, I don't know, maybe matching is better in that regard, because the end goal is to get people to actually make the right policy choice or whatever for their patients or themselves or whoever. Well, as you know, I love matching too. I don't need to be convinced, but yes, people have different reasons for performing one or the other. Yeah, I think it's interesting to think about this idea of overlap and the idea of positivity, because if you don't have any positivity violations, then you should have perfect overlap,
because your two groups should be basically the same. But obviously we have random positivity violations all the time. I think it's kind of interesting to think about how you could use the overlap population to identify those more structural positivity violations, to say these are people that really are not getting one of the treatments or the other. And I also think, moving back to something that you said at the very beginning, which I think was a little bit opposite of the way Lucy said it: I think that those people who are in the overlap, those are the people you have uncertainty about. Lucy mentioned, oh, the people outside the overlap, you don't know what to do with. But I think it's really more the opposite: those are the people that clinicians know what to do about, and that's why they're always getting one thing or they're always getting the other thing. It might not be the right choice, but clinicians agree, and therefore there's at least a perception of less uncertainty. So I think selling it, saying it that way, that this is focusing in on these people where really treatment could go one way or the other and clinicians have uncertainty, and that's why we have overlap between the two, I think that's a really interesting way to think about the overlap effect. But yeah, if you have structural positivity violations, then prospectively there shouldn't be any difference between your average and your overlap, because you shouldn't be including those people that are structural violations in your overall average treatment effect anyway. I don't know, there wasn't really a question there. I'm just thinking it's kind of interesting: as you say, it's sort of a last resort, but also, if your assumptions are met, there shouldn't be any difference,
because positivity is one of the assumptions we kind of have to make, and obviously we never meet it perfectly, but if we did meet it perfectly, there wouldn't be a difference. Yes, certainly. I think that overlap weights are particularly effective when the positivity violation isn't necessarily structural. So some people are receiving each treatment, but it's maybe really rare for a certain type of patient to receive one treatment or the other. And when you use certain types of weights, the probability of being in the treatment group that you are in is in the denominator, and if that's a really small number, it can really blow up the weight, which just means the eventual effect estimate has a ton of uncertainty, because the uncertainty in the effect estimate is a function of the variability of the weights. So even if you have positivity, everybody is eligible for either treatment, but some people are very unlikely to get one or the other, overlap weights can be a way of saving your estimate from being completely washed out by uncertainty. They allow you to focus the precision that you do have, or can have, on a different population about whom you can make a more precise estimate. So even if all the assumptions are met, there is still value in focusing on an overlap population, I think.
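A tiny base-R illustration of that blow-up, along with the usual effective-sample-size diagnostic (the Kish approximation), assuming a treated unit with a very small estimated propensity score:

```r
# A treated unit whose estimated probability of treatment is 0.01
e_small <- 0.01
1 / e_small        # ATE weight = 100: one person counts as 100
1 - e_small        # overlap (ATO) weight = 0.99: no blow-up

# Kish effective sample size: how many "effective" observations the weighted
# sample is worth; highly variable weights shrink it dramatically
ess <- function(w) sum(w)^2 / sum(w^2)
w <- c(rep(1, 99), 100)  # 99 ordinary weights and one huge one
ess(w)                   # about 4, despite n = 100
```

The summary() method for a weightit object reports these effective sample sizes as well, if I recall the output correctly.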
Yeah, when I was in graduate school, one of my colleagues, Lori Samuels, made this Shiny application which I loved. It's not on the web anymore, but I've been trying to figure out ways to re-implement what she had done, and maybe you've thought about this. She called it Visual Pruner, and the idea was that you would look at the distribution of the propensity scores and you could zoom in on an overlap population, either manually or, in theory, using something like these overlap weights to see where most of the mass is. What it would do is highlight the distributions of the confounders among those in the tails that were getting cut off. And the idea was that you could use this as a way to set some exclusion criteria on the front end, so that instead of relying on these overlap weights, which can be a little ad hoc, and they're ad hoc in the sense that they're harder to understand, not ad hoc because you're having to set your own parameters, but anyway, you could figure out what's going on in those tails that maybe is driving the
lack of positivity, and then set that exclusion criteria on the front end to make it more clear for whom this estimate is relevant. I really liked that. And I liked that she used it also for when people would do trimming. You know, in theory, for these other types of weights where the weight can be greater than one, you could trim the weight when it gets too big, to avoid having kind of bad standard errors. But trimming, I think, also can change the estimand in some way; it's sort of an ad hoc way of handling an overlap-type situation where you're inducing some researcher degrees of freedom. And so this idea of, instead of just trimming the weight, figuring out who those people are and whether we can exclude them on the front end, I think is kind of a nice idea.
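A small sketch of weight trimming in base R, assuming the ATE-style weight vector w_ate computed in the earlier sketch; WeightIt also ships a trim() helper for this, though the exact call below is from memory and should be treated as an assumption:

```r
# Trim (winsorize) the largest weights at the 99th percentile
cap    <- quantile(w_ate, 0.99)
w_trim <- pmin(w_ate, cap)

summary(w_ate)   # before: a long right tail of huge weights
summary(w_trim)  # after: capped, at the cost of subtly shifting the target population

# WeightIt's helper does roughly the same thing on a weightit object
# (treat the exact interface as an assumption):
# W_trimmed <- WeightIt::trim(W_ate, at = 0.99)
```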
It was driven by this paper we had all worked on with some VA data, looking at different diabetic populations. And I just remember it was one of these effects that was finicky. It was a big population, because it was lots of veterans, but the effect could blow either way: you could have a significant effect that was right up against the null in one direction, or if you trimmed a little bit, you could end up with a significant effect just on the other side. Which maybe meant we should have just concluded it was an inconclusive result, but it was one of these things where it felt like, as the researcher, we had too much power in terms of determining what side of the null this effect was on. I was not the primary person on this, but my recollection is that they found that the individuals who were getting trimmed all had these unusual A1C values. And so there actually was some kind of exclusion we could set on the front end, so that we weren't having to be like, oh, if we trim at the 99th percentile versus the 98th percentile, we get a totally different answer.
But anyways, I think some of that stuff is very interesting to me. There's not necessarily statistical theory there, but there's a lot of application, and I think that's probably where most people are actually spending their time. Yeah, no, I love this idea of interpretable ways to understand the target population that you're estimating your effect for. Fortunately, WeightIt allows you to implement some of these methods, trimming or choosing your estimand, and one of my other R packages that you mentioned, cobalt, is explicitly designed for assessing the distribution of covariates before or after weighting. And that also can allow you to see individuals who are kind of unusual for a type of treatment, and maybe those are people we just can't make an inference about. And so if we were emulating the hypothetical trial, we might say, okay, we're going to exclude these people. It's not that they are irrelevant; it's just that we can't learn about them, because there's not enough data about them, where they're so unlike anyone else that we can't learn what would have happened had they undergone a different treatment. So it's almost like you're being more fair to that population by not extrapolating an inference for them. So yes, doing that on the basis of the propensity scores is maybe an abstract statistical way, but looking at the specific distribution of covariates that corresponds to your target population, but also the kind of left-out population,
is definitely really useful. And there are a variety of methods for doing that too. So I'm curious, and maybe Ellie, you can dictate which way you think we should go, because I have two different ideas for where we could go next. I either want to hear about, you mentioned some of these other ways to weight, like the, yeah, covariate balancing propensity score. So that's one direction I'm interested in. And the other would be to stay with the parametric weights and think about M-estimation, although I suppose M-estimation maybe doesn't have to be parametric. But anyways, do you have a preference, Ellie, for where we drive next? No, I think either one is interesting, so, Noah, if you have a preferred topic to talk about...
Sure. I would love to talk about the different methods of estimating weights, because I think this is a maybe misunderstood area, but it also ties into estimation, which we can talk about too. As I mentioned earlier, estimating the weights using logistic regression for the propensity scores is kind of the standard, but...
There have been two broad routes that people have gone down to improve the estimation of the propensity scores. One route uses machine learning to fit a really flexible model for the propensity score that allows for non-linearities and interactions and makes few functional form assumptions. And these methods are really popular in certain fields.
They often play a role in doubly robust estimators, like some of your previous guests have talked about, because they are more likely to capture the true data-generating mechanism, which then gives the effect estimate nice statistical properties in large samples. So that's one road that weight estimation can go down, and that's implemented in WeightIt. And another road that people have gone down is what you might call optimization- or balance-focused. Instead of estimating the weights to get a good fit for the propensity score, you estimate the weights in such a way that balance is achieved directly. So rather than just using the theoretical properties of a well-estimated propensity score, you almost nudge the propensity score in the direction of balance. And then there are some weighting methods that even just say, hey, if we want balance, let's just skip the propensity score and estimate weights directly that give us balance, which I think is really cool. And some recent research that I've been very excited about lately has been showing that even those methods where you are claiming not to be estimating a propensity score and just going for balance implicitly propose a model for the propensity score, and that is kind of an assumption that is less visible. But maybe you want to focus on this latter class of more optimization- and balance-based methods, because that's just where my expertise is, and I think the machine learning side is better understood by some other people.
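In WeightIt, these two routes correspond to different method choices; a minimal sketch with the same hypothetical data d (each method may need its own backend package installed, and the defaults here are assumptions):

```r
library(WeightIt)

# Route 1: a machine learning model for the propensity score
W_ml   <- weightit(A ~ x1 + x2, data = d, method = "gbm",  estimand = "ATE")

# Route 2: balance-focused estimation
W_cbps <- weightit(A ~ x1 + x2, data = d, method = "cbps", estimand = "ATE")  # covariate balancing PS
W_ebal <- weightit(A ~ x1 + x2, data = d, method = "ebal", estimand = "ATE")  # entropy balancing: skips the PS model, balances directly

# Compare achieved balance across approaches
cobalt::bal.tab(W_ml)
cobalt::bal.tab(W_cbps)
```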
This is fascinating to me. So I guess one question that immediately comes to mind is about the overlap weights. I don't fully know the history of how they popped up, but from what I've heard in conversations, it was sort of a happy, accident isn't the right word, but the fact that they perfectly balance. Maybe I'm talking in circles. What I'm trying to say is, if you put your confounders into a logistic regression model and use the propensity score generated from that to get the overlap weights, a property of these overlap weights is that any confounder that was in that logistic regression model will be perfectly balanced on the mean, because of the way the weight is constructed. It has this very nice variance property, and you can kind of derive it in either direction: you could start with, let's say we're going to fit this logistic regression model to these variables, and ask what would perfectly balance them all on the mean; or we could say, we did fit this model and this is the weight. In any case, it seems like that should fall under the class of these balancing weights that you're describing. In my mind, I feel like the overlap weights should be among them, as if you were implicitly fitting this logistic regression model behind the scenes. Am I thinking about that the right way?
Kind of. That is definitely related. And the fact you mentioned is that when you estimate these overlap weights, which have a specific formula, in particular using logistic regression, but only logistic regression, the covariate means in the weighted sample are exactly equal to each other between the treated and control groups, which is great. You get this exact balancing property, which you kind of just get for free. That is a quirk; it's incidental in some ways. It's a quirk of the formula for these weights and logistic regression, but it doesn't happen if you use logistic regression for other types of weights, and it also doesn't happen if you use a different model for the overlap weights.
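That quirk is easy to see empirically; a quick sketch, assuming the hypothetical data and packages from the earlier sketches:

```r
library(WeightIt)
library(cobalt)

# Overlap weights built from a plain logistic regression propensity score
W_ato <- weightit(A ~ x1 + x2, data = d, method = "glm", estimand = "ATO")

# Every covariate in the model is balanced on the mean essentially exactly:
# the standardized mean differences are zero up to numerical error
bal.tab(W_ato, stats = "mean.diffs")
```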
This actually is a great segue, because the methods I want to talk about encompass these overlap weights. Basically, the idea is, maybe starting from logistic regression: when you fit a logistic regression model, you estimate the parameters using maximum likelihood. And the way we know that the likelihood is maximized is when the partial derivatives of the likelihood function are equal to zero. The likelihood is just a kind of complicated function of the parameters you're trying to estimate and your data, and one way of solving for the maximum is just to find that point, which is to say the values of the parameters in the logistic regression model that yield the derivative of the log likelihood being equal to zero. The derivative of the log likelihood we call the score equation. Yes. Can I pause you for one second? Just to give my non-calculus-thinking listener the quick version of what I like to tell my students when I'm talking about maximum likelihood and all this. You can think of a likelihood as a curve; it sort of looks like a mountain, and we're trying to get to the top of the mountain. That would be the maximum of this thing. And if you think about what a derivative is, for people who haven't thought about derivatives in a long time: in high school, maybe you did something where you drew lines against the mountain, is this parallel, I guess, tangent, thank you, tangent to the mountain, and you're basically trying to find the top. So you want this line to be flat, because the point at the top of the mountain is where the line is flat, and when the tangent line is flat, that's when the derivative is equal to zero. So all we're trying to do is find the top of this mountain, and that's why we're trying to set this thing equal to zero. Mathematically, it works out very nicely that all you have to do is take a partial derivative, but you could imagine that we're just climbing up this mountain and trying to find the top. Okay, go ahead.
So these weighting methods say, okay, well, instead of trying to find the top of this mountain, the maximum of the likelihood, the spot where the tangent line has a slope of zero, what if we found values of the parameters that made the covariates balance? So instead of asking where the derivative of the likelihood is zero, you ask where the difference in means in the weighted sample is equal to zero. And that is just a different way of estimating the weights. What you get is still the model that you chose, logistic regression, probit regression, whatever regression model you're using, but instead of maximizing the likelihood, you are directly achieving covariate balance. Now, the quirk of the overlap weights is that the point that maximizes the likelihood happens to be the point that also gives you a weighted mean difference equal to zero. But that's only true for the overlap weights and only for logistic regression.
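To make that concrete: for logistic regression with an intercept, the maximum likelihood estimate $\hat\beta$ solves the score equations, and rearranging them is exactly the overlap-weight balance condition, which is why the coincidence holds there and nowhere else:

$$
\sum_{i=1}^{n} \bigl\{A_i - e_{\hat\beta}(X_i)\bigr\}\, X_i = 0
\;\;\Longleftrightarrow\;\;
\sum_{i:\,A_i = 1} \bigl\{1 - e_{\hat\beta}(X_i)\bigr\}\, X_i \;=\; \sum_{i:\,A_i = 0} e_{\hat\beta}(X_i)\, X_i ,
$$

i.e., the covariate sums weighted by the overlap weights ($1 - e$ for treated, $e$ for control) are equal; with an intercept in the model, the total weights in the two groups are also equal, so the weighted means match. A balance-based estimator instead replaces the score equations with balance conditions for whatever estimand you want, for example for the ATE:

$$
\sum_{i=1}^{n} \left\{ \frac{A_i}{e_{\beta}(X_i)} - \frac{1 - A_i}{1 - e_{\beta}(X_i)} \right\} X_i = 0 .
$$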
Right, and there's this other framework, which you mentioned earlier, called the covariate balancing propensity score. It's kind of a generic-sounding name, but it does what it says on the box. It basically says: instead of maximizing the likelihood, and instead of it being just for logistic regression or just for the average treatment effect in the overlap, for any estimand, any link function, any type of model for our treatment, can we choose the parameters of that model that yield covariate balance exactly? So it's just a different way of estimating the parameters of the logistic regression model, or whatever regression model you're using, that yields balance. And the kind of miracle of the overlap weights is that they are a covariate balancing propensity score; it's just a coincidence that they happen to coincide. But we don't need to just hope that our estimator yields covariate balance; we can use the covariate balancing propensity score to target it exactly. Really cool method. WeightIt was not the first package to implement it; it was implemented by the authors, the developers of the method, in particular Kosuke Imai at Harvard. And I decided to rewrite it for WeightIt and add a bunch of new options, including the ability to target any estimand, the ability to target the ATO even when you don't have, or don't want to use, logistic regression. So yeah, it's a great method, and then there are a lot of other related methods that are variations on that. Yeah, that's fascinating.
So, sometimes when I'm teaching the overlap weights, I have my students make these little love plots, which are basically just looking at the standardized mean differences for all the confounders, and oh my goodness, it's all zero, because it's been set up like that. But then I've actually simulated the data such that it's perfectly balanced on the mean but is very imbalanced in other parts of the distribution of a particular covariate. And so by looking at just the first moment, we're missing some information about the balance there. So I'm just curious: with these covariate balancing propensity scores, are there ways to target more than just the mean difference to define balance?
Fantastic question. There are a few ways. It's so important, because people often think about the means, but really the whole point is to balance the entire covariate distribution. So one way of doing that is to just add terms to the model: if you add a squared term for a covariate, you are balancing that square, which is like balancing the variance of that variable. That's great, but you still have to know which specific terms to balance, and that is imposing a certain knowledge about the structure of the imbalance that you might not have. So there are alternative methods of estimating weights that target balance on the distribution of covariates in a much broader sense, rather than just on the means of the specific functions of the covariates that you include in the model.
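A minimal sketch of that "add terms to the model" approach, again with the hypothetical data d; the cobalt arguments for requesting Kolmogorov-Smirnov statistics alongside mean differences are my recollection and should be treated as an assumption:

```r
library(WeightIt)
library(cobalt)

# Ask the balancing method to also balance the square of x1
# (roughly, its variance), not just its mean
W_sq <- weightit(A ~ x1 + I(x1^2) + x2, data = d,
                 method = "cbps", estimand = "ATE")

# Check balance beyond the first moment: standardized mean differences
# plus KS statistics comparing the full distributions
bal.tab(W_sq, stats = c("mean.diffs", "ks.statistics"))
```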
These are some of my absolute favorite methods, so I'm so glad you asked about them. The one that I'm really excited about, which is implemented in MatchIt, or sorry, in WeightIt, is called energy balancing, which was developed by Huling and Mak. A really cool method that sounds fancy because it has energy in the name, as if it's so science-y. Basically, the idea is that when we describe balance between our covariate distributions, we can summarize that using the difference in means between each group on a given covariate. But there is this multivariate, scalar, multivariate just means all the variables, and scalar means it's a single number, measure of covariate balance, which is called the energy distance. Why it's called that, don't even worry about. But just think of it as this one number that summarizes covariate balance across all features of the covariate distribution, not just the means and not just the individual marginal distributions of the covariates, but their joint distribution as well.
And what we can do is estimate weights that minimize the energy distance between the treated and control group in the weighted sample.
And of course, that is available in WeightIt. It's this amazing method because it allows you not to have to choose specific functions of the covariates to get these exactly equal means; it almost does it for you. It figures out the weights that make the treated and control groups most similar to each other while retaining the target population that you want to target. So I love energy balancing; I wish it would become more mainstream. It has some limitations. It requires an N by N matrix, where N is your sample size, in order to work, which can be huge when you have a big sample size, so it doesn't really work very well for large data sets. And then for estimating uncertainty in your effect estimate using energy balancing, there aren't really good ways to do it except bootstrapping, which requires you to fit these weights many, many times, and that can be computationally intensive. But as a direction for future research, and as this kind of nice promise for what is possible if we could increase our computing power, I think it's a really good method that deserves more attention.
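For what it's worth, the call itself is short; a sketch assuming the same hypothetical data d, with method = "energy" being the name I believe WeightIt uses for these weights:

```r
library(WeightIt)
library(cobalt)

# Energy balancing weights: minimize the energy distance between the
# weighted treated and control covariate distributions
W_energy <- weightit(A ~ x1 + x2, data = d,
                     method = "energy", estimand = "ATE")

summary(W_energy)   # weight ranges and effective sample sizes
bal.tab(W_energy)   # balance across the covariates
```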
Now, do you think that there's some utility in having this energy distance on its own? I think you've talked to me about this before; it's ringing a bell, but I had not thought about this single-number summary before, definitely not on the podcast. So even if I have a sample that's way too big for me to consider these energy weights, do you think I could use this energy score, or whatever it's called, as a way to describe how well my choice of weights did? Could this be one way to establish that balance was met, using whatever method I implemented? Great question. I think it absolutely could be. Just to calculate the energy distance between your two groups does require this n by n matrix, so even if your sample is too big to estimate the energy balancing weights, it might also be too big just to calculate the energy distance at all. That's one problem. Another problem is that it has this interpretation where, if the energy distance is exactly equal to zero, then your groups are identical, but beyond zero there's no kind of range of acceptable or common values. So you can't just look at a single energy distance and say, oh, yep, it's good. What you can do is use it to compare two weighting specifications.
And that is something you can do in WeightIt. So if you're thinking about machine learning, machine learning models often have what's called a tuning parameter, which is a kind of feature that controls some aspect of the model, which could be how much it is regularized or how many trees are used if it's a tree-type model. And one way to choose these tuning parameters is by using cross-validation accuracy, so how well the model predicts on an outside sample given each value of the parameter. But another way is to ask which parameter yields the best balance, and a good way to measure that balance is using the energy distance. So yeah, it can be used as a relative measure of balance, but not really an absolute measure of balance.
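A sketch of using the energy distance as that kind of relative yardstick, reusing the W_ml and W_cbps objects from the earlier sketch, and assuming a recent version of cobalt that provides bal.init()/bal.compute() and accepts "energy.dist" as a statistic (treat the exact arguments as assumptions):

```r
library(cobalt)

covs <- d[, c("x1", "x2")]

# Set up the balance statistic once, then evaluate it for each weighting spec
# (depending on the cobalt version, extra arguments such as the target
# estimand may be needed here)
init <- bal.init(covs, treat = d$A, stat = "energy.dist")

bal.compute(init, weights = W_ml$weights)    # gbm weights
bal.compute(init, weights = W_cbps$weights)  # CBPS weights
# smaller energy distance = better overall multivariate balance
```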
There are some possibilities, though. One thing that hasn't exactly been described in the literature, but is in one of my more viral Cross Validated answers, Cross Validated being the Stack Overflow-style statistics help website we mentioned earlier, is to do a randomization or permutation distribution of the energy distance in a hypothetical randomized sample and see where your weighted energy distance falls in that distribution. Oh, cool. Yeah. So it's definitely a great and promising measure, and I think the limitations are just computing power and the interpretability of the number itself. I think one of the things that's really interesting about balance is, I mean, I think it's a very
nice way to assess how well your weights have brought the distributions of your variables together. But something I think about is that, when we're using propensity scores, there's a lot of focus on whether the covariates are balanced, and that always makes me think about the randomized trial setting, where a lot of the statistics field is moving away from the idea of checking balance in the randomized arms, because the randomization isn't guaranteed to give you perfect balance in a trial. It gives you balance on average across all the trials you could have run. And so something similar is kind of true here, where we don't actually necessarily need perfect balance; we need a process that is expected to generate balance. And we could have perfect balance and still have confounding, or we could have no confounding and have imperfect balance in a given sample.
Yeah, so I was just wondering if you want to sort of talk a little bit about how, you know, from the observational side, those things are reconciled. Like obviously in an observational study we need to do a lot more checks than in a randomized trial to see that we're meeting these assumptions because we're making a lot more assumptions.
But this is one where it's like, we can do this check and the answer might not really tell us what we actually wanted to know. Yes. No, absolutely. I think that even in a randomized trial, it is... still the case that randomization schemes that yield better balance on covariates yield a more precise effect estimate. So even though in terms of bias and identification, as long as you randomize, the simple difference in means or whatever is unbiased.
It doesn't mean that we can't improve our estimate by having a randomization scheme that yields better balance. An example of such a randomization scheme would be a stratified randomization or a kind of matched-pair randomization. And there are also re-randomization schemes, where you re-randomize until you have a sample that is better balanced, and ideally the imprecision induced by the re-randomization is made up for by the improvement in balance. So in randomized trials, there is value in using a scheme that yields better balance, which is different from assessing whether you achieved balance in your individual randomization. How I see it is that these weighting methods that favor balance, or move in the direction of balance, are more like those randomization schemes that guarantee balance, rather than being like checking balance post hoc in your already-randomized trial. Now, of course, the analogy isn't perfect, but that's just how I see it. So when people run a randomized trial and they maybe have some imbalance in their covariates, we say it's okay: on average, the effect is still unbiased, and it's consistent under weak assumptions. In an observational study, if you had covariates that were imbalanced after estimating the weights,
I think it would be completely reasonable for a critic to say, hey, you failed to balance this covariate; you need to estimate weights that actually balance it, because you can't tell the difference between a structural imbalance, where the imbalance is actually a feature of your estimator that needs to be corrected, versus just random chance. And when there is no imbalance left, because you used a weighting method that guarantees it, you just kind of close the door to that critique. I feel like, too, and these are my people, but I feel conflicted about the emphasis. This always happens, in all situations, where we swing one direction and then we swing way too far in the other direction. There have been some statisticians who I really respect who have really hammered home this idea that you don't need to show balance in a randomized trial to trust it, because on average, we trust the process.
But we're often not doing these randomized trials more than once; what you have is that sample that you had. So in some ways, as a practitioner, I don't care that if I repeated this 100 times I would on average be balanced between my groups. What I care about is this actual sample, and if by chance I ended up with some imbalance in a covariate that really matters, which is possible in finite samples even if you perfectly randomize, that actually really could bias my estimate in my actual sample. So I just sometimes feel like we've hammered that message, because I think it started with this idea that we don't need to be putting p-values in table ones, and I believe in that, so I'm not trying to push back against that notion. But I do think we've in some ways swung too far in the opposite direction, of being like, balance doesn't matter because you're going to get balance in the limit. And it's like, but we're never acting in the limit; we're always acting on my one trial. And, you know, I'm on a
study right now where we're trying to look at subgroup analyses in randomized trials, and the method that we're proposing is one that implements some kind of matching within subgroups, so that you have balanced subgroups; you're only picking out subgroups after making sure that they're balanced. And a critique recently was, well, with randomization, not only should you have balance between the treated and controls, but in theory, if you had a large enough sample, everything should be balanced. True, yes, in theory, if I had an infinite sample, every subgroup should also be balanced. But I have a sample of 100, and my subgroup of 10 is wildly imbalanced. And so I think this idea of what happens in the limit versus what happens in a finite sample, balance is a nice way to help us trust what we're seeing in a finite sample, I guess. Yes, I completely agree. And I think it really is about trust, which is to say that it's not even necessarily a statistical
advantage, but an epistemic advantage. And that's one reason I like matching and weighting methods, as opposed to methods like doubly robust methods that focus on modeling the outcome or fitting these outcome models: with some of those methods, you just have to trust that the method is working right, and the only evidence you have for that trust is the theoretical property of the estimator in the limit.
And what I like about matching and weighting is that you can assess how well the method did in your sample. And I think that that is good for science communication. It's good for protecting yourself against critique. And I think that that has evidentiary value.
that is distinct from the statistical property of the estimator, because science is a social endeavor, and it has to do with the beliefs of individual people. Do people trust your results, and what can you do to make them trust your results? Using methods that are, hypothetically, good in the limit is one way to do that, but it's effective for people who believe in those methods, whereas methods that allow you to show exactly what's going on in your sample, I find, are appealing to a lot of people who are concerned with the individual-sample, the small-sample, properties of your estimate. So that is really what drew me to these methods in the first place. I have nothing against the other type of method, but people who love these asymptotically consistent methods, who are focused on achieving this type of consistency in the limit, are sometimes a little dismissive of these sample balancing methods, because the sample balancing methods don't have the same asymptotic properties. But again, when you're trying to demonstrate to a doctor, a clinician, that this is what you should use and this is the evidence from our study that suggests why you should do it, I think having something that they can understand is really valuable, and that includes a concept that is very interpretable, like balance.
It's funny you're saying that. I'm on this PCORI grant, which is one of these patient-centered ones, and one of our tasks, as part of this subgroup analysis, is that we're trying to get some kind of interpretable figure for patients to be able to engage with, but also clinicians; we have all these different stakeholders. And in the last stakeholder meeting, we were showing this image where we have sort of a tree to demonstrate the subgroups, then some kind of plot that shows the magnitude of effect within each of these subgroups, and then these balance plots to demonstrate that the groups were actually balanced within each subgroup. And now that we're talking it through, it's exactly what you just described. The patients looked at this and were like, we don't care about any of this; just tell me what the subgroups are. I don't need a picture of, like, the balance; basically, get rid of all this math stuff. And actually even the statisticians were kind of like, we're not sure we need to see the balance. But the clinicians were the most focused on the balance within those groups, because they've been burned before, where someone says this is a subgroup that matters and really it's not a subgroup that matters, and it helped them trust the process more to see that. And so I think perhaps part of the answer is trying to target our communications and how we're describing these things to different audiences.
I think that's also really relevant to an issue you brought up earlier, which is the choice between matching and weighting. You mentioned one reason you like matching is that clinicians understand it better; it just makes more sense. And weighting is a little more abstract, it certainly is. But the power of these visual balance diagnostics is that you can show that the weighted distributions
are similar after weighting. I think that's a really powerful image. It shows people that the weights aren't just some abstract quantity doing something that is hopefully good; you can see that the distributions are now more similar to each other. And so I think the graphical tools that you've designed,
and the ones that are available in my packages, hopefully help people understand that. I've found, in my experience of trying to explain these more advanced methods to people, that the derivative-of-the-log-likelihood stuff that fascinates me doesn't help them understand the method, but seeing the result of the method in terms of balance really does help them get excited about it and want to adopt it for themselves.
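For listeners who want to see what that looks like in practice, here is a minimal sketch of the kind of balance diagnostic being described, using WeightIt and cobalt; the data frame `dat`, the treatment `treat`, and the covariates are purely hypothetical placeholders.

```r
library(WeightIt)
library(cobalt)

# Hypothetical data: 'treat' is a binary treatment; 'age', 'severity', and
# 'comorbid' stand in for measured confounders in a data frame 'dat'
w.out <- weightit(treat ~ age + severity + comorbid,
                  data = dat, method = "glm", estimand = "ATE")

# Tabular balance check: standardized mean differences before and after weighting
bal.tab(w.out, stats = "mean.diffs", thresholds = c(m = .1))

# The same information as a "love plot", one point per covariate
love.plot(w.out, stats = "mean.diffs", abs = TRUE, thresholds = c(m = .1))
```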
I totally agree. My favorite plots for showing these are the mirrored histograms, where you plot the two groups against each other and then weight them. I think those show a single measure really nicely; often I'll use the propensity score, but you could use something else
for that. And I think the love plots do a great job showing the standardized mean differences. The piece, and maybe you have an idea for this, but I don't want to take up too much more time, so maybe we'll wrap up shortly after this, the piece I'm still trying to think about how best to visualize: right now I use empirical CDF, cumulative distribution function, plots for looking at the distribution of individual variables, and I'll look at weighted
and unweighted ECDF plots for continuous variables to show balance. And I just find that with that picture, people's eyes glaze right over when I show them, look, this is the full distribution of this continuous covariate, look how close these lines are, or look, they cross, or look, there's some space between them. I've not yet found a good way to show the full distribution of a continuous variable. That's one I'm still trying to achieve.
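As a rough companion sketch of the two kinds of pictures being compared here, cobalt's bal.plot() can draw both; the variable names are hypothetical, this continues the w.out object from the sketch above, and the argument names are as I recall them from recent cobalt versions.

```r
library(cobalt)

# Mirrored histogram of the propensity score, before vs. after weighting
bal.plot(w.out, var.name = "prop.score", which = "both",
         type = "histogram", mirror = TRUE)

# Empirical CDFs of a continuous covariate, unweighted vs. weighted
bal.plot(w.out, var.name = "age", which = "both", type = "ecdf")
```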
Yeah, I think ECDF plots are really hard. They're really abstract, because it's a plot of an integral, and people don't really understand integrals, and that's fair enough. I don't think it's even worth explaining here, but it's a quantity that's abstracted away from the distribution itself. So a plot that I really like is very similar to a mirrored histogram, but it's a kernel density plot.
And that also sounds way fancier than it actually is; it's basically a smooth histogram. With a continuous variable, not everybody falls exactly into a bin, so if you actually had all the unique values on the x-axis, most of the bins would be empty. A kernel density gives you the smooth version of that, and it allows you to see what the distribution looks like
as it is, and you can see how similar two distributions are. I think a mirrored version of that is great too, but I actually prefer the overlapping version, where both kernel density plots sit on top of each other, because that way you can really see where they differ. And there's a statistical measure that quantifies the difference between two kernel density estimates, which is called the overlapping statistic.
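To make the overlapping-statistic idea concrete, here is a hand-rolled sketch, not taken from any particular package, that estimates a weighted kernel density in each treatment group on a common grid and computes the area the two densities share; all object names are hypothetical.

```r
# x: a continuous covariate; treat: 0/1 treatment; w: balancing weights
overlap_coef <- function(x, treat, w) {
  grid <- seq(min(x), max(x), length.out = 512)
  # Weighted kernel density in each group; weights must sum to 1 within group
  d1 <- density(x[treat == 1], weights = w[treat == 1] / sum(w[treat == 1]),
                from = min(grid), to = max(grid), n = length(grid))
  d0 <- density(x[treat == 0], weights = w[treat == 0] / sum(w[treat == 0]),
                from = min(grid), to = max(grid), n = length(grid))
  # Area under the pointwise minimum of the two densities (1 = perfect overlap)
  sum(pmin(d1$y, d0$y)) * diff(grid)[1]
}

# e.g., overlap_coef(dat$age, dat$treat, w.out$weights)
```

Note that each group's density uses its own default bandwidth, which is exactly the tuning-parameter issue that comes up next.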
And that, I think, can be really interpretable. It's not a mainstream balance statistic yet, but it could be. That's something I really like. One problem with kernel density plots and the overlapping statistic is that they require tuning a parameter. Right, you just dictate how much smoothing there is; that's the part I don't like about it. I know. But the same is true for a histogram, because you have to set the number of bins. But I think...
you can make two kernel densities look identical if you just smooth them a lot. That is true. That is true. So it's not a pure description of the data; there's a little bit of imputation going on there too. But I think generally it's a good approximate tool for getting a clue about what's going on. And it can be really useful for seeing extreme observations, or features of the distribution, like maybe the variances are
very different, and that's obvious in the kernel density plots because one of them has really fat tails and the other is more narrow, so you know you need to focus on the variance. That's the kind of thing these plots let you show. Yeah, okay, I'm going to keep that in mind. Great. Are we out of time, or should I talk about M-estimation? I want to talk about it, yeah, let's talk about M-estimation, I feel like...
Yeah, love it. It's something that, you know, I didn't actually learn about in my training in grad school, but I picked it up on my own later. M-estimation, which sounds kind of fancy, is a way of performing inference, so estimating a standard error, a p-value, or a confidence interval, in particular when you have a multi-step process
for arriving at your effect estimate, and that is true for propensity score weighting: in the first step you estimate the weights, and in the second step you estimate the weighted treatment effect. What you ideally want to do is account for the uncertainty in estimating the weights when estimating the uncertainty of the treatment effect estimate. M-estimation is just one way to do that. It's a nice analytical solution that allows you to get this number,
the standard error of the treatment effect estimate, in a clean way. There are other ways of doing it, one of which I mentioned previously, called bootstrapping, where you take a sample from your sample, perform your analysis in that new sample, and do that many, many times, and the empirical distribution of your estimates at the end of all those bootstraps
can be used to quantify the uncertainty in your effect estimate. But that requires a lot of computational power, and it's subject to the random process of resampling. M-estimation is just this nice analytical solution. Up until recently, though, the literature has kind of required you to program it manually; there aren't that many tools for doing it. There was this package developed at UNC,
which was a kind of general-purpose API to M-estimation, which was great and I think a good step toward making it more accessible. But you still had to program your own estimator, the formula to get the quantity that you want. A feature that I recently added to WeightIt is support for automatic M-estimation when you estimate weights and then estimate treatment effects in the weighted sample using WeightIt.
It allows you to automatically account for this uncertainty in estimating the weights. And this has got me really excited, because it had never been implemented in software before, in R anyway. Now we can finally get the correct uncertainty estimate, whereas previously we were using approximations. Some of the approximations are conservative, which is okay, you're not creating a huge problem, but you're unnecessarily giving up precision
just to have that conservative estimate. And in some cases it's not even conservative, so you really want to get this right. I am proud of myself for figuring this out, because this just goes to show that after grad school your training isn't done; you can still learn a lot more. I feel like I've learned so much more since grad school, because grad school set me up to learn, and when I learned M-estimation,
I felt so empowered. I realized how so many things are connected, and these big puzzle pieces that were really obscure to me all fit together in this nice, clean way. So M-estimation works for basically any kind of parametric weighting method and outcome model, and that includes logistic regression propensity scores, but it also includes the covariate balancing propensity score that we talked about earlier.
And it also includes some other weighting methods that we didn't talk about, like entropy balancing and inverse probability tilting, which are kind of cousins. So yeah, M-estimation is this really cool method that has excited me, and my goal is to keep expanding the M-estimation capabilities and continue to make it more accessible to users.
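As a sketch of what this looks like in code, assuming a recent version of WeightIt that provides glm_weightit(), and again with hypothetical data and variable names:

```r
library(WeightIt)

# Step 1: estimate propensity score weights (logistic regression, ATE)
w.out <- weightit(treat ~ age + severity + comorbid,
                  data = dat, method = "glm", estimand = "ATE")

# Step 2: fit the weighted outcome model ('outcome' assumed continuous here);
# by default the standard errors come from M-estimation, so they account
# for the fact that the weights were themselves estimated
fit <- glm_weightit(outcome ~ treat, data = dat, weightit = w.out)
summary(fit)
```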
Yeah, that's awesome. Our program had the three-paper dissertation format, where you do three papers and tie them together with a little intro and conclusion, and my second paper
was on M-estimation: deriving the variance for the overlap weights. In the original paper they had written that this was trivial; they sort of said, you know, you can get standard errors, truly. And I was like, oh, great. And then I spent a whole paper figuring out how to actually do it.
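For reference, the overlap (ATO) weights being described here are usually written, for treatment $A_i$ and propensity score $e(X_i)$, as

$$
w_i^{\text{ATO}} = A_i \bigl(1 - e(X_i)\bigr) + (1 - A_i)\, e(X_i),
$$

in contrast to the familiar inverse probability weights for the ATE,

$$
w_i^{\text{ATE}} = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}.
$$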
And so I had coded it up for that, and one of the pieces I'm so impressed with in WeightIt is this: I had hard-coded everything. Part of the process is that you need to take derivatives. You basically have all these different estimating equations across the different pieces, the outcome model and the weighting model, and then you need to take derivatives of all of those pieces, and you get these nice results.
It's analogous to a sandwich estimator, for people who are familiar with that: even for just an outcome model, you can get a sandwich, or robust, confidence interval, which, when the weights are propensity score weights, tends to overestimate the variance. Anyway, it's analogous to that, but you're doing it across several different models instead of just one.
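In symbols, the setup being described is the standard M-estimation sandwich; this is textbook notation, not anything specific to this conversation or to WeightIt. Stack the estimating functions for the propensity score model and the weighted outcome model into one vector $\psi$ with joint parameter $\theta$, solve

$$
\sum_{i=1}^{n} \psi(Z_i; \hat\theta) = 0,
$$

and the joint variance estimate is

$$
\widehat{\operatorname{Var}}(\hat\theta) = \frac{1}{n}\, \hat A^{-1} \hat B \bigl(\hat A^{-1}\bigr)^{\top},
\qquad
\hat A = \frac{1}{n}\sum_{i=1}^{n} -\left.\frac{\partial \psi(Z_i;\theta)}{\partial \theta^{\top}}\right|_{\theta = \hat\theta},
\qquad
\hat B = \frac{1}{n}\sum_{i=1}^{n} \psi(Z_i;\hat\theta)\,\psi(Z_i;\hat\theta)^{\top}.
$$

The off-diagonal blocks of $\hat A$ are the cross-derivatives between the models, which is how the uncertainty in the weights propagates into the standard error of the treatment effect.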
And so the piece I had struggled with, in trying to generalize this, was: do I need to hard-code a derivative for every kind of scenario? Is that what you do? Or do you do a numerical derivative? Is anyone else in the weeds on this besides me? Well, I care a lot about it, so I'm happy to talk about it. Because I don't have a lot of training in calculus, I didn't major in math, I majored in psychology, I really relied on numeric
differentiation, which is a way of computing this calculus quantity, the derivative, that we mentioned earlier and that is a part of M-estimation. There is an analytic way of doing it, which requires you to derive a
formula for every individual case, or you can do it in a numeric way, which is an approximation, and I was relying on the numeric way. What's cool about that is that you don't need to be able to compute this formula, this derivative, yourself for every possible scenario, and that makes it super generalizable. So in particular, the derivative of the outcome model's estimating equations with respect to the propensity score parameters, that is hard analytically.
But doing it numerically is really easy. So initially I did it all numerically. Recently I have learned the calculus, which has been an adventure in itself and also really empowering. And so, in the cases where it's possible, I've tried to hard-code the calculus. Although there is a hard-coding element, it's still modular.
And this is because some features of derivatives allow them to be modular. In particular, there's this rule called the chain rule, which says the derivative of a composite function is the product of certain derivatives.
That allows you to compute these derivatives separately, in little pieces, and then multiply them together. So that's something I've been working on. It just improves the speed and accuracy of those calculations, so you don't have to rely on the numerical result.
Yeah, I'm excited about it because, again, it was one of those empowering moments where I went from not really understanding calculus to figuring it out on my own, writing my own equations, seeing them work and match up with the numerical results, and feeling like, yes, I did this. Yeah.
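Here is a tiny illustration of the kind of check being described, comparing a finite-difference (numeric) derivative to a hand-derived (analytic) one for the logistic function that appears in propensity score models; this is just an illustration, not code from WeightIt.

```r
# Logistic (inverse-logit) function and its analytic derivative
plogis_fun  <- function(x) 1 / (1 + exp(-x))
dplogis_fun <- function(x) plogis_fun(x) * (1 - plogis_fun(x))

# Central finite-difference approximation to the derivative
num_deriv <- function(f, x, h = 1e-5) (f(x + h) - f(x - h)) / (2 * h)

x <- 0.7
c(analytic = dplogis_fun(x), numeric = num_deriv(plogis_fun, x))
# The two should agree to several decimal places
```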
Yeah, that's great, because, and maybe it's just a mental block, but my stumbling block in trying to implement this was, oh, I'm going to have to do so many derivatives; I don't want to take all these derivatives. I just sort of stalled out.
But I'm glad that you did it. Yes. What's nice is that a lot of these things have a similar form to each other, and it's just the specifics that differ, which is great. And then there are cases where the people who maintain R have understood that the derivatives of certain functions are really important to calculate, and the R developers have built those derivatives into R for you automatically,
in a nice, structured, systematic way, so I rely on that a lot as well. Nice. One thing I think about the utility of R for all this: a lot of people are using Python, and Python is great too, so I'm not trying to knock Python, but R was built by statisticians, kind of for statisticians. There are a lot of things that, to me, are just natural
for folks trained in that background. Whereas in Python, it seems like there are lots of things that are natural if you come from more of a computer science background, but not as much is built in on the statistics side. Yeah. It's kind of the job of R package developers to make this statistical programming language accessible, not just to statisticians but to researchers of all kinds who are analyzing data. That, I think, is
not exactly a weakness of R, but it's what distinguishes R from something like Stata or SAS, where the company itself has done the work of making it accessible to users, not just statisticians. In R, it's up to package authors to make it accessible. Fortunately, I am paid to do exactly that work, which is my favorite thing to do. It's what I used to do in my free time, and now I get paid to do it.
It's a dream for me, and I really enjoy doing it. It's fulfilling, because I know these methods are now able to be used by people.
So yeah, R is beautiful in some ways and frustrating in other ways, but it's all about this collaborative blend between programmers, statisticians, and applied users that makes it really fun to work with. Yeah, that is great. Maybe this is my last question on the M-estimation piece. I've been curious because, in general, when I was initially learning these
types of methods, what I had learned, and you can correct me if this is wrong, was that if you just used a sandwich estimator that only took into account the fact that you fit the outcome model, the final piece, as opposed to taking into account both the propensity score model and the outcome model, that would be an overestimate of the variance. And I guess my question is: are there scenarios where...
I guess at this point, if it's all coded up, then maybe it's just always worth it to get the right variance. But I'm wondering whether there are scenarios where it's more worth it to go through the process of making sure you're estimating it correctly, using both the
propensity score model and the outcome model? And likewise, are there scenarios where the straightforward sandwich estimator, the one that's easy to compute without any extra machinery, might overestimate it slightly but still be pretty close to the one you would get? So for the ATE, the average treatment effect in the population, the sandwich estimator that treats the weights as fixed and just uses
the outcome model is known to be conservative. To me, that's still a problem. Conservative sounds good, because you're not making one type of error, but it's kind of similar to throwing away data. You're artificially decreasing the precision your data has, just because you didn't want to implement a slightly more correct procedure, and no fault on people, it wasn't available for a long time. Now it is, and so we can fully account for that
and get the estimate that actually reflects the uncertainty. So that's one thing: it's better because you're getting more from your sample, more for your money, and I think that's a more ethical use of data that can be hard or expensive to collect. That's one. Two is that for other estimands, it's not guaranteed to be conservative. Oh, okay. That's important. It is important. And this wasn't understood or known until fairly recently; this was work by some UNC Biostat
researchers, including Michael Hudgens, who showed that for the ATT, the robust, sandwich standard error that treats the weights as fixed can sometimes underestimate the variance, which is making a mistake that in statistics we really don't want to make, claiming more certainty than you have. And sometimes it underestimates and sometimes it overestimates, so you can't just rely on one and hope you get it right.
For the ATT and the ATC, the average treatment effect in the controls, you definitely want to get that standard error correct, because you have the potential to make a much graver mistake, statistically, than in the case of the average treatment effect in the population.
For the ATO, I think it can go either way, and it's maybe more likely to be conservative, so it's more okay not to. But again, you've done all this work, you've collected all this data; shouldn't you try to get your estimate to reflect the uncertainty correctly and make the most of your data? So yeah, I think it really is worth it. It's not just
a curiosity, something for the statisticians to worry about; it has real-world implications. It can be the difference between a significant and a non-significant result. It can be, and that can be true in either direction: you can falsely conclude that you have an effect when you actually have less certainty than you claimed, or you can claim that you have less certainty than you actually do and therefore fail to make use of a real effect in your sample.
Oh, that's good. I'm going to have to look at Michael's paper, because, yes, I'm happy to know that. We're about to get to this section in my causal inference class, so we will be scrapping the advice that the sandwich estimators, the robust standard errors, always overestimate. This is good. Nice. So, just to bring the discussion on M-estimation to a close,
how does it compare to just bootstrapping? In terms of statistical performance, bootstrapping is actually better, which is kind of frustrating, because bootstrapping is this really simple but somehow miraculous method that really gets the uncertainty right. It just works, and simulation studies frequently show that in certain scenarios bootstrapping is the superior method for getting the uncertainty correct in a propensity score analysis.
But it has this problem, which is that you have to re-estimate the weights and re-estimate the outcome model in the weighted sample in every bootstrap iteration, and bootstrapping works best the more bootstrap iterations you have. So it can sometimes be prohibitively computationally expensive to do a bootstrap. That's one problem with it. Another is that there are versions of the bootstrap that can cause your models to be inestimable. The bootstrap involves
sampling with replacement from your original sample. But let's say you have a rare level of some predictor, and in some bootstrap sample that level is just completely absent. You can't estimate a parameter for that level, then you can't estimate the model, and then the whole procedure crashes. And even if you are able to compute the bootstrap estimate in some samples, if you can't compute it in all of them, it could be that the ones
for which you can't estimate it have some feature in common; it's a kind of survivorship bias, or selection bias, in the bootstrap estimates. So you can't just take the ones where it worked and claim that that's your uncertainty. Bootstrapping works in a statistical sense.
But in a practical sense, I think it can be not very good. There are ways to improve it, and I've been working on those. There's a version that I am really partial to, called the fractional weighted bootstrap, which is implemented in my package fwb and also in WeightIt. Instead of sampling with replacement, you draw weights
from a distribution of weights and then estimate the effects in a weighted sample. The way you would do that with propensity score weighting is you multiply the propensity score weights by the bootstrap weights, and you do that many, many times.
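Here is a hand-rolled sketch of that idea in base R, deliberately not using the fwb package's interface, just to show the mechanics; it reuses the hypothetical data and weightit() call from the earlier sketches, and `treat` is assumed to be numeric 0/1.

```r
library(WeightIt)

set.seed(123)
n_boot <- 2000
n <- nrow(dat)
est <- numeric(n_boot)

for (b in seq_len(n_boot)) {
  # Draw fractional bootstrap weights (mean 1) instead of resampling rows
  bw <- rexp(n)
  bw <- bw / mean(bw)

  # Re-estimate the balancing weights with the bootstrap weights applied
  w.b <- weightit(treat ~ age + severity + comorbid, data = dat,
                  method = "glm", estimand = "ATE", s.weights = bw)

  # Weighted outcome model; total weight = bootstrap weight x balancing weight
  # (assumes the returned weights do not already include the s.weights)
  fit.b <- lm(outcome ~ treat, data = dat, weights = bw * w.b$weights)
  est[b] <- coef(fit.b)["treat"]
}

sd(est)                        # bootstrap standard error of the treatment effect
quantile(est, c(.025, .975))   # percentile confidence interval
```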
You still have to go through the process of estimating the weights and estimating your outcome model, you know, 10,000 times, and there can be estimation problems in that, so it's not a panacea, but it can often solve some of the issues that the traditional bootstrap has. But if you want to avoid all that, M-estimation is definitely valuable, because it allows you to skip it. There's no random process in M-estimation, which means that you'll get the same answer every time you do it.
It's not a function of the seed that you happen to set, which is true of bootstrapping, which requires a random process. And it is analytical; it's the derived answer that we appeal to when we want to make a population statement about the uncertainty of our estimators. So yeah, I think M-estimation is great. There are many limitations to it, though. One of them is that you can't do M-estimation for all types of weights.
If you're using machine learning, M-estimation is out the door, unfortunately, and it would be nice if we could develop M-estimation methods for that, but we just can't. There are some other weighting methods that are not machine learning based but are still not amenable to M-estimation.
And unfortunately, with those, our choice is either to use the robust, sandwich standard error that ignores the uncertainty in the weights, which could be either conservative or anti-conservative depending on the scenario, or to bootstrap, which takes forever and can have these estimation problems. So M-estimation is not a perfect solution to all problems, but I think when it's available, it's worth doing, and there are simulation studies showing that
it does approximately as well as bootstrapping. So when it's available, do it; when bootstrapping is feasible, it can be even better to bootstrap, but not everybody can wait all day. Yeah, bootstrapping is definitely computationally intensive. Yes, exactly. Especially if the weighting method itself is computationally intensive, then it's just impossible.
We have to make trade-offs. Sometimes it's getting something done in a given time frame, even if we sacrifice something small; other times it's getting the right answer, even if it means waiting forever or paying for it. So yeah, it's all about trade-offs. Everything is about trade-offs; if there were no trade-offs, there'd just be one option and we'd use it all the time. But there are
a million weighting methods, a million ways to estimate the uncertainty, a million estimators, period, whether you're choosing weighting or matching or doubly robust estimation or whatever it is. So it's all about how you choose to manage those trade-offs, and in particular with an eye toward how your audience
would manage those trade-offs, if possible. Yeah, I think that's great, and I think this is maybe a good place to stop; it brings everything together nicely. So I want to thank you so much, Noah, for joining us today. It's been really great to talk with you.
Thank you so much for having me. It was really an honor to be on this podcast; I'm a big fan of both of you, and I just love getting to talk about my favorite things in the world, which, if you can believe it, are writing R packages and causal inference. I know you two can relate. Thank you again so much for having me on. Excellent. So that's a wrap on season six, episode three. Thank you again to our guest, Noah, and thank you to our listeners for listening. You can
find us on Bluesky at casualinfer.bsky.social. You can find Lucy at LucyStats, and you can find me at EpiEllie. You can find the American Journal of Epidemiology at AmJEpi on Bluesky. And you can always send us email at casualinfer at gmail dot com. If you enjoyed our episode, please remember to leave us a five-star review so that we can have as much selection bias in our results as possible. And we'll talk to you next time.