¶ Introduction to Meta-Analysis Debates
Hello, everyone. Welcome back to the Pedagogy Non-Grata podcast. I'm joined for the second time by Dr. Dylan Wiliam, although I have to say it's been several years since we last spoke. I'm really pleased to have Dr. Dylan Wiliam back on the podcast today because I regard him as one of the leading experts in the world on the topic of education. Whenever he writes about education, he writes with such nuance and depth that I routinely find myself learning.
And I'm routinely finding myself impressed. So I'm really pleased to have him back on the podcast today. He actually reached out to me and asked if I'd be interested in chatting after I wrote an article about our last podcast episode. And full disclosure, the last time he came on this podcast, he gave a very thorough and excellent criticism of both meta-analysis and effect sizes, which, as any longtime follower of this podcast knows, are usually my go-to source for analyzing education. If memory serves me correctly, I was sort of flabbergasted by what he had to say. But actually, what he said has been living rent-free in my mind ever since. And I've reanalyzed it over and over again in my head.
And while I'm not thoroughly convinced by all of his arguments, I still think all of his criticisms are 100% legitimate, even if I don't necessarily agree with the same solution, although I'm not sure I entirely understand his solution to the weaknesses that exist within meta-analysis and effect sizes in general. So I'd love to get a better understanding from Dylan on this topic and to just learn from him in this conversation.
And I'm hoping that we can bring some nuanced discussion to this topic. At the risk of going on and on before we go any further, I will admit that this tends to be the topic where we get the fewest listens on the podcast, or views when I write about it. But I think it's so valuable and important to understand how we identify what is evidence-based in education to begin with. Because if we don't have a level of consensus amongst scholars and teachers about what is evidence-based, or how we determine what is evidence-based, it's really difficult to have a meaningful conversation with each other.
¶ Defining Meta-Analysis and Effect Size
So without further ado, I'm going to ask if Dr. Dylan Wiliam can share with us what a meta-analysis is and what an effect size is, although I'm sure most listeners will know what those are by this point.
Thanks, Nate. Well, a meta-analysis is an attempt to systematically review the strength of evidence in a particular area, and meta-analyses represent a huge improvement over what was done previously. So, for example, in the use of anti-arrhythmic drugs in the treatment of people with myocardial infarction, heart attack, the earliest attempts to work out, you know, does it work, does it not work, were basically just a box-checking exercise. They just looked at all the studies. They looked at, well, this one says it works, this one says it doesn't work. And so they basically tallied up the positives and the negatives. And if there were more positives than negatives, they said, well, the case is strong that these drugs do help.
The trouble with that approach is that it doesn't take into account the strength of the finding one way or the other. And what surprises many medical researchers is that the whole idea of meta-analysis was invented by educationalists, principally Gene Glass, but also his collaborators Mary Smith and Barry McGaw. They said, well, how can we actually draw together different findings? And the trouble is that different research studies into the effects of, say, collaborative or cooperative learning use different measures. Just a very crude example in the US: one study might use impact on SAT scores, one might look at impact on ACT scores. SAT scores range from 200 to 800. ACTs use a much more compacted scale, so you can't just compare the results across those scales. The idea of meta-analysis was, let's compare these studies by expressing all the different effects of different interventions on, for example, student achievement on a common metric. And what we typically do is take the mean score of the group that's had the new method of teaching and the mean score of the group that hasn't, subtract one from the other, and divide by some measure of spread in the population.
Ideally, we'd have a measure of the standard deviation of the entire population. We never have that because we never get the whole population. And so we actually have to estimate that. And that's why there are things like Hedges' g and Cohen's d. They're all different approaches to dealing with the fact that we want to divide by the spread in the population, but we never have the data for the whole population, so we have to estimate it in more or less sophisticated ways.
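To make the calculation just described concrete, here is a minimal Python sketch of a standardized mean difference. It is only an illustration: the scores are invented, the function names `cohens_d` and `hedges_g` are hypothetical helpers, and the small-sample correction uses a common approximation rather than the exact formula.

```python
import math
from statistics import mean, stdev

def cohens_d(treatment, control):
    """Difference in group means divided by the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # The pooled variance is the sample-based estimate of the population spread
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)

def hedges_g(treatment, control):
    """Cohen's d shrunk slightly to offset its upward bias in small samples."""
    df = len(treatment) + len(control) - 2
    correction = 1 - 3 / (4 * df - 1)  # widely used approximation (an assumption here)
    return cohens_d(treatment, control) * correction

# Hypothetical test scores, invented purely for illustration
new_method = [72, 75, 78, 80, 85]
old_method = [70, 72, 74, 76, 78]

d = cohens_d(new_method, old_method)
g = hedges_g(new_method, old_method)
print(f"Cohen's d = {d:.3f}, Hedges' g = {g:.3f}")
```

The two statistics differ only in how they estimate the population spread from a sample, which is why g comes out a little smaller than d when the groups are small.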
I want to acknowledge that meta-analysis is a huge improvement on what was done before, because it's taking into account the strength of the findings. What worries me about meta-analysis is that, strictly speaking, effect sizes are only justifiable if they're unnecessary. That sounds like a very cryptic and somewhat strong statement. What I'm saying is this: if I did three different experiments on different interventions, collaborative learning, detracking, cognitive load theory, and I used the same test
¶ Critiques of Effect Sizes in Education
to measure the impact on achievement in all three studies, I wouldn't need an effect size. I could just use the scores on the test as my measure. And that's why meta-analysis doesn't use effect sizes in medicine. In medicine, with a drug, for example, you would look at five-year survival rates. If you're looking at arthroscopic surgery, you might look at time to discharge from hospital or time to recover a certain range of movement in the limb that's been operated on.
So in medicine, because there's an agreement about what the measures are, meta-analysis doesn't require effect sizes. The trouble in education is there isn't that agreement. So we actually have to come up with a way of combining things that maybe shouldn't be combined. And so there are a number of assumptions that we have to make in education to conclude that the effect sizes mean something.
I'm very happy to acknowledge, Nate, that it could be that my worries aren't actually legitimate. It could be that although these things might be issues, in practice, they are not issues. I'm just saying we don't know that. So we could both be right. You could be right by saying the conclusions of these meta-analyses are correct. And I'm saying, yes, but that's because you got lucky. Because the effect sizes just happened to actually match the reality.
So my concern is that we often might get it right for the wrong reasons, but in terms of really understanding what we're doing, I think we need more insight into when effect sizes can be compared and when they can't. And that's my concern with meta-analyses in general: they often don't look at the quality of the studies, they don't look at the kinds of measures that were used, and they don't look at whether those things are really comparable.
Yeah, no, those are all really good points. And truthfully, I agree with most of them. Although, on your point at the beginning about what we did previously, we did have literature reviews before meta-analysis. And you used this term the first time I talked to you about it, this idea of the sage on the stage, what we did before.
And my concern is that in education, we've been so reliant on a celebrity culture. You know, we have this really wise person who goes out, they read all the literature, and they tell us what science says. And I have a lot of problems with that, to be honest. For one, I don't think it's a particularly transparent method. I agree. For another, I think oftentimes the people who claim to be experts have actually just been really good at marketing and are not necessarily experts. And I think it's really easy to put a lot of bias in that system. So I like meta-analysis better, because I think it is slightly less biased and slightly more transparent. But I do agree with all of those criticisms. So maybe we could unpack some of those criticisms. And I will just say before I ask you that my hypothesis would be that meta-analysis is a deeply flawed methodology, but it is the least flawed of the possible choices we have.
Least worst. No, absolutely. And in fact, I think that you're absolutely right. And I've written about this.
The problem with the old narrative approach to research synthesis is that, yeah, somebody kind of wraps their head in a cold towel and kind of absorbs the research and writes a synthesis. You can't update that. The whole process has to be done again. The great thing about meta-analysis is that as new studies are conducted, you can feed the new studies into the meta-analysis and update. So the idea of meta-analysis...
¶ Publication Bias and Research Gaps
provides for a constant updating. So I think I probably agree with you that meta-analysis is the least worst way of synthesizing research results. What I'm saying is that currently, I think the faith we put in meta-analyses is too great, given that most meta-analyses don't avoid the problems that are avoidable and don't discuss the problems that are unavoidable.
Yeah. Well, in my experience, I think people are doing a bad job of identifying the problems in meta-analysis. I think you actually do the best job of pointing out the problems. And full disclosure on bias: I am trying to get four meta-analyses through peer review right now. And I'm finding it a terribly long process, if I'm being honest. But what I find my reviewers are really focused on is the literature review of the previous studies that have tried to synthesize the data, whereas I actually think what matters is the statistical methods used and the types of comparisons being made. But I find it's very rare that reviewers make comments about the types of comparisons being made. They're much more focused on the language being used to describe either the previous research or the results. What do you see as being the biggest weaknesses of meta-analysis? Where do we see meta-analysis authors making errors, so to speak?
Well, I think it's changed. Previously, I think publication bias was a massive source of error in meta-analysis. And by error, I don't mean it in the mistake sense. I mean just unexplained variation. I think people just treated all the studies that were published as being a random selection of all the studies that might have been published, and we now know that that's not true. It's much easier to get significant results published. One result from health education is that it's 12 times easier to get a result published if the result is significant. And so now I think we routinely see meta-analyses including checks for publication bias through the use of a funnel plot. I think it's quite rare now to see meta-analyses that do not include some attempt to control for publication bias. Are the biggest effects being found in the studies with a smaller sample size, for example? That's a big clue that there's a fluke effect there. And so we can detect those things through funnel plots. We can then also correct for them by imputing the values that would have been obtained if publication bias wasn't a thing. So we're getting a lot better.
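The small-study pattern described here can be reproduced with a toy simulation. This is a sketch under loud assumptions: the true effect (0.2), the sample sizes, and the rule that only positive, significant results get published are all invented for illustration, and it is not the trim-and-fill style correction mentioned above, only a demonstration of why the funnel becomes lopsided.

```python
import math
import random
from statistics import mean

def simulate_study(true_effect, n_per_group, rng):
    """Return (observed effect size, standard error) for one two-group study.

    Outcomes are drawn from unit-variance normal distributions, so the raw
    mean difference is already on a standardized scale.
    """
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    effect = mean(treated) - mean(control)
    se = math.sqrt(2.0 / n_per_group)  # standard error with known unit variance
    return effect, se

rng = random.Random(0)   # fixed seed so the run is repeatable
TRUE_EFFECT = 0.2        # assumed true effect, identical in every study
studies = [
    simulate_study(TRUE_EFFECT, rng.choice([10, 20, 40, 80, 160]), rng)
    for _ in range(2000)
]

# Crude publication filter (an assumption): only positive, significant results appear
published = [(es, se) for es, se in studies if es / se > 1.96]

mean_all = mean(es for es, _ in studies)
mean_published = mean(es for es, _ in published)

# Funnel-plot asymmetry in miniature: compare imprecise and precise published studies
small_studies = [es for es, se in published if se > 0.3]
large_studies = [es for es, se in published if se <= 0.3]

print(f"all studies: {mean_all:.2f}, published only: {mean_published:.2f}")
print(f"published small: {mean(small_studies):.2f}, published large: {mean(large_studies):.2f}")
```

In a funnel plot these same quantities are drawn as effect size against standard error; the gap between the small and large published studies is what shows up as asymmetry.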
So I think publication bias is better. One moderator of effect that we're still often lacking is the age of the students. We tend to get larger effect sizes with younger children because, when we do the effect size calculation, the denominator of the effect size calculation, the standard deviation of the population,
¶ The Impact of Assessment Sensitivity
is smaller with younger students, and therefore you get a greater effect size. So my first question when I look at a meta-analysis is, did they do an analysis of whether the larger effect sizes were found for younger children? That's one of the first things I look for, and it's becoming more common as well. The one I think we haven't got a handle on is the sensitivity of the assessment to the effects of instruction. And this is hard to deal with because it's usually not reported. It's conceptually hard to get one's head around. But the idea is that some assessments are very sensitive to what teachers change, and some assessments are relatively insensitive to what teachers change. And I think this isn't appreciated because people think that assessments are just assessments. But there's a kind of quirk here. You see, we want our assessments to be reliable.
And the best way to make our assessments more reliable is to stretch the student scores out over the whole mark range from zero to 100. So it turns out that if you have an item in a test that all seventh graders answer incorrectly at the beginning of seventh grade, and every seventh grader answers correctly at the end of seventh grade, that's clearly measuring something that teachers who teach seventh graders are teaching effectively. None of the kids could do it at the beginning of seventh grade.
All the kids can do it at the end of seventh grade. That item will never get included in a test for seventh graders because it doesn't discriminate at the beginning of the year and it doesn't discriminate at the end of the year. You will improve the reliability of your test by deleting that item and replacing it with an item that 50% of the kids get right at the beginning of the year and 50% of the kids get right at the end of the year. It's not measuring something that you're changing, but it is making the assessment more reliable. So the process of test construction often removes those very items that are measuring what it is that teachers change. And that's why most standardized assessments are relatively insensitive to the effects of instruction.
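The trade-off between the two items can be put in numbers with a small Python sketch. The two items and their proportions are the ones from the example above; treating item variance as the measure of how well an item separates students, and the pre-to-post gain as a crude index of sensitivity to instruction, are simplifying assumptions made here for illustration.

```python
def item_variance(p_correct):
    """Variance of a right/wrong item when a proportion p_correct answers correctly."""
    return p_correct * (1 - p_correct)

# (proportion correct at start of year, proportion correct at end of year)
items = {
    "taught_item": (0.0, 1.0),  # nobody can do it at the start, everybody can at the end
    "stable_item": (0.5, 0.5),  # half the students get it right at both points
}

results = {}
for name, (p_start, p_end) in items.items():
    results[name] = {
        "spread_start": item_variance(p_start),  # how much the item separates students
        "spread_end": item_variance(p_end),
        "gain": p_end - p_start,                 # crude index of instructional sensitivity
    }
    print(name, results[name])
```

An item with zero variance at both time points cannot help rank students, so it adds nothing to reliability, even though its gain over the year is as large as it could possibly be.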
And that's my worry, that we are actually valorizing the assessments that are more like IQ tests rather than measuring the things that effective teachers teach well.
Yeah, I completely agree with you. I think that is something that I hope all meta-analyses do in the future. And in fact, in the four meta-analyses I have submitted for peer review, we have tried to control for that very factor that you're discussing.
And the thing that really tipped me off is I noticed when looking at comprehension studies that when the researchers used proximal assessments or more sensitive assessments, that the results were four times higher. And I thought, well, if we get four times higher effect sizes on one type of assessment, we cannot compare that to a distal assessment or a standardized assessment and say that those effect sizes are comparable. And I would 100% agree.
But that's not... And in fact, I kind of hope that all future meta-analyses do control for this factor, because I think it's been a problem that we haven't controlled for this in years past.
Absolutely. In fact, your result of four times is not extreme. Maria Ruiz Primo and her colleagues, when they were looking at the effects of feedback in science education, found that proximal assessments produced six times, sorry, one sixth the effect of close assessments. So the closer the assessment was to the actual intervention, the bigger the effect: the effect size was six times greater than if you used a more, uh, proximal, kind of nearby assessment. So there's no doubt that the sensitivity of the test to what it is you're changing makes a big difference to the effect size. And I'm not saying that more sensitive measures are better, because if your assessment just reproduces the intervention, then we're not learning anything. What we want to know is whether students can take what they've learned in one context and apply it in a different context. So some degree of remoteness from the intervention is essential to establishing whether this intervention is worth having at all.
¶ Quality of Studies and Moderator Analysis
The problem is that, you know, the closer the assessment is to the intervention, the greater the effect size, and therefore the more likely you are to conclude that an intervention has a big effect. And I'll go back to what you were saying earlier. You know, this may seem a geeky, in-the-weeds, inside-baseball kind of idea. But unless we have a real handle on how much improvement we will get in student achievement if we do these things, we can't do school reform systematically. We need to work out how much extra achievement we will get if we do this thing, and how much extra time it will take, before we can make any informed decisions. So I would say that far from being a kind of nerdy issue, this is at the heart of effective school improvement, because we need to know how much this will help our students. And without a handle on that, basically, we're blind.
Yeah, I completely agree. Although I'm going to come back to this in a moment, there's another type of comparison that I've seen made in the literature, especially in older meta-analyses. In fact, I recently reviewed all the meta-analyses on ESL instruction, and I found that the majority of ESL instruction meta-analyses had not controlled for whether the study was experimental or a case study, meaning whether or not there was a control group. And, you know, I think there is a place for investigating studies without control groups. However, I feel like that's for hypothesis generation, and comparing an RCT with standardized assessments to a case study where there's no control group and a proximal assessment is not even in the same ballpark as an analysis or comparison. So I don't understand why we used to smush those studies together.
Now, admittedly, I think modern studies aren't doing this for the most part, but I think it is deeply problematic that some of those older meta-analyses have that practice.
Absolutely. And you use the term control, and I think that's a kind of statistical term, where we talk about using it as a covariate, for example. That's one way of doing it. I actually prefer that they just do moderator analyses. In other words, they'd report, you know, what's the average effect size for randomized controlled trials? What's the effect size for correlational studies? And let the readers themselves decide whether this result is trustworthy. Often you find that the effects aren't that different for different kinds of research studies. Sometimes you find they're very different. So I think there are two approaches. One is to control for these things through something like meta-regression. The other approach is to say, let's just report what we found and allow the reader to draw their own conclusions about whether these things make a difference. But you're absolutely right, there has been in the past, and sometimes there still is, a kind of lack of critical study of the kind of research that was done: whether the experimental group was reliably different from the control group, you know, does that comparison make sense? What assumptions do we have to make to conclude that that result has particular meaning?
Yeah, generally speaking, I completely agree with you. And, you know, regression analysis and moderator analysis are something I have a love-hate relationship with, because we can get these really specific comparisons, especially with regression analysis. The problem I have with regression analysis, and admittedly, all of my papers have included one, is that we end up with this really small pool of studies for each variable. And I'm not entirely convinced that those variables mean anything in isolation because, you know, I have one study over here and one study over there with different variables. And when you look at these really specific comparisons, you end up looking at a really small number of studies.
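A moderator analysis by study design, and the shrinking pool of studies behind each subgroup, can be sketched in a few lines of Python. Every number below is invented for illustration, the design labels are hypothetical, and the pooling uses a simple fixed-effect inverse-variance average as a stand-in for whatever model a real meta-analysis would choose.

```python
import math
from collections import defaultdict

# Hypothetical studies, invented for illustration: (design, effect size, standard error)
studies = [
    ("RCT", 0.15, 0.08),
    ("RCT", 0.22, 0.10),
    ("RCT", 0.10, 0.06),
    ("quasi-experimental", 0.35, 0.15),
    ("quasi-experimental", 0.48, 0.20),
    ("no control group", 0.90, 0.30),
]

def pooled_effect(rows):
    """Inverse-variance weighted (fixed-effect) mean and its standard error."""
    weights = [1.0 / se ** 2 for _, se in rows]
    estimate = sum(w * es for w, (es, _) in zip(weights, rows)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))

# Group the studies by the moderator of interest (here, study design)
by_design = defaultdict(list)
for design, es, se in studies:
    by_design[design].append((es, se))

summary = {}
for design, rows in by_design.items():
    estimate, se = pooled_effect(rows)
    summary[design] = (len(rows), estimate, estimate - 1.96 * se, estimate + 1.96 * se)
    k, est, low, high = summary[design]
    print(f"{design}: k={k}, effect={est:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting k and the interval next to each subgroup mean lets the reader see when a striking difference rests on one or two studies.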
¶ Small Samples and Critical Variables
And then when you look at the effects going across the regression analysis, they often look really random. And I'm always surprised at how random they look.
Right. You know, this is the age-old problem with the What Works Clearinghouse. The stricter your standards are, the smaller your sample is. And in a way, it's very artificial. You're saying, if you don't meet this specific threshold that I've decided is going to be imposed, then your results are meaningless. And I'm looking for ways of actually including more of the studies in the analysis, but then saying there's a degree of subjectivity in how we actually look at these results. And I think that's really important in a whole range of fields. There's the issue of publication bias, but there's also the issue of whether it's a multi-site study. So Kvarven and her colleagues looked at key findings in psychology and whether the studies were pre-registered and multi-site.
And when they restricted their analysis to just the pre-registered multi-site studies, the effect sizes they found were one third as great as the results that were actually published in the meta-analyses. Meta-analyses often lump together lots of findings that probably shouldn't be lumped together. And we need to be more critical in reading the meta-analyses. And the trouble is that a lot of the people doing the meta-analyses don't report these details.
So we can't do any further investigation. We don't know whether the effect sizes were different in multi-site studies versus single-site studies because the researchers didn't think to include that as a moderator variable. So I think we are getting better.
I think we are now including a more comprehensive range of moderator variables in the meta-analyses, but I think we still have a long way to go in terms of getting the right set of moderators. And I would like journal editors to have a kind of checklist of moderator analyses, and if the review doesn't look at all these moderator variables, it's just returned, and they say, we're not going to review this until you've looked at the standard set of moderator variables that we think everybody should be looking at. I think that's the way to improve.
I really like that idea. In fact, I love that idea. I wish the journal had asked me that in their review process. And one of the things that I look for when I read an analysis is, did they include a sophisticated moderator analysis? That's like the number one thing I check. Absolutely. And then the second thing I check is usually the inclusion criteria. And I think you're right. We can either have really strict inclusion criteria or a really detailed moderator analysis. But if both of those things are lacking, it's a gigantic red flag for me.
Absolutely. But I think the moderator analysis checklist is interesting
because it depends on your knowledge of the field. So my favorite example is educational research in Iceland, which is a very obscure field, okay? And so I ask my students when I'm teaching this, would you want a balance of government schools versus private schools? Everybody says yes.
Don't forget, every time you say yes, you've doubled the cost of your experiment. Would you want a mixture of rural and urban schools? Yes, they say. Would you want a mixture of coastal and inland schools?
¶ Reinterpreting Effect Sizes and Power
And most people say by that point, no. But of course, in Iceland, it's one of the crucial variables, because the inland schools are agricultural, making their living from farming, which is very hard going. The coastal schools make their living from fisheries, which are much more affluent. And so it turns out that inland versus coastal, which is a variable that most people would not think to include, and which doesn't make a difference in most countries, turns out to be crucial in Iceland. So your knowledge of what's being researched is essential if you're going to critically scrutinize the list of moderators.
Yeah. I wanted to go back to one variable you were talking about, and that was the multi-site and independent funding. I've seen a lot of people bring up the Kraft paper recently on interpreting effect sizes. He suggested a very small effect size could be statistically significant. But he was specifically talking about independent studies on older students. And, you know, I think that might be being misapplied to all studies. Why do we expect independent studies to be producing lower effect sizes?
Well, I think Matt's right. I think that we've got used to ridiculously large effect sizes. And I lay the blame at Jacob Cohen's door. He's always the person who's quoted when people say an effect size of 0.2 to 0.4 is small or whatever, 0.4 to 0.8 is medium, and over 0.8 is large. What people don't understand, and I hope that after this, nobody will quote Jacob Cohen as a source of guidance on effect sizes, is that in his book on statistical power for the social sciences, he wasn't trying to give a guide to interpreting effect sizes. He was giving a guide to what effect size to use in a power calculation.
So he was saying that if you think the effect you're researching is a small effect, then assume an effect size of 0.2. If you think it's a big effect, assume an effect size of 0.8 in determining how big your experiment needs to be to find a significant effect if your effect is real. So Cohen wasn't talking about how to interpret effect sizes. He was talking about how to assume effect sizes for power calculations. And he was mostly focusing on psychology. And I think that's done a huge amount of damage when we look at it in educational systems, when we're looking at tests like state-mandated tests, where one year's progress is quite large, 1.5 standard deviations, for six-year-olds, but typically around 0.2 or 0.3 for 15-year-olds. So I think Matthew's point in the paper, which I think is very important, is that effect sizes of 0.08, which is what Victoria Sisk and her colleagues discovered for growth mindset interventions, which most people dismiss as trivial.
Well, just think about the fact that most of these growth mindset interventions were done with children over the age of 10. So one year's progress is 0.4 standard deviations at most. So a 0.08 effect size is a 20% increase in the rate of learning, achieved with a one-hour growth mindset intervention. So I think Matthew's paper is important in redressing the balance about what people regard as significant. And 0.08 can be an important effect size. It can be educationally significant. The problem is that you need a very large experiment if 0.08 is going to be a statistically significant effect size.
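Both pieces of arithmetic here can be checked with a short Python sketch. The effect size (0.08) and the yearly growth (0.4) are the figures quoted above; the 5% significance level, the 80% power target, the two-group design, and the normal-approximation formula are assumptions added for illustration, and the helper name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate students needed per arm in a two-group comparison.

    Uses the standard normal-approximation formula and treats every
    student as independent (no clustering within classrooms).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

effect = 0.08         # growth mindset effect size quoted in the conversation
yearly_growth = 0.40  # one year's progress for students over 10, as quoted

relative_gain = effect / yearly_growth  # share of a year's learning
students_needed = n_per_group(effect)   # before any allowance for clustering

print(f"relative gain: {relative_gain:.0%}")
print(f"students per arm: {students_needed}")
```

Under these assumptions the relative gain is the 20% quoted, and the experiment needs roughly 2,450 students per arm even before the clustering of students within classrooms is taken into account.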
A, because it's a very small effect size anyway, but B, because in education, you often have clustering in the data. So the N isn't the number of students in the study. The N is the number of teachers in the study, because different teachers implement the same intervention in different ways. So, you know, to use the students as the N in a significance calculation, you need to assume that all students are independent of each other. If they're in the same classroom, they're not independent of each other. They are more similar than two kids in two different teachers' classrooms. And the real problem is that, for the kinds of effect sizes we can reasonably expect from educational interventions, with the kinds of assumptions we make about real schools, you very quickly find that well over 100 schools are needed for meaningful, educationally significant effect sizes to come out as statistically significant in the results, if those effects are real. And that's the problem people don't really appreciate. The one estimate I've seen is that the average power of an educational experiment is about 0.4, typically what we find in psychology. In other words, there's only a 40% chance that the experiment will yield a significant result, even if the effect they're exploring is real.
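The cost of clustering can be illustrated with the usual design-effect formula. This is a sketch only: the class size of 25, the intraclass correlation of 0.20, and the total of 5,000 students are assumed values that do not come from the conversation, and the function names are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from sampling whole classrooms instead of independent students."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_students, cluster_size, icc):
    """Number of independent students the clustered sample is 'worth'."""
    return n_students / design_effect(cluster_size, icc)

# Assumed values for illustration only
students_per_class = 25
icc = 0.20  # share of score variance that lies between classrooms

deff = design_effect(students_per_class, icc)
n_eff = effective_sample_size(5000, students_per_class, icc)

print(f"design effect: {deff:.1f}")
print(f"5000 clustered students ~ {n_eff:.0f} independent students")
```

With these assumed values the design effect is 5.8, so 5,000 students taught in intact classes carry about as much information as roughly 860 independent students, which is one reason the number of schools needed climbs so quickly.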
¶ Addressing Publication Bias with Pre-registration
So most educational experiments fail because the results are too small. It's like tossing a coin 20 times when it's only, you know, 0.6, 0.4 biased: you're not going to get a significant result because you haven't done enough coin tosses. I think that's one of the problems we have with education research, the clustering of the data.
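The coin-toss analogy can be worked through exactly in Python. The 0.6 bias and the 20 tosses are from the example above; the two-sided 5% significance level, the symmetric rejection region, and the comparison run with 200 tosses are assumptions added for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n tosses of a coin with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_power(n, p_true, alpha=0.05):
    """Power of an exact, symmetric two-sided test of 'the coin is fair'."""
    # Find the widest symmetric rejection region whose size under a fair coin is <= alpha
    cutoff = -1
    for k in range(n // 2):
        tail = sum(binom_pmf(i, n, 0.5) for i in range(k + 1))
        if 2 * tail <= alpha:
            cutoff = k
        else:
            break
    if cutoff < 0:
        return 0.0  # no rejection region fits within alpha
    # Probability that the biased coin lands in that rejection region
    low = sum(binom_pmf(i, n, p_true) for i in range(cutoff + 1))
    high = sum(binom_pmf(i, n, p_true) for i in range(n - cutoff, n + 1))
    return low + high

power_20 = two_sided_power(20, 0.6)
power_200 = two_sided_power(200, 0.6)

print(f"power with 20 tosses:  {power_20:.2f}")
print(f"power with 200 tosses: {power_200:.2f}")
```

Under these assumptions, 20 tosses detect the bias only about 13% of the time, while 200 tosses detect it well over 70% of the time: the coin is just as biased in both cases, and only the size of the experiment has changed.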
And the small size of meaningful effects means that educational experiments have to be very large, much larger than people generally appreciate.
Well, there are two things in there I'd really like to unpack. And I feel like I have a little bit of bias here, in that I've spent a lot of time looking at small sponsored studies with proximal assessments. So when I see a small sponsored study with a proximal assessment, I really want to see a large effect size before I think this means anything.
And truthfully, I don't know that any single study alone means anything. But because, you know, they've sort of stacked the deck in their favor, because we have this small sample and it's sponsored, I kind of automatically assume there's a solid chance that they did four of these experiments and published the one with the best results. And I had some funnel plot analyses to support that hypothesis. Absolutely. It's just...
So when we talk about, say, a study on a school district, on growth mindset, on a standardized assessment, having an effect size of 0.05, I can believe that means something, especially if we have a p-value that's very low. But when I see a study on one class with, you know, a very proximal assessment, and it's sponsored, and I see an effect size of less than 0.20, I'm really unimpressed.
Because the average study I see that has that type of study design usually has a result of 0.4 or higher. So I think we really have to be considering the context. And that's one thing I worry about with the Kraft paper: I see people really taking it to apply to all studies, as if any study showing an effect size of 0.05 or higher means something positive. But as you pointed out at the beginning of this conversation, most studies that get published in peer-reviewed journals actually have positive results. And one moderator analysis I didn't report on in a meta-analysis I submitted on phonics was that the peer-reviewed studies actually, on average, showed higher results than the non-peer-reviewed studies, because I included both peer-reviewed and non-peer-reviewed studies. And I think that's because if they get a good result, they're more motivated to go through a journal, whereas if they get a bad result, they might slap it up on their website, but they might not necessarily go through the process of peer review because they're not as proud of the result.
And, you know, Jacob Cohen, in his original 1988 book, acknowledged the risk in having preset effect sizes, and he said, nevertheless, I think there's helpfulness in providing guidance. But I think you're absolutely right. You know, the Kraft guidance makes sense if you're restricting yourself to standardized, relatively insensitive, distal assessments. But it's completely misleading if you've got an intervention that's been conducted over a much shorter timescale.
I mean, I share your concern about publication bias, which is why I think pre-registration is so important. I'm beginning to recognize that in medicine now. You know, the trouble in medicine is that a drug manufacturer can test the same drug 20 times, find one significant effect, even though it's not working, because of chance variation, and publish only that result. And there's a guy called Ben Goldacre in England who's been trying to get
people to register their trials. And even when failure to pre-register, or failure to actually follow the recommendations of the pre-registration, is pointed out, journal editors are still not willing to actually address these issues. So we have a real problem all the way through the system, in the way the research is conducted and the way the research is published. So, you know, I would like to see
journal editors in education routinely requiring pre-registration. Yeah, I agree. Basically, if it wasn't pre-registered, we're going to assume that you did this experiment 20 times.
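The arithmetic behind the "20 times" worry can be sketched in a few lines. This is only an illustration, assuming the trials are independent, the treatment has no real effect, and each trial is tested at the conventional 0.05 significance level; the function name is made up for this sketch.

```python
def chance_of_false_positive(trials: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `trials` independent tests of a
    treatment with no real effect comes out 'significant' at level alpha."""
    return 1 - (1 - alpha) ** trials


# Assumed scenario: the same ineffective treatment is tested repeatedly.
for n in (1, 5, 20):
    print(f"{n:>2} trials: {chance_of_false_positive(n):.2f}")
```

Under those assumptions, 20 attempts give roughly a 64% chance of at least one spurious significant result, which is why knowing how many attempts were made matters when reading a single published finding.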
¶ Reproducibility, Sample Size, and Control
You know, there's this wonderful idea from Andrew Gelman, based on a short story by Jorge Luis Borges called The Garden of Forking Paths. You know, I want to know the decisions the researcher took in doing the analysis. And you mentioned this idea of reproducibility or replicability earlier on. When Paul Black and I did our narrative review, and this is back, you know, 25 years ago,
one of the things we tried to do was to make our review of research replicable. So we actually documented exactly how we searched the journals. We listed the journals we had searched. We talked about how we had done the review so that others could follow behind and decide if they agreed with the methods we'd used. And I think we talk about replication in psychology and education as being really important. I think we need to think about replication
in meta-analysis as well. Would a different meta-analyst, coming along with the same question, produce similar findings? And I think people are now very good at reporting the selection criteria. Yeah, yeah, I agree with all of that. And, you know, honestly, I think
publication bias, not so much in terms of the meta-analysis authors, but in terms of the people doing the underlying studies, is probably my number one concern. It's just, I assume that, for lack of a better word, there's a lot of bullshit in education research, because we have so much publication bias. I would assume that if we had independent pre-registered studies, the average effect size would come down by more than half. That would be my guess.
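How a "publish only if significant" filter can inflate an average effect size is easy to show with a small simulation. This is a rough sketch, not a model of any real literature: the true effect (0.2), the group size (50), and the filter itself are invented for illustration, and each observed effect is simply assumed to be normally distributed around the true effect with standard error sqrt(2 / n per group).

```python
import math
import random
import statistics


def simulate_published_effects(true_effect=0.2, n_per_group=50,
                               n_studies=10_000, seed=0):
    """Return (mean of all simulated effects, mean of 'published' effects).

    Simplifying assumption: each observed effect size is normal around the
    true effect with standard error sqrt(2 / n_per_group).
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)
    observed = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # Hypothetical filter: only positive, significant results get published.
    published = [d for d in observed if d / se > 1.96]
    return statistics.mean(observed), statistics.mean(published)


all_mean, published_mean = simulate_published_effects()
print(f"mean of all studies:       {all_mean:.2f}")
print(f"mean of published studies: {published_mean:.2f}")
```

In this toy setup the published average sits well above the average of all studies. It does not tell us how large the inflation is in real education research; that depends on how strong the filter actually is.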
Right. So a concrete suggestion would be, because most people who do empirical research at universities need to have this approved by an ethics committee, that the ethics committee should require pre-registration, and that any study funded by public funds, whatever the funding agency is, if it's public money, if it's, you know, a government agency, then again pre-registration should be required
so that we can make sense of the results. How many bites of the cherry did they get? Because without that information, you can't make sense of the result. Well, let's talk about something that you and I have actually chatted a little bit about on Twitter. And you mentioned it the first time I ever interviewed you. And I don't think I fully understood the topic the first time. And that's sample size. I think there's this idea that
the larger a study is, the more accurate it is. And there's a kernel of truth in that for sure, I think, because we have a more representative sample. But my concern with the really large sample studies is that they're often comparing against business as usual, which, you know, means we have a ton of experimental variables. And I kind of assume that a really large study has sort of a random result sometimes, because
we can't really isolate the variable we want to isolate. Whereas some of these really small studies, they sometimes have controlled for everything. You know, the teacher in both groups is the same. The instruction is the same minus the one difference that we're testing. And I actually, I really don't know what I like better, because I can really see the pros and cons of each type of study. Absolutely.
You know, I like Doug Rohrer's study with his colleagues on the effects of interleaved practice, because there they got teachers in Florida to teach their honors seventh grade classes. And so each teacher, typically they had four classes. Each teacher taught two of their classes one way and two of the classes the other way, using interleaved versus blocked or massed practice. And so you really
knocked out the idea that the teachers in the two groups were different. You've got the same teachers teaching two different methods. But of course, you haven't then got the generalizability to other teachers. You've reduced the number of teachers you're studying. There's always this trade-off.
¶ Bigger Sample Size Not Always Better
I'm very critical of people who reject a study because they say the sample size is too small. That's not a legitimate basis for rejecting the results. If the result is significant, then you can't say the sample size is too small. The correct criticism is that it's unrepresentative. My favorite example of that is the 1936 Literary Digest poll in the United States.
Literary Digest included a little postcard to all its readers asking them to say which way they were going to vote in the coming presidential election in the United States. And their responses suggested that Landon would win an overwhelming majority. A man called Gallup had actually decided to do some polling of his own, but he made sure to get his much smaller sample representative of the American population.
And he realized that Landon wouldn't win. And so it's a lovely example of the fact that bigger does not necessarily mean better. So if both samples are equally representative, I want a bigger sample. The danger is the bigger sample is often less representative of the population you want to generalize to. And therefore, sometimes the smaller sample, if it's genuinely randomly drawn, is better than the large convenience sample that people often claim
is superior. Yeah. Well, that sort of makes me think about one of the problems I think that we have in any type of statistical analysis of education results, which is that there's a lot of noise, you know.
I think sometimes when people read a meta-analysis, especially when they're looking at the moderator analyses, they assume a really precise difference in these effect sizes. But, you know, honestly, as someone who's tried to do several meta-analyses, now, none of mine are peer-reviewed yet; I'm trying to get them peer-reviewed.
But I assume a lot of that data is actually just random noise differences. And the nice thing, I think, about a really large sample, or even a meta-analysis with a very large number of studies in it, is I think it kind of helps to sort out the noise. And I'll give an extreme example of John Hattie, who, in my last interview with you, you talked about some criticism of his methods, and actually I agree with every one of your criticisms.
And yet, when I look at the mean effect size he finds for an intervention, I often then think about what meta-analysis was the best on that topic, and I actually find that it's often very close. And I think what's going on here is that because the sample size is just ginormous, we sort of somehow managed to get through some of the random noise almost accidentally.
And I think the same thing can happen when we have that really large RCT with, you know, 10,000 students or 3,000 students per group. It's almost as if, because the standard deviation is not as easily influenced by outliers, we sort of get more to the center of what we'd actually expect that outcome to be. What are your thoughts on that? Well, I think that sometimes you get lucky. And I think my criticism of John Hattie is that he's assuming that the studies that get done
and get published are a random sample of all the studies that might have got done and might have got published. So Leif Nelson, in his Data Colada blog, points out that the true effect size does not exist, because to get the true effect size you need to actually find the average of all the experiments that were significant and reported, all the experiments that were non-significant and reported,
all the experiments that were significant and not reported, all the experiments that were not significant and not reported, all the experiments that were not conducted but would have been significant if they had been conducted, and so on. So basically...
¶ Weighting, Robustness, and Researcher Bias
The fact is that the experiments that get conducted are not a random sample of all the experiments that might have got conducted. And typically we find it's the cheapest experiments that get conducted. So, for example, John Hattie records that class size doesn't make much of a difference. And I think he's right about that in terms of the published research, because the class size reduction studies that get published typically have fairly small
amounts of class size reduction, typically of the order of 10, 15 percent, and don't combine it with professional development for the teachers in taking advantage of the smaller classes. So to conclude that class size reduction doesn't work is to conclude that the studies that have been conducted are a random sample of all the studies that might have been conducted. And I think that's where it falls down. I think class size reduction can be effective, especially if
teachers are given support in taking advantage of the smaller classes. So my problem with the meta-analysis is that the weighting of the study basically depends on the standard error. What I'm saying is,
I want other factors to be taken into account in weighting these studies, like, does this study mirror what I know about good practice, for example? And then the difficulty is, it becomes very subjective, but that would then tell you whether this, you know, could make a difference. And the trouble with most educational research is it never tells you what might be. Educational research just tells you what was. And so the danger is concluding that the research that was conducted
is a random sample of that which might be conducted. And I think that's an unsafe assumption. Yeah, I would agree with literally everything you just said, actually. One thing, and this is a very nerdy question, okay, but I'm talking about weighting. Weighting is something that always makes me a little uncomfortable. And I always report my effects as both weighted and unweighted now.
Yeah. And with the weighting method, I have been using standard error, just because it seems to be the most commonly used one. But my concern with standard error weighting is that whenever you weight an effect size, in a sense, you're actually changing the results of the study.
And that always makes me a little uncomfortable, to say, well, because this study had a smaller standard error, I'm going to change the weight of it to be bigger. Or because the study had a higher standard error, I'm going to change the weight of it to be lower. Truthfully, I'm torn on whether or not that is a valid practice. I like the idea better, in a way, of just including non-weighted effects but doing a regression analysis and showing how those results changed across
different variables. I'm curious to hear your thoughts. No, I think it's absolutely right. I think it goes back to the point we were making earlier: let's present the evidence and allow the reader to draw their own conclusions. And so I think, how much of a difference does weighting versus not weighting make to the results? If it makes a big difference, well, why? I want to see why that is. If the weighted result is a much, much bigger effect size,
is it a particular study that's skewing the whole analysis? So I think that anything that allows the reader to make choices about how they interpret the data, I'm in favour of, and certainly I've been very much in favour of reporting both weighted and unweighted analyses. Just so we can say, does that make a difference? And I can say for sure, sometimes it makes a huge difference. Yeah. And therefore, how much faith do I put in the results? You know, it gives me caution.
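The weighted versus unweighted comparison can be made concrete with a small sketch. It assumes one common scheme, inverse-variance weighting (each study weighted by 1 / SE squared), which may not be the exact method either speaker uses, and the three studies are invented numbers. The leave-one-out loop is one simple way to ask whether a single study is driving the weighted result.

```python
def pooled_means(effects, std_errors):
    """Return (unweighted mean, inverse-variance weighted mean)."""
    weights = [1 / se ** 2 for se in std_errors]
    unweighted = sum(effects) / len(effects)
    weighted = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return unweighted, weighted


# Hypothetical studies: a small noisy one with a big effect, and a large
# precise one with a small effect.
effects = [0.80, 0.40, 0.10]
std_errors = [0.40, 0.20, 0.05]

unweighted, weighted = pooled_means(effects, std_errors)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")

# Leave-one-out: recompute the weighted mean without each study in turn.
leave_one_out = []
for i in range(len(effects)):
    rest_d = effects[:i] + effects[i + 1:]
    rest_se = std_errors[:i] + std_errors[i + 1:]
    _, w = pooled_means(rest_d, rest_se)
    leave_one_out.append(w)
    print(f"without study {i + 1}: weighted = {w:.2f}")
```

With these made-up numbers the unweighted mean is about 0.43 and the weighted mean about 0.13, and dropping the third study moves the weighted mean to 0.48. That is the kind of gap a reader would want to see reported and explained.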
If the garden of forking paths decisions, if the decisions the researcher took make that much of a difference, then I'm concerned. And that's why I think that typically we have one forking path, where the researcher just takes all these decisions and you don't know whether they were just taken, you know, in the absence of knowledge of the results, or whether the researcher is HARKing, hypothesizing after results are known.
Are they making those decisions according to what is likely to maximize the statistical significance of the result so they can actually get the result published? So I think that for me, anything that minimizes...
the number of decisions that the researcher takes in following through to the conclusion. You have to make certain assumptions, otherwise the analysis becomes unwieldy. But I think allowing the reader to make a decision about, how do I feel about the unweighted versus the weighted results, I think in general leads to more intelligent analysis or interpretation of the results. Yeah, I agree. And I will admit that sometimes I have made coding decisions just because I can't handle making another
¶ Multiple Meta-Analyses and RCT Limits
layer of analysis, because every time you add another, like, layer or fork in the path, essentially you double your work. Absolutely. So if you put in 16 types of analysis, man, you've got a lot of work ahead of you. Absolutely. And excluding outliers. Again, good reason to do it. But often, you know, I mean, I think that...
That's where the pre-registration is really important. If you declare in advance the kinds of decisions you're going to take before you see the data, then I think you end up with much more trustworthy analyses. If the researcher is hypothesizing after results are known, HARKing, well, if I just delete this value, I'll find a significant effect. That's when it gets a bit dodgy. I suspect that an awful lot of that happens. I'm sure. And I'd like to minimize that through pre-registration.
It's not the pre-registration so much, it's the researcher thinking through what analyses they're going to do before they see the data. If that's been done, then I'm very much more comfortable with the results. That's a really good point.
You know, you said something earlier on in the conversation that I didn't go back to at all. And it's been bothering me that I haven't had a chance to go back to it. So I'm just going to make a completely side tangent point. And that's that you talked about how it's important to have replication of meta-analyses.
And I had a conversation a couple of years ago with Dr. Corey Peltier, and I said to him, you know, I don't really trust something as science until we've had a meta-analysis on it showing a significant effect. And he replied to me, I don't trust something as science unless we've had multiple meta-analyses showing the same result. And I think he was right. Despite the fact that I hope my meta-analyses pass peer review, I would hate
for anyone to read one of those papers and be like, well, this is now my only definitive guide as to whether or not this works. I actually agree that we need multiple meta-analyses on each topic, just because it is such a beast of a project to take on a meta-analysis. And there's so many decisions that you need to make that I really do want to see replication of meta-analyses in general. Right. So those replications will give us a handle
on the robustness of those findings to different sets of assumptions. So that if different researchers making different sets of assumptions produce similar findings, we can be relatively confident that the result did not depend on the particular decisions taken by that researcher. Without multiple replications in the same field, we can't know that. We don't know whether a different researcher would have produced a different result. And I think, you know, if you're talking about science,
and, you know, we can say that education need not be a science, but if you were talking about science, then I think replicability is crucial. The idea is that the results shouldn't depend on the particular whims of that particular researcher. The idea is it's knowledge without a knower in science. And I think that's really important. And there are places where, in education, I've published things myself
which rely on the particular experiences of the person that you're investigating. Single case study research, I think, has a role to play. All these things, I think, are important. But if you're attempting to put a quantitative value on the effects of particular kinds of interventions, you need to know whether somebody else investigating the same field would find something similar, because otherwise you don't know how much faith to put in the result.
Yeah, I completely agree. One question that I see come up a lot, and especially, I think, from, for lack of a better word, disciples of Dr. Robert Slavin, who, by the way, was brilliant, although I disagree with a lot of what he said, is this idea that an RCT is more valuable than a meta-analysis. And I'm curious as to your thoughts on that.
I personally think it's a bit of a false dichotomy, but maybe I shouldn't say that before I've heard your answer. I mean, I just think it's an extraordinary statement, because my initial reaction is, depends on the RCT. And it depends on the meta-analysis. So I can think of some RCTs that are basically kind of definitive, in that they are so broad in scale. The thing is that RCTs
prioritise certain kinds of... Let me take a step back. So when we interpret a result of an experiment, we want to conclude that it has a particular meaning. And there might be different ways of interpreting the results. So when I talk about this, when I teach this, I talk about the "ah, but what if". So the "ah, but what if": you know, I teach this new mathematical approach to these two different groups of students,
and my preferred method produces better results. Ah, but what if the students that I taught in the old way had a lower prior achievement? Ah, well, I did a benchmark. I did a pretest. Okay. But how do we know if you treated them differently? How do we know they weren't different in some other way? And that's the logic of the RCT. The logic of the RCT is the treatment group, sorry, the control group stands in for the treatment group who didn't get the treatment.
We always want to look at the students who are treated in this new way and what would have happened if we taught them in the old way. And in the RCT, the treatment group, we give them the new experimental treatment, and the control group stands in as a proxy for the treatment group who didn't get the new way of teaching. So conclusions about the treatment group versus the control group
are pretty sound. So if my experimental group outperforms the control group by a certain margin, I can be reasonably sure, given the size of my experiment, that that result was not a fluke. I can conclude that the experimental group was significantly different from the control group. I am not entitled to conclude that that finding would generalize to other groups. And this is where RCTs go wrong. You see, we forget
that the schools or the teachers who agree to be involved in an RCT and get assigned to the treatment group, they're not the same as the people who did not volunteer to be included in the experiment in the first place.
¶ Generalizability and Nuance in Research
So the only generalizations that are warranted with an RCT are comparing the treatment group with the control group. Generalizations beyond those two groups are not warranted, because we don't know that they're representative, because they volunteered to be in the experiment before they were assigned to the control group. So the meta-analysis actually provides
better support for generalizability, because it's researched more different cases, more different contexts. So I can certainly envision situations in which the meta-analysis would be superior to the RCT, because it's got more different contexts of application. So I think it's a much more nuanced thing than just one's better than the other. I mean, you know, we're always trying to generalize from the context in which we did the research to other contexts.
Yeah, I completely agree with everything you just said. And, you know, I can think of some rare circumstances in which an RCT might be more valuable. For instance, you know, if I have a meta-analysis from the 1980s on 10 case studies, or an RCT published, you know, last year with, you know, 5,000 students in it, I might say the RCT is more likely to be accurate. But I think that's a pretty extreme
hypothetical example I've created. And I think it'd be pretty rare that you would find that extreme example in practice. Absolutely. So I would say that the key difference between the RCT and the meta-analysis is just the generalizability. What kind of generalizations are warranted? And with the RCT, the generalizations are basically comparing the control group with the treatment group. With the meta-analysis, it's a much wider range of application.
Although the quality of our evidence is weaker. So you've got this trade-off, you've got a wider context of generalization, but more assumptions that you need to make to draw firm conclusions. So again, there's always going to be a trade-off here. The really important thing is what kind of question are you trying to answer? And typically in educational research, we are trying to work out, will this reform produce improvements for our students?
in all the different contexts of application. And often we think that one RCT will do that for us. I mean, maybe the classic example here is the Tennessee STAR study, where they randomly allocated students and teachers to small classes, 13 to 17, larger classes, 22 to 25, or large classes with an aide.
And so what they found was that the students who were allocated to the smaller classes did better over first grade, second grade, third grade. They were 11 percentage points more likely to graduate high school. So people thought small classes are great. But when they implemented it in California, the class size reduction program was done statewide. And the crucial problem was the supply of additional teachers.
The Tennessee STAR study only needed 50 extra teachers. And it's reasonable that you might find 50 extra teachers. But finding an extra 100,000 teachers or 50,000 teachers... The high-quality ones. High quality, well, yeah, exactly. The whole point is they weren't. So they were giving emergency permits to teachers in California, and they reduced class size and lowered student achievement in many districts, because they were employing people who shouldn't have been teachers.
So that was a key problem. So yes, it doesn't work if teacher recruitment is challenging. The other thing that people forgot about the Tennessee STAR study was all these classes were in the same elementary school. So you needed an elementary school with at least 55 students to have the 13 in the small group, the 22 in the larger group, and the 22 in the larger group with a teacher aide.
Where are you going to find 55 students, 57 students in kindergarten? In urban schools. There are very few rural schools in Tennessee that had 55 students. So rural schools were not able to participate in the experiment because they didn't have enough students. So the result was stable for urban schools in Tennessee. It would be unwise to generalize to rural schools about the effects of class size reduction, because no rural schools were included, because they didn't have enough students.
So that's why we need to get into the weeds on these research studies, because what you're allowed to conclude, what you're allowed to generalize from the context of the research, depends on how much you know about the context of that research versus the schools to which you want to generalize. That's why it's always going to be messy. There's never going to be a kind of slam dunk final answer on any educational research question. Yeah, I completely agree.
You know, one of the assumptions that I see made a lot, that I really think is problematic, is that the really rigorous RCT is accurate, or precise might be the better word, and that we've removed the statistical noise from the experiment by having rigorous enough conditions.
But I think that's almost incredibly difficult to do. And I'll go back to the example of Reading Recovery for a second. And I like looking at Reading Recovery when looking at research methodology because all the studies are on the same grade. All the studies are on the same program.
Most of them are randomized. And actually, there's very little indication of publication bias. So whatever you want to say about Reading Recovery researchers, from what I can tell, they've been very honest in publishing studies when they have negative or positive results. But if you look at the study that, in my opinion, is the most rigorous in design, it's Holleyman 2013. And it produced a large, statistically significant effect size. However...
Most of the studies don't show effect size that is as large as Holleyman 2013. And Holleyman 2013 also tends to be one of the smallest studies. So then you have to ask yourself, is it the most representative because it's the most rigorous design? Or are the other studies, which have larger sample size and have a more homogenous result, more representative? And I tend to assume the second. But, you know, as you just pointed out, and I agree with you.
And any assumption is kind of dangerous in education because it's really hard to make really final definitive findings in education.
¶ The Reading Recovery Enigma
Reading Recovery is really interesting, because I've tried to make sense of the available research and I can't. You know, often I can find a reason why different research studies come up with different answers. But, I mean, the Reading Recovery stuff I just don't get. So I did look at research
in Australia and New Zealand. And in one case, I can't remember which way around it was, but either in New Zealand or in Australia, it benefited high achievers and penalised low achievers. And in the other country, the effect was the opposite way around. So there's something going on here that I really don't understand. It could be that maybe it's not the same intervention, maybe the way that people are trained in different countries is different.
And I've changed my mind about Reading Recovery just because I've now become convinced. This is Carl Sagan's point, that extraordinary claims require extraordinary evidence. Given Reading Recovery's inconsistency with lots of the other research about the best way to teach reading,
I think I want pretty robust results before I actually conclude that they're right. Now, I'm not saying that Reading Recovery doesn't work, but I do want two things. I want robust results and I want long-term studies. Because for me, the issue about Reading Recovery is not whether kids improved reading. It's whether that trajectory has continued. And the crystallizing point for me is this. Triple cueing, three-cueing,
guessing words, is fine if you want to derive the meaning of a particular text. So if I want to know what this particular text is saying, then, you know, three-cueing is a good system. It's a lousy way to teach students to read different texts. So often the difference is between teaching kids to read and teaching kids to derive meaning from a specific text that is in front of them.
Now, three-cueing works for the second case; it doesn't work for the first. So often there are these differences that I just want to explore more. But of course, then you've got the other issue, where I've defended Reading Recovery, in that everybody says phonics is great, we should actually ban Reading Recovery. But how are you going to get that to scale? The fact is that Reading Recovery has a mechanism for reaching a large number of students in a reasonable amount of time.
And if it's definitely damaging, then maybe we shouldn't do it. But if it's only partially successful, then the fact that it can be delivered and rolled out is part of the policy conversation. In other words, if you've got a wonderful way of teaching reading that is 100% successful, if you do it, but we can't figure out how to train anybody else to do it, then that's not policy relevant.
Yeah, I agree with a lot of what you said. There's very few programs that have as many RCTs done on them as Reading Recovery. But if you look deeply at those RCTs, they all show very different results, which is why I like to highlight it as an example of why any RCT is not necessarily reliable, or precise might be the better word. And
I don't think that's an example of publication bias. I think it might actually be, in part, the result of the opposite, of there not being publication bias. But the other thing, I think you hit the nail on the head, is that it's not really a program, it's a training system. And I think that training system probably does look different in different places. In fact, we interviewed several Reading Recovery teachers, about 10, if memory serves me correct. And they had varying answers to different questions.
That might be a big part of what's going on there. Absolutely. And of course, the other issue, you know, you used the term noise. I often use the term error in the statistical sense of not being part of the model. So if we're getting wide variation in the effect sizes, it could be that these experiments are badly conducted and there's just a whole lot of random effects. My hunch is it's more likely there are crucial variables that we haven't yet identified.
There are crucial variables in either the training or the implementation that explain all these different effect sizes. We just don't know what they are yet. So that's why I think we need a combination of quantitative research with much more kind of qualitative research, to understand what are the theoretical moderators here. So people often say we need quantitative research, but often it's the qualitative research
that tells us what are likely to be the important moderators of effect. Without the knowledge of what the moderators are, we're never going to be able to get beyond "well, the research is inconsistent" or "the effect sizes vary."
¶ Identifying Moderators and Robustness Checks
What I want to know is why they vary. And that seems to me a way that we might make progress, is to identify the reasons for the variation. I agree. I think it's incredibly difficult. And I will say, for the four papers I have currently submitted to peer review, each one contains a regression analysis where I tried to identify the variables that I thought were going to have the greatest impact.
And I was clearly wrong, because my regression analysis didn't really show regression of effect. I regressed the variables, and yet the results across that regression analysis looked random to me, for lack of a better word. And I felt more comfortable with the mean effect than I did with any specific effect in that regression.
Right, but doing those regressions is a really important part of the discovery process, because if you hadn't put them in, a reviewer would say, ah, but you didn't do this, this, and this. And the fact that they don't produce a big difference in the findings attests to the robustness of your result. And this is very common in economics. You know, if you look at it, when people do economic modeling, you know,
they actually do robustness checks: how much do the results that we get vary according to the assumptions we made? And I think I'd like to see much more of that in educational research. Just, you know, you present the findings,
and then you do a series of robustness checks to see whether the results would be different if you'd made different assumptions in the analysis. That would give me much more faith that the results weren't just, you know, the consequence of decisions that were taken by the researcher on a whim. Yeah, no, that's a really good point.
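One way to picture a series of robustness checks is to rerun the same pooled estimate under every combination of a few analytic decisions and look at how far the answer moves. The sketch below is only illustrative: the six studies, the four yes/no decisions (weight by standard error, drop large outliers, peer-reviewed only, randomized only), and the outlier cutoff are all invented, and real robustness checks in economics or education would be chosen to fit the study.

```python
from itertools import product

# Hypothetical studies; every value here is invented for illustration.
STUDIES = [
    {"d": 0.15, "se": 0.05, "peer_reviewed": True, "randomized": True},
    {"d": 0.30, "se": 0.10, "peer_reviewed": True, "randomized": True},
    {"d": 0.45, "se": 0.20, "peer_reviewed": True, "randomized": False},
    {"d": 1.40, "se": 0.40, "peer_reviewed": True, "randomized": True},
    {"d": 0.05, "se": 0.15, "peer_reviewed": False, "randomized": True},
    {"d": 0.25, "se": 0.25, "peer_reviewed": False, "randomized": False},
]


def pooled_effect(studies, weighted):
    """Mean effect, inverse-variance weighted if `weighted` is True."""
    weights = [1 / s["se"] ** 2 if weighted else 1.0 for s in studies]
    return sum(w * s["d"] for w, s in zip(weights, studies)) / sum(weights)


def robustness_grid(studies, outlier_cutoff=1.0):
    """Pooled effect under every combination of four yes/no decisions."""
    results = {}
    for spec in product([False, True], repeat=4):
        weighted, drop_outliers, peer_only, rct_only = spec
        kept = [
            s for s in studies
            if not (drop_outliers and abs(s["d"]) > outlier_cutoff)
            and not (peer_only and not s["peer_reviewed"])
            and not (rct_only and not s["randomized"])
        ]
        results[spec] = pooled_effect(kept, weighted)
    return results


grid = robustness_grid(STUDIES)
print(f"{len(grid)} specifications")
print(f"smallest pooled effect: {min(grid.values()):.2f}")
print(f"largest pooled effect:  {max(grid.values()):.2f}")
```

Four binary decisions already give 16 versions of the analysis. If the smallest and largest pooled effects are close, the finding is robust to those choices; if they are far apart, the reader knows the conclusion depends on decisions the researcher made.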
Well, I really want to thank you for your time for this discussion. I think it will be the special viewer or listener who gets all the way through this. This was by far the nerdiest podcast I've ever done, although
it might be my most enjoyable, because I got to sit here and talk to one of my heroes about statistics for an hour. So I really appreciate you coming on the podcast and having this chat with me. And I hope that we have a lot of viewers and listeners who will get all the way through, because I think they'll learn a lot from what you have to offer in this discussion. Thank you. It's been fun.
