The 20% Statistician: The correlation between original and replication effect sizes might be spurious

Friday, January 29, 2016

The correlation between original and replication effect sizes might be spurious

In the reproducibility project, original effect sizes correlated r=0.51 with the effect sizes of replications. Some researchers find this hopeful.

Less-popularised findings from the "estimating the reproducibility" paper @Eli_Finkel #SPSP2016 pic.twitter.com/8CFJMbRhi8
— Jessie Sun (@JessieSunPsych) January 28, 2016

I don’t think we should be interpreting this correlation at all, because it might very well be completely spurious. One important reason why correlations might be spurious is the presence of different subgroups, as introductory statistics textbooks explain.

When we consider the Reproducibility Project (note: I’m a co-author of the paper) we can assume there are two subgroups: one consisting of experiments that examine true effects, and one consisting of experiments that examine effects that are not true. This logically implies that for one subgroup, the true effect size is 0, while for the other, the true effect size is an unknown larger value. Different means in subgroups are a classic case where spurious correlations can emerge.

I find the best way to learn to understand statistics is through simulations. So let’s simulate 100 normally distributed effect sizes from original studies that are comparable to the 100 studies included in the Reproducibility Project, and 100 effect sizes for their replications, and correlate these. We create two subgroups. Forty effect sizes will have true effects (e.g., d = 0.4). The original and replication effect sizes will be correlated (e.g., r = 0.5). Sixty of the effect sizes will have an effect size of d = 0, and a correlation between replication and original studies of r = 0. I’m not suggesting this reflects the truth of the studies in the Reproducibility Project – there’s no way to know. The parameters look sort of reasonable to me, but feel free to explore different choices for parameters by running the code yourself.
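The setup described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the code behind the figures in this post. The function names, the seed, and especially the standard deviation of 0.25 for the effect sizes are assumptions made for the sketch, since the text only fixes the subgroup sizes, the means, and the within-subgroup correlations. The cut-off of original effects at d = 0 is left out.

```python
import numpy as np


def simulate_studies(n_true=40, n_null=60, d_true=0.4, r_true=0.5, sd=0.25, seed=1):
    """Simulate (original, replication) effect-size pairs for two subgroups.

    sd = 0.25 is an assumed value; the post does not specify the spread.
    """
    rng = np.random.default_rng(seed)
    # True effects: both effect sizes centered on d_true, correlated r_true.
    cov_true = sd**2 * np.array([[1.0, r_true], [r_true, 1.0]])
    # Null effects: both centered on 0, uncorrelated.
    cov_null = sd**2 * np.eye(2)
    true_pairs = rng.multivariate_normal([d_true, d_true], cov_true, size=n_true)
    null_pairs = rng.multivariate_normal([0.0, 0.0], cov_null, size=n_null)
    pairs = np.vstack([true_pairs, null_pairs])
    is_true = np.concatenate([np.ones(n_true, dtype=bool), np.zeros(n_null, dtype=bool)])
    return pairs[:, 0], pairs[:, 1], is_true


def corr(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


original, replication, is_true = simulate_studies()
print("all studies:   r =", round(corr(original, replication), 2))
print("true effects:  r =", round(corr(original[is_true], replication[is_true]), 2))
print("null effects:  r =", round(corr(original[~is_true], replication[~is_true]), 2))
```

Changing `seed` and rerunning gives a feel for how much a correlation based on 100 studies moves around from one simulated sample to the next.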

As you see, the pattern is perfectly expected, under reasonable assumptions, when 60% of the studies are simulated to have no true effect. With a small N (100 studies gives a pretty unreliable correlation, see for yourself by running the code a few times) the spuriousness of the correlation might not be clear. So let’s simulate 100 times more studies.

Now, the spuriousness becomes clear. The two groups differ in their means, and if we calculate the correlation over the entire sample, the r = 0.51 we get is not very meaningful (I cut off original studies at d = 0, to simulate publication bias and make the graph more similar to Figure 1 in the paper, but it doesn't matter for the current point).
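To see how much of the overall correlation comes from the difference in subgroup means, the pooled correlation of such a two-subgroup mixture can be written down with the law of total covariance and checked against a large simulation. The sketch below assumes both subgroups share the same standard deviation (0.25, a value chosen for illustration and not taken from the post) and ignores the cut-off at d = 0; the function names are hypothetical.

```python
import numpy as np


def pooled_r(p_true, d_true, r_true, sd):
    """Expected correlation over the whole sample for a two-subgroup mixture.

    Assumes equal sd in both subgroups, means (d_true, d_true) and (0, 0),
    within-subgroup correlation r_true for true effects and 0 for null effects.
    """
    # Covariance contributed purely by the difference in subgroup means.
    between = p_true * (1 - p_true) * d_true**2
    # Covariance contributed by the correlation inside the true-effect subgroup.
    within = p_true * r_true * sd**2
    return (within + between) / (sd**2 + between)


def simulated_pooled_r(n, p_true, d_true, r_true, sd, seed=1):
    """Correlation over n simulated studies drawn from the same mixture."""
    rng = np.random.default_rng(seed)
    is_true = rng.random(n) < p_true
    rho = np.where(is_true, r_true, 0.0)
    mean = np.where(is_true, d_true, 0.0)
    z_orig = rng.standard_normal(n)
    # Build the replication noise so that it correlates rho with the original.
    z_rep = rho * z_orig + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    original = mean + sd * z_orig
    replication = mean + sd * z_rep
    return float(np.corrcoef(original, replication)[0, 1])


params = dict(p_true=0.4, d_true=0.4, r_true=0.5, sd=0.25)
print("expected: ", pooled_r(**params))
print("simulated:", simulated_pooled_r(10_000, **params))
# Same mixture, but with no correlation at all inside either subgroup.
print("means only:", pooled_r(p_true=0.4, d_true=0.4, r_true=0.0, sd=0.25))
```

Under these assumed values the formula gives a pooled correlation of about 0.50, and still about 0.38 when the correlation inside the true-effect subgroup is set to 0, so most of the overall correlation is produced by the difference in means alone.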

So: be careful interpreting correlations when there are different subgroups. There’s no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything.

21 comments:

Tal Yarkoni, January 29, 2016 at 7:04 PM
I'm not sure this is right. Your simulation assumes that the probability of a study belonging to the spurious vs. real subgroups is completely independent of the observed effect size in the original study. If that's true, your analysis goes through as stated--but only because you've already assumed your conclusion, as you're simply stipulating that the size of an observed effect in a sample can provide no indication of the true population value. This seems to me to be an untenable assumption: holding sample size constant, there is necessarily a positive relationship between population and sample effect sizes. Even if you allow for sample size variability, the relationship is likely to be positive, so why would you assume it's 0?

A separate issue is that it's not clear what the justification is for assuming that there are discrete sets of studies, one with all population effects == 0, and one with non-zero effects distributed N(x,y). This seems to me implausible in the extreme. What causal structure could the world possibly have that allows a large proportion of effects to be exactly zero, and a discrete and non-overlapping subset to be centered on a non-negligible value *other* than zero? Surely it's much more reasonable to model the population of studies as some continuous distribution, probably centered at or near zero, and perhaps with fat tails. But if you start from that prior, I'm not sure the simulation goes through, even if you ignore the issue raised above.

Unknown, January 29, 2016 at 10:44 PM
I am also not sure this is right. I'm pretty sure it's not, actually.

What about effects that are true but operationalized poorly, or effects that are false but, due to a flaw in the theory and operationalization, turn out true? Or, say, studies that were actually different in the replication than in the original? There seems to be much more to say, but I know you won't learn! :D

Ulrich Schimmack, January 30, 2016 at 1:06 AM
What do you mean by spurious? Evidently, the observed effect sizes in the original and the replication study are not causally related to each other, because both are a function of the true effect size in the population. This pattern of correlation is typically called spurious, so saying that the relationship is spurious does not invalidate the importance of finding a correlation between A and B, which shows that there is variation in the common cause C.

Variation in the common cause C, the population effect sizes, can take any form. Some of the true effect sizes may be zero. As long as some are non-zero, we expect a correlation between the observed effect sizes.

Ergo, we can conclude from the correlation that (a) not all population effect sizes are zero, and (b) that there is variability in the population effect sizes.

Anonymous, January 30, 2016 at 12:55 PM
I really don't think this shows anything at all. Yes, if you assume there are two subgroups with different means, then there will be a correlation if you ignore those two subgroups (even if there is no correlation within each group).

You write: "There's no way to know what is going on. The correlation of 0.51 between effect sizes in original and replication studies might not mean anything." Or it may mean that the correlation reflects a true relationship between the effects in the original studies and the replications. If I started out with that assumption and simulated some data that way, I could also get that 0.51 correlation. But that doesn't demonstrate that my assumption is any more correct than the one you are making. This is purely tautological.

Unknown, January 30, 2016 at 5:35 PM
Appreciate the clean simulation, though it is far from Simpson's paradox or the like. I also don't think the simulated data match the reproducibility data well, because by design you have a lot of points clustered around “Original” Effect Size d=0, while in the reproducibility paper there are very few in that range, though there is a cluster around d=0.2 or so. Thus, although you get a similar correlation, that is not enough to conclude you are simulating the original data well, so your conclusion that you could get the correlation of ~0.5 by the two-subgroup scenario you describe, while true, is not relevant to the actual results. It is simply a mere possibility of what is happening with all (?) experiments perhaps. To make your subgroup scenario relevant to the reproducibility paper, you should actually fit a mixture model on the reproducibility results and examine fit.

Very related point: in your simulation, the “Original” effect size is not really original, but a sample from the true effect size. In the reproducibility paper, the “Original” is the *observed* effect size, where the true is unknown, of certain experiments of interest where usually these d > 0.15 or so. Thus, to adequately simulate that data, you need to censor (in the statistical sense) your d=0 subgroup, at which point it will be clear that your “spuriousness”, as indicated by the separation of the two subgroups, would not be nearly so clean.

Mark Hoffarth, February 2, 2016 at 12:01 AM
I also commented on Twitter, but wasn't able to fit my comment into the character limit.

I don't think the results of this simulation necessarily imply that the correlation isn't meaningful. It could be, or it could not be, but you've demonstrated it's certainly questionable, given that the correlation could arise for a couple of very different reasons.

I largely agree with your conclusion that the correlation has been interpreted incorrectly. But I think it could be interpreted meaningfully, even if the simulation accurately accounts for why the correlation was observed. My thinking is that if we interpret the r = .51 as if it were converted into a logistic regression (with replicate vs. not as a dichotomous outcome regressed on original effect size), it would lead to the conclusion that large effect sizes were more likely to replicate than small effect sizes. This would indicate we should be more skeptical of small effect sizes, because the smaller effect sizes were more likely to fall in the "not replicated" category. This is still useful and I would argue meaningful information, but it would suggest we should actually be more skeptical of small effect sizes, whereas the r = .51 could lead one to come to the completely opposite conclusion, that this correlation is actually good news for small effect sizes.