Is There a Reproducibility Crisis?


Short answer: "No, I don't think so". In 2015 the Open Science Collaboration (Nosek et al 2015) published a highly influential paper in Science which claimed that a large fraction of published results in the psychological sciences were not reproducible. The effort that went into this conclusion was extraordinary. The group selected 100 studies published in peer-reviewed journals and conducted a validatory study for each. For each study, a single finding of scientific interest was selected according to a protocol. The Reproducibility Project (as it was called) set out to determine how many of these findings could be reproduced in the validatory studies. There are various ways of quantifying reproducibility, but the bottom line is that "39% of effects were subjectively rated to have replicated the original result ..." (Nosek et al 2015).

Defining a Hypothesis Test

That looks quite problematic, at first glance. But to understand what is happening, it is important to understand the P-value (aka observed level of significance) and the concept of power and sample size determination. First, we need a null hypothesis Ho and an alternative hypothesis Ha. A null hypothesis represents the current status quo, which we accept as long as there is no contradictory evidence. In a court of law, Ho is "Not Guilty". If no evidence of guilt is presented during a trial,  then the defendant must be judged "Not Guilty".  

Then suppose we conduct a scientific study. We have two groups of people A versus B (young versus old; urban versus suburban; etc). We would like to know if there is a difference in some psychometric construct X between these groups (anxiety; confidence; etc). We might measure X on a sample from each group, then let D be the difference in the average measurements within the groups.  So, D is our test statistic.  We might then have null and alternative hypotheses:

Ho : The average values of X within groups A and B are the same.
Ha : The average values of X within groups A and B are NOT the same.

Now, D is random, but it still contains information which can help us determine which hypothesis is correct. If Ho is correct, we would expect D to be close to zero (but not exactly zero). So we devise a rejection rule, rejecting Ho if  |D| >=  t, where |D| is the absolute  value of the test statistic, and t is a critical value. Scientists usually wish to reject Ho, because the alternative hypothesis Ha represents a scientific finding of interest.

So how do we pick critical value t? Since the test statistic is random, we can't be absolutely certain that our decision rule leads to the correct conclusion, so we need to define two types of error. A Type I Error occurs when Ho is true but is rejected, and a Type II Error occurs when Ho is false, but is not rejected. We also call a Type I Error a false positive, and a Type II Error a false negative, which reflects the idea that rejecting Ho is a positive outcome, that is, we have found what we are looking for.

The level of significance, usually denoted alpha, is the probability of a Type I Error. As is well known, this is usually set to alpha = 0.05, but this is a convention, and not a mathematical necessity (this  value was first suggested by Ronald Fisher). There is nothing special about the number 0.05, but it is useful that a single value is almost universally adopted, because this enforces one single standard for statistical evidence.
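As a minimal sketch of how the critical value t follows from alpha, suppose (as an assumption made only for illustration) that D is approximately normal with mean zero under Ho and has a known standard error. The names critical_value, simulated_type1_rate and se are hypothetical, and the simulation simply checks that the rule |D| >= t rejects a true Ho about alpha of the time.

```python
import random
from statistics import NormalDist


def critical_value(alpha: float, se: float) -> float:
    """Two-sided critical value t with P(|D| >= t) = alpha when Ho is true.

    Assumes D ~ Normal(0, se) under Ho (an illustrative assumption).
    """
    return NormalDist().inv_cdf(1 - alpha / 2) * se


def simulated_type1_rate(alpha: float, se: float,
                         n_sim: int = 100_000, seed: int = 1) -> float:
    """Fraction of simulated studies with no true difference that reject Ho."""
    rng = random.Random(seed)
    t = critical_value(alpha, se)
    rejections = sum(abs(rng.gauss(0.0, se)) >= t for _ in range(n_sim))
    return rejections / n_sim


print(critical_value(0.05, se=1.0))        # about 1.96
print(simulated_type1_rate(0.05, se=1.0))  # should be close to 0.05
```

Changing alpha to 0.01 pushes t outward, which is the sense in which a smaller alpha is a stricter standard of evidence.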

The probability of a Type II Error is usually denoted beta. However, we have to recognize that the alternative hypothesis usually contains multiple values. In our example above, Ha merely states that there is a difference in X, not  how large that difference is.  But this is how we determine a sample size. Usually alpha = 0.05 is fixed. So we need to ask something like, "what sample size is needed to ensure beta = 0.1 when the average difference is really 2.2 units?" Here, 2.2 units is the effect size. This is enough information to calculate the sample size for this study. Then the power of the test is 1 - beta, that is, the probability of correctly rejecting Ho, assuming our target effect size is actually the true value.
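The sample size question above can be sketched with the usual normal approximation for comparing two group means. The effect size of 2.2 units, alpha = 0.05 and beta = 0.1 come from the text; the common standard deviation sigma = 5.0, the equal group sizes and the function names are assumptions introduced only to make the example concrete.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def sample_size_per_group(delta: float, sigma: float,
                          alpha: float = 0.05, beta: float = 0.10) -> int:
    """Subjects needed in EACH group to detect a mean difference delta.

    Normal approximation, two-sided test, equal group sizes, known sigma.
    """
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(1 - beta)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


def power(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting Ho when the true mean difference is delta."""
    se = sigma * math.sqrt(2 / n)          # standard error of D
    t = Z.inv_cdf(1 - alpha / 2) * se      # critical value for |D|
    return (1 - Z.cdf((t - delta) / se)) + Z.cdf((-t - delta) / se)


n = sample_size_per_group(delta=2.2, sigma=5.0)  # sigma is hypothetical
print(n)                       # 109 per group under these assumed inputs
print(power(2.2, 5.0, n))      # just above the 0.90 target
```

Because n scales with 1/delta squared, halving the target effect size roughly quadruples the required sample size.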

What is a P-value?

So far, we have not referred to the P-value, which seems to drive much of the discussion of reproducibility.    There are several equivalent definitions, and none is really intuitive. Here is one:

"Given the observed data,  a P-value is the smallest significance level at which the null hypothesis would be rejected. A P-value is also known as the observed level of significance."

Thus, we can summarize the outcome of the hypothesis test by the P-value. In particular, a test with significance level alpha = 0.05 rejects Ho if and only if the P-value satisfies P <= 0.05.

Note that the smaller the significance level alpha, the more stringent the standard for statistical evidence (a finding that is statistically significant at level alpha = 0.05 might not be statistically significant at level alpha = 0.01). So, the P-value serves as a quantitative index of the strength of the statistical evidence. In practice, many researchers would hesitate to publish an important finding if they had P = 0.049.
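To make the definition concrete, here is a small sketch of a two-sided P-value for the difference D, assuming (for illustration only) that D divided by its standard error is standard normal under Ho. The observed values d = 2.2 and se = 1.0 are hypothetical, as are the function names.

```python
from statistics import NormalDist


def two_sided_p_value(d: float, se: float) -> float:
    """P-value for Ho: no difference, assuming D / se ~ Normal(0, 1) under Ho."""
    z = abs(d) / se
    return 2 * (1 - NormalDist().cdf(z))


def reject(d: float, se: float, alpha: float) -> bool:
    """Reject Ho at level alpha exactly when the P-value is at most alpha."""
    return two_sided_p_value(d, se) <= alpha


p = two_sided_p_value(d=2.2, se=1.0)   # hypothetical observed values
print(round(p, 3))                      # about 0.028
print(reject(2.2, 1.0, alpha=0.05))     # True: significant at 0.05
print(reject(2.2, 1.0, alpha=0.01))     # False: not significant at 0.01
```

The same data are significant at one level and not at another, which is why the P-value itself is more informative than a bare reject or do-not-reject decision.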

Back to the Reproducibility Project

So now we can talk about the role played by the P-value within the Reproducibility Project. First, the criterion for selection of a study includes P <= 0.05, that is, the finding is considered statistically significant by conventional standards.  However, the Reproducibility Project found that for most of the selected studies  the P-value of the validatory study was above 0.05. On this basis a reproducibility crisis was reported.

In 2017 I gave a seminar on the issue of reproducibility in which I argued that the analysis undertaken by the Reproducibility Project was flawed for a number of reasons, and its main conclusion void. I also believe it is important to point out that the arguments on which I base my conclusion are not novel to me, and in some cases are simple invocations of  well established principles of good statistical practice.

Underestimation of Sample Size for Validatory Studies

In one very important sense, the validatory studies of the Reproducibility Project did not reproduce the conditions of the original studies. Rather than simply reusing the original sample sizes, new sample sizes were estimated and used instead. The rationale offered for this was related to protocol standardization, and a method similar (but not identical) to the one I describe above was used.

In particular, the new sample sizes were estimated based on the effect sizes reported in the original papers. This seems reasonable, since a larger effect size means that a smaller sample size is needed to detect a true effect. In our example above, we are interested not only in the existence of a group difference, but in the size of the group difference (i.e. the effect size).

But consider what the P-value represents.  The larger the effect size, the smaller, on average, will be the P-value. This means the effect size can, at least in principle, be estimated from the P-value following a suitable transformation. But remember that studies are only selected for the Reproducibility Project if P <= 0.05. This means that observed effect sizes are truncated from below. In turn, this means that effect sizes will be OVERESTIMATED. This is natural. Suppose we wish to estimate the average age of a population, and after taking a sample we obtain an estimate  A = 35.6 yrs. However,  if we truncated the sample from below by accepting only subjects of minimum age 12 yrs, our estimate would be higher, say A* = 41.6 yrs. This is exactly the type of truncation bias that will affect the effect size estimates made by the  Reproducibility Project. In fact, this possibility is acknowledged in a methodological supplement published with Nosek et al (2015), and is there referred to as "one of the potential challenges for reproducibility".  In contrast, the view of this author is that it is simply a mathematical problem which can be solved.
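The truncation bias described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the seminar calculations: every study has the same true effect (2.2 units), its estimate D is normal with a hypothetical standard error of 0.8, and only studies with P <= 0.05 are kept.

```python
import random
from statistics import NormalDist, mean


def simulate_selected_effects(true_effect: float, se: float,
                              alpha: float = 0.05,
                              n_studies: int = 50_000,
                              seed: int = 2017):
    """Compare the average estimate over all studies with the average over
    only those studies that reach significance (|D| >= t)."""
    rng = random.Random(seed)
    t = NormalDist().inv_cdf(1 - alpha / 2) * se
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    selected = [d for d in estimates if abs(d) >= t]
    return mean(estimates), mean(selected), len(selected) / n_studies


all_mean, selected_mean, fraction_kept = simulate_selected_effects(2.2, 0.8)
print(all_mean)        # close to the true effect of 2.2
print(selected_mean)   # noticeably larger than 2.2
print(fraction_kept)   # roughly the power of the original studies
```

The unselected average is unbiased, while the average over significant studies is inflated, and the inflation grows as the power of the original studies falls.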

What this all means is that for a validatory study of the Reproducibility Project, effect size is being OVERESTIMATED, therefore, sample size is being UNDERESTIMATED, and therefore it is less likely to report a statistically significant finding than the original study.  
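The chain from overestimated effect size to reduced power can be sketched directly. Under the normal approximation, a study sized to reach a planned power at the reported effect has a true-effect-to-standard-error ratio that shrinks in proportion to true effect divided by reported effect. The planned power of 0.90 and the effect values below are hypothetical inputs, not figures taken from the Reproducibility Project.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def replication_power(true_effect: float, reported_effect: float,
                      planned_power: float = 0.90,
                      alpha: float = 0.05) -> float:
    """Actual power of a validatory study whose sample size was chosen to
    give planned_power at reported_effect, when the truth is true_effect.

    Normal approximation; rounding of the sample size is ignored.
    """
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(planned_power)
    # Signal-to-noise ratio the study actually achieves at the true effect.
    ncp = (z_alpha + z_beta) * true_effect / reported_effect
    return (1 - Z.cdf(z_alpha - ncp)) + Z.cdf(-z_alpha - ncp)


print(replication_power(2.2, 2.2))  # about 0.90: no bias, power as planned
print(replication_power(2.2, 2.5))  # about 0.81: reported effect too large
print(replication_power(2.2, 3.3))  # well below the planned power
```

Even a modest overestimate of the effect size costs several points of power, so fewer validatory studies reach P <= 0.05 than the planning calculation suggests.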

Of course, this effect is already well known. For example, an NIH source  on good statistical practice recommends against the use of preliminary studies to estimate sample sizes for future studies:
 
"Why can’t pilot studies estimate effect sizes for power calculations of the larger scale study?"

"Since any effect size estimated from a pilot study is unstable, it does not provide a useful estimation for power calculations. If the effect size estimated from the pilot study was really too large (i.e., a false positive result, or Type I error), power calculations for the subsequent trial would indicate a smaller number of participants than actually needed to detect a clinically meaningful effect, ultimately resulting in a negative trial ... "

Of course, the studies used by the Reproducibility Project were not "preliminary studies" (although the mathematics is the same). So the next question is whether or not the truncation bias effect resulting from the P <= 0.05 selection rule is large enough to compromise the conclusions of the Reproducibility Project. According to my own calculations, presented in the seminar, the answer is "yes". And there is a good reason to expect this. Studies are usually powered so that the sample size is large enough, but not much larger than needed (budgets are finite after all, and granting agencies don't like over-powered studies any more than they like under-powered studies).

To make a long story short, this means that the effect size estimates made  for the validatory studies are quite vulnerable to the truncation bias effect we are describing. And by my own calculations, this bias renders the conclusions of the Reproducibility Project void.

So, What Should the Reproducibility Rate Be?

To answer this question, it must first be understood that the significance level (here alpha = 0.05) gives little guidance. The number alpha is a false positive rate, and is therefore only relevant when the null hypothesis Ho is really true.

To take a limiting case (often a good idea), suppose Ho is ALWAYS true for all studies in some field (like I said, a limiting case). Then of these studies,  1/20 will report P <= 0.05. These studies will make their way into journals, to be validated by the Reproducibility Project. Assuming the protocols were reasonable, about 1/20 = 5% of the findings would reproduce, and that would then be the reported reproducibility rate (not nearly as good as the reproducibility rate of 39% reported in Nosek et al 2015). Nothing wrong with the statistics. It's just that the true reproducibility rate will depend not only on alpha, but also on the power of a test, and also the proportion of studies conducted for which the alternative hypothesis really is true (which we can call the effect prevalence). This is the subject of another influential paper,  Ioannidis (2005), which has the pessimistic title  "Why most published research findings are false".
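The dependence of the reproducibility rate on alpha, power and effect prevalence can be written down in a few lines. This is a simplified model assumed here for illustration (every study has the same power, and the validatory study has the same alpha as the original); it is not the calculation presented in the seminar, and the function names are hypothetical.

```python
from typing import Optional


def positive_predictive_value(prevalence: float, power: float,
                              alpha: float = 0.05) -> float:
    """Share of significant (published) findings for which Ha is really true."""
    true_positives = prevalence * power
    false_positives = (1 - prevalence) * alpha
    return true_positives / (true_positives + false_positives)


def expected_reproducibility_rate(prevalence: float, power: float,
                                  replication_power: Optional[float] = None,
                                  alpha: float = 0.05) -> float:
    """Expected share of significant findings that are significant again."""
    if replication_power is None:
        replication_power = power  # validatory study as powerful as original
    ppv = positive_predictive_value(prevalence, power, alpha)
    # True effects replicate at the replication power, null effects at alpha.
    return ppv * replication_power + (1 - ppv) * alpha


print(expected_reproducibility_rate(prevalence=0.0, power=0.8))  # 0.05
print(expected_reproducibility_rate(prevalence=1.0, power=0.8))  # 0.8
print(expected_reproducibility_rate(prevalence=0.5, power=0.8))  # about 0.76
```

With prevalence 0 the function returns alpha, matching the limiting case above, and even with prevalence 1 the rate cannot exceed the power of the validatory study.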


So, let's try to answer our question. The power (1 - beta) and significance level alpha are quantities that are well defined by the theory of hypothesis testing. The effect prevalence is not a conventional feature of most hypothesis tests (although it may play a role in a Bayesian analysis).  But we can still say something about it.   For example, we can refer to the concept of "clinical equipoise" (Freedman 1987). Suppose for some medical condition I have an experimental treatment B, which I believe may work better than conventional treatment A.   Should I conduct a clinical trial? If I know with certainty that B is preferable, then the answer is no, not just because the trial would be unnecessary, but because it would also be unethical. This is because I would be assigning treatment A to (typically)  1/2 of the subjects, thus denying them what we know to be the superior treatment. The principle of equipoise states that a clinical trial should be undertaken under conditions of maximum uncertainty with respect to the outcome. In this way, we are not assigning to any subject a treatment which is believed to be inferior (even in a probabilistic sense). In our simple two arm example (arms A and B), the effect prevalence would be 50%.

 

It should be added that the concept of clinical equipoise is not universally accepted (see the Wikipedia entry for clinical equipoise). For example, it might seem reasonable, even necessary, to trade off inferior treatment during a clinical trial against future certainty regarding treatment standards. But the notion is a useful one to keep in mind when thinking about reproducibility. And in a paper I would highly recommend, it is argued that over the last 50 years, clinical equipoise has been more or less attained (Djulbegovic et al 2013).

In my own unpublished calculations I estimate that if clinical equipoise holds, then under reasonable assumptions (and assuming that the statistical analysis is carried out correctly) we would have a reproducibility rate of about 80%. This is comparable to more optimistic reproducibility rate estimates reported by other researchers (Klein et al 2014; Gilbert et al 2016; Etz & Vandekerckhove 2016). I also argue in my seminar that the effect prevalence may be much lower than the 50% which defines clinical equipoise, particularly for the studies selected by the Reproducibility Project. Of course, this would reduce the reproducibility rate further still. The bottom line here is that there is no reason to expect the reproducibility rate to be anywhere near 100%.

Is a False Positive Worse Than a False Negative?

The short answer: "No". However, the Reproducibility Project seems to be concerned only with false positives. This ignores a fundamental principle of decision theory, which is that the best decision rule attempts to balance  the cost of a false positive with the cost of a false negative. If a medical diagnostic test results in a false positive, this is certainly stressful, but the mistake will likely be discovered by re-testing or further examination. On the other hand, a false negative results in a missed diagnosis, and is clearly the more serious error.
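The decision-theoretic point can be sketched by treating alpha as a tuning knob and choosing the value that minimizes expected cost. Everything numeric here is a hypothetical assumption: the cost units, the effect prevalence of 0.5, and a true effect that sits 2.8 standard errors from zero (roughly 80% power at alpha = 0.05).

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def type2_error(alpha: float, ncp: float) -> float:
    """beta for a two-sided test when D / se has mean ncp under Ha."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    power = (1 - Z.cdf(z_alpha - ncp)) + Z.cdf(-z_alpha - ncp)
    return 1 - power


def expected_cost(alpha: float, ncp: float, prevalence: float,
                  cost_fp: float, cost_fn: float) -> float:
    """Average cost per study of false positives plus false negatives."""
    false_positive_part = cost_fp * (1 - prevalence) * alpha
    false_negative_part = cost_fn * prevalence * type2_error(alpha, ncp)
    return false_positive_part + false_negative_part


def best_alpha(ncp: float, prevalence: float,
               cost_fp: float, cost_fn: float) -> float:
    """Grid search over alpha in 0.001 .. 0.499 for the cheapest rule."""
    grid = [k / 1000 for k in range(1, 500)]
    return min(grid, key=lambda a: expected_cost(a, ncp, prevalence,
                                                 cost_fp, cost_fn))


# Equal costs, then false negatives ten times as costly as false positives.
print(best_alpha(ncp=2.8, prevalence=0.5, cost_fp=1.0, cost_fn=1.0))
print(best_alpha(ncp=2.8, prevalence=0.5, cost_fp=1.0, cost_fn=10.0))
```

In this toy model, the more a false negative costs relative to a false positive, the larger the cost-minimizing alpha becomes, so a rule tuned only to suppress false positives is not automatically the best one.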


I believe the same holds for experimental science.  We are well equipped to uncover false positives, and the need for validation is widely accepted, and conventional protocols exist for this purpose. A false negative is a potentially important discovery lost to science. This seems to me the more serious error.   

Literature Cited

  • Djulbegovic B et al (2013) Medical research: Trial unpredictability yields predictable therapy gains. Nature, 500, 395-396.

  • Etz A and Vandekerckhove J (2016) A Bayesian perspective on the reproducibility project: Psychology. PLoS One, 11(2), e0149794.

  • Freedman B (1987) Equipoise and the ethics of clinical research. The New England Journal of Medicine, 317, 141-145.

  • Gilbert DT et al (2016) Comment on ‘Estimating the reproducibility of psychological science’, with response by the authors. Science, 351(6277), 1037.

  • Ioannidis JPA (2005) Why most published research findings are false. PLoS Medicine, 2(8), e124.

  • Klein RA et al (2014) Investigating variation in replicability: A "Many Labs" replication project. Social Psychology, 45(3), 142-152.

  • Nosek B et al (2015) Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.



 

On the Reproducibility Crisis: Text

Reproducibility and Statistical Methodology

Seminar delivered at the University of Rochester on October 24th, 2017. See below for a discussion.
