I am a big proponent of reproducible data analysis. In particular, I like when researchers share documents in formats like Sweave and knitr which weave statistical output (e.g., text, tables, and graphs) into a statistical report.

I previously asked for examples of reproducible research of any kind on Stats.SE and for advocating articles in psychology. However, at present I'm particularly interested in complete examples of reproducible meta-analysis.

Meta-analyses involve a number of steps. Summary statistics and study information are extracted from source studies. Various transformations are applied to the data (e.g., corrections for reliability, conversions from one statistic to another). Various models are tested, and tables and graphs are produced. Some journals now require that researchers supply tables of the data used (e.g., references, summary statistics). However, I can see meta-analysis as an area that could benefit from a more comprehensive reproducible approach: (1) it would permit greater inspection of the specific methods used; (2) researchers could more easily build on previous analyses.
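To make two of the transformation steps mentioned above concrete, here is a short Python sketch applying Spearman's classic disattenuation formula and a standard *r*-to-*d* conversion. The function names and example values are illustrative, not drawn from any particular meta-analysis.

```python
import math

def correct_attenuation(r, rel_x, rel_y):
    """Correct an observed correlation for measurement unreliability
    (Spearman's classic disattenuation formula)."""
    return r / math.sqrt(rel_x * rel_y)

def r_to_d(r):
    """Convert a correlation to a standardized mean difference
    (Cohen's d), assuming two equal-sized groups."""
    return 2 * r / math.sqrt(1 - r ** 2)

r_obs = 0.30
r_corrected = correct_attenuation(r_obs, rel_x=0.80, rel_y=0.90)  # ≈ 0.354
d = r_to_d(r_obs)                                                 # ≈ 0.629
```

A fully reproducible meta-analysis would record exactly these steps in code for every extracted effect size, rather than leaving them implicit in a table.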

Thus, my questions:

**Are there any complete examples of reproducible meta-analysis, preferably in psychology or a related discipline?**

**Is there any published advocacy for reproducible meta-analysis?**

I have started to see a fair bit of discussion about reproducible meta-analysis.

Tim Churches seems to have a github repository with a few examples of meta-analyses in R.

See in particular the public health example.

In reference to your first question, there are not a lot of published standards. I suggest some standard guidelines on my site, but those are just that: suggestions. You might also want to check out the Cochrane Collaboration, which has produced a set of standards for health-care-related meta-analysis.

You also may want to check an early-ish study by Shadish et al (1997) as an example:

Shadish, W. R., Matt, G. E., Navarro, A. M., Single, G., Crits-Christoph, P., Hazelrigg, A. J., Lyons, L. C., et al. (1997). Evidence that therapy works in clinically representative conditions. *Journal of Consulting and Clinical Psychology, 65*, 355–365.

## Reproducibility of individual effect sizes in meta-analyses in psychology

To determine the reproducibility of psychological meta-analyses, we investigated whether we could reproduce 500 primary study effect sizes drawn from 33 published meta-analyses based on the information given in the meta-analyses, and whether recomputations of primary study effect sizes altered the overall results of the meta-analysis. Results showed that almost half (k = 224) of all sampled primary effect sizes could not be reproduced based on the reported information in the meta-analysis, mostly because of incomplete or missing information on how effect sizes from primary studies were selected and computed. Overall, this led to small discrepancies in the computation of mean effect sizes, confidence intervals and heterogeneity estimates in 13 out of 33 meta-analyses. We provide recommendations to improve transparency in the reporting of the entire meta-analytic process, including the use of preregistration, data and workflow sharing, and explicit coding practices.

### Conflict of interest statement

I have read the journal's policy and the authors of this manuscript have the following competing interests: Jelte M. Wicherts is a PLOS ONE Editorial Board member. This does not alter the authors’ adherence to PLOS ONE Editorial policies and criteria.

### Figures

Fig 1. Decision tree of primary study effect size recalculation and classification of discrepancy categories.

Fig 2. Scatterplot of 247 original and reproduced standardized mean difference effect sizes from 33 meta-analyses.

Fig 3. Scatterplot of 253 original and reproduced correlation effect sizes from 33 meta-analyses.

Fig 4. Frequencies of reproduced primary study effect sizes with and without errors, per meta-analysis.

Fig 5. Scatterplot of reported and reproduced meta-analytic outcomes for meta-analyses using standardized mean differences.

Fig 6. Scatterplot of reported and reproduced meta-analytic outcomes for meta-analyses using correlations.

## Psychologists Call Out the Study That Called Out the Field of Psychology

Remember that study that found that most psychology studies were wrong? Yeah, that study was wrong. That’s the conclusion of four researchers who recently interrogated the methods of that study, which itself interrogated the methods of 100 psychology studies to find that very few could be replicated. (Whoa.) Their damning commentary will be published Friday in the journal *Science*. (The scientific body that publishes the journal sent *Slate* an early copy.)

In case you missed the hullabaloo: A key feature of the scientific method is that scientific results should be reproducible—that is, if you run an experiment again, you should get the same results. If you don’t, you’ve got a problem. And a problem is exactly what 270 scientists found last August, when they decided to try to reproduce 100 peer-reviewed journal studies in the field of social psychology. Only around 39 percent of the reproduced studies, they found, came up with similar results to the originals.

That meta-analysis, published in *Science* by a group called the Open Science Collaboration, led to mass hand-wringing over the “replicability crisis” in psychology. (It wasn’t the first time that the field has faced such criticism, as Michelle N. Meyer and Christopher Chabris have reported in *Slate*, but this particular study was a doozy.)

Now this new commentary, from Harvard’s Gary King and Daniel Gilbert and the University of Virginia’s Timothy Wilson, finds that the OSC study was bogus—for a dazzling array of reasons. I know you’re busy, so let’s examine just two.

The first—which is what tipped researchers off to the study being not-quite-right in the first place—was statistical. The whole scandal, after all, was over the fact that such a low number of the original 100 studies turned out to be reproducible. But when King, a social scientist and statistician, saw the study, he didn’t think the number looked that low. Yeah, I know, 39 percent sounds really low—but it’s about what social scientists should expect, given the fact that errors could occur either in the original studies or the replicas, says King.

His colleagues agreed, telling him, according to King, “This study is completely unfair—and even irresponsible.”

Upon investigating the study further, the researchers identified a second and more crucial problem. Basically, the OSC researchers did a terrible job replicating those 100 studies in the first place. As King put it: “You’d think that a test about replications would actually reproduce the original studies.” But no! Some of the methods used for the reproduced studies were utterly confounding—for instance, OSC researchers tried to reproduce an American study that dealt with Stanford University students’ attitudes toward affirmative action policies by using Dutch students at the University of Amsterdam. Others simply didn’t use enough subjects to be reliable.

The new analysis “completely repudiates” the idea that the OSC study provides evidence for a crisis in psychology, says King. Of course, that doesn’t mean we shouldn’t be concerned with reproducibility in science. “We should be obsessed with these questions,” says King. “They are incredibly important. But it isn’t true that all social psychologists are making stuff up.”

After all, King points out, the OSC researchers used admirable, transparent methods to come to their own—ultimately wrong—conclusions. Specifically, those authors made all their data easily accessible and clearly explained their methods—making it all the easier for King and his co-authors to tear it apart. The OSC researchers also read early drafts of the new commentary, helpfully adding notes and clarifications where needed. “Without that, we wouldn’t have been able to write our article,” says King. Now that’s collaboration!

“We look forward to the next article that tries to conclude that we’re wrong,” he adds.

**Update, March 4, 2016, 8:00 a.m.:** We reached University of Virginia psychologist Brian Nosek, an author of the original study and executive director of the Center for Open Science. Nosek and his co-authors have issued a rebuttal to the Gilbert et al. commentary, which appears alongside it in *Science*. Nosek was also a reviewer of the Gilbert commentary.

The two papers agree on one thing, says Nosek: that the original study found a 40 percent reproducibility rate. But they differ on what to make of that rate. The original authors sought to introduce that number as a starting point, rather than to characterize it as high or low. “The whole goal of this is to stimulate some real learning about reproducibility, because so far it’s all been speculation,” says Nosek. “This is the first time we’ve had some real data.”

By contrast, the Gilbert paper took that data and then “jumped to a conclusion, based on selective exploratory evidence,” in Nosek’s words. The Gilbert paper attributed the reproducibility rate being as low as it was—in the authors’ characterization—partly to the replica studies being poor reproductions of the originals. “They’ve generated one hypothesis,” says Nosek. “It is an optimistic assessment.”

For example, the Gilbert commentary finds that replica studies were four times as likely to generate similar results in cases where the original researchers endorsed the replicas. Their conclusion: Many of the replicated studies were faulty. That isn’t necessarily true, says Nosek. “The other reason is that researchers who don’t really believe that their effect is robust may be less likely to endorse designs because they don’t have as much faith in their conclusions,” he says.

In addition, the response by Nosek and his co-authors points out several further issues with the Gilbert commentary.

Nosek adds that he is “very pleased that both the comment and response have been published.”

**Update, March 4, 2016:** The original image in this post has been removed because it was unrelated to the post.

## Hybrid method

Like fixed-effect meta-analysis, the hybrid method estimates the common effect size of an original study and replication. By taking into account that the original study is statistically significant, the proposed hybrid method corrects for the likely overestimation in the effect size of the original study. The hybrid method is based on the statistical principle that the distribution of *p* values at the true effect size is uniform. A special case of this statistical principle is that the *p* values are uniformly distributed under the null hypothesis (e.g., Hung, O’Neill, Bauer, & Köhne, 1997). This principle also underlies the recently developed meta-analytic techniques *p*-uniform (van Aert, Wicherts, & van Assen, 2016; van Assen et al., 2015) and *p*-curve (Simonsohn, Nelson, & Simmons, 2014a, b). These methods discard statistically nonsignificant effect sizes, and only use the statistically significant effect sizes in a meta-analysis to examine publication bias. *P*-uniform and *p*-curve correct for publication bias by computing probabilities of observing a study’s effect size conditional on the effect size being statistically significant. The effect size estimate of *p*-uniform and *p*-curve equals that effect size for which the distribution of these conditional probabilities is best approximated by a uniform distribution. Both methods yield accurate effect size estimates in the presence of publication bias if heterogeneity in true effect size is at most moderate (Simonsohn et al., 2014a; van Aert et al., 2016, 2015). In contrast to *p*-uniform and *p*-curve, which assume that all included studies are statistically significant, only the original study is assumed to be statistically significant in the hybrid method. This assumption hardly restricts the applicability of the hybrid method, since approximately 95% of the published psychological research contains statistically significant results (Fanelli, 2012; Sterling et al., 1995).

To deal with bias in the original study, its *p* value is transformed by computing the probability of observing its effect size or larger, conditional on the effect size being statistically significant, at the population effect size (*θ*). Footnote 2 This can be written as

$$ q_O = \frac{P(y \ge y_O;\ \theta)}{P(y \ge y_{cv};\ \theta)} = \frac{1 - \Phi\!\left(\frac{y_O - \theta}{\sigma_O}\right)}{1 - \Phi\!\left(\frac{y_{cv} - \theta}{\sigma_O}\right)}, $$

where the numerator refers to the probability of observing a larger effect size than that of the original study (*y*_O) at effect size *θ*, and the denominator denotes the probability of observing an effect size larger than the critical value *y*_cv at effect size *θ*; here Φ is the standard normal distribution function and σ_O is the standard error of *y*_O. Note that *y*_O exceeds *y*_cv whenever the original study is statistically significant. The conditional probability *q*_O is uniformly distributed at the true effect size *θ*; this is the same principle used by *p*-uniform for estimation and testing for an effect while correcting for publication bias (van Aert et al., 2016, 2015). The replication is not assumed to be statistically significant, so we compute the (unconditional) probability of observing a larger effect size than that of the replication (*q*_R) at effect size *θ*,

$$ q_R = P(y \ge y_R;\ \theta) = 1 - \Phi\!\left(\frac{y_R - \theta}{\sigma_R}\right), $$

with the observed effect size of the replication denoted by *y* _{R}. Both *q* _{O} and *q* _{R} are calculated under the assumption that the sampling distributions of *y* _{O} and *y* _{R} are normally distributed, which is the common assumption in meta-analysis (Raudenbush, 2009).
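The two probabilities can be computed directly from the normal sampling distributions just described. The following Python sketch uses the standard library's `NormalDist`; the function names are ours, and a two-tailed test with *α* = .05 is assumed for the original study, so the denominator of *q*_O is the probability beyond the one-tailed critical value at *α*/2.

```python
from statistics import NormalDist

def q_original(y_o, se_o, theta, alpha=0.05):
    """Probability of exceeding y_o conditional on statistical significance
    (two-tailed test at level alpha), evaluated at effect size theta."""
    y_cv = NormalDist().inv_cdf(1 - alpha / 2) * se_o  # critical effect size
    num = 1 - NormalDist(theta, se_o).cdf(y_o)   # P(y > y_o; theta)
    den = 1 - NormalDist(theta, se_o).cdf(y_cv)  # P(y > y_cv; theta)
    return num / den

def q_replication(y_r, se_r, theta):
    """Unconditional probability of exceeding y_r at effect size theta."""
    return 1 - NormalDist(theta, se_r).cdf(y_r)
```

For an original study with a one-tailed *p* value of .015, `q_original` at *θ* = 0 returns .015/.025 = 0.6, matching the worked example discussed below.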

Testing of H_{0}: *θ* = 0 and estimation are based on the principle that each (conditional) probability is uniformly distributed at the true value of *θ*. Different methods exist for testing whether a distribution deviates from a uniform distribution. The hybrid method uses the distribution of the sum of independent uniformly distributed random variables (i.e., the Irwin–Hall distribution), Footnote 3 *x* = *q*_O + *q*_R, because this method is intuitive, showed good statistical properties in the context of *p*-uniform, and can also be used for estimating a confidence interval (van Aert et al., 2016). The probability density function of the Irwin–Hall distribution for *x* based on two studies is

$$ f(x) = \begin{cases} x, & 0 \le x \le 1 \\ 2 - x, & 1 < x \le 2, \end{cases} $$

and its cumulative distribution function is

$$ F(x) = \begin{cases} \tfrac{1}{2}x^2, & 0 \le x \le 1 \\ 2x - \tfrac{1}{2}x^2 - 1, & 1 < x \le 2. \end{cases} $$

Two-tailed *p* values of the hybrid method can be obtained with *G*(*x*),

$$ G(x) = 2 \min\{F(x),\ 1 - F(x)\}. $$

The null hypothesis H_{0}: *θ* = 0 is rejected if *F*(*x* | *θ* = 0) ≤ .05 in the case of a one-tailed test, and if *G*(*x* | *θ* = 0) ≤ .05 in the case of a two-tailed test. The 2.5th and 5th percentiles of the Irwin–Hall distribution are 0.224 and 0.316, respectively. The effect size *θ* is estimated as the value ( \widehat{\theta} ) for which *F*(*x* | *θ* = ( \widehat{\theta} )) = .5, or equivalently, for which *x* = 1. The bounds of the 95% confidence interval of *θ* are those values ( \widehat{\theta}_{LB} ) and ( \widehat{\theta}_{UB} ) for which *F*(*x* | *θ* = ( \widehat{\theta}_{LB} )) = .025 and *F*(*x* | *θ* = ( \widehat{\theta}_{UB} )) = .975.
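The Irwin–Hall pieces translate directly into code. This Python sketch (the helper names are ours) implements the two-study density, the distribution function *F*, and the two-tailed *p* value *G*, and reproduces the percentiles quoted in the text.

```python
def irwin_hall_pdf(x):
    """Density of the sum of two independent U(0,1) variables, 0 <= x <= 2."""
    return x if 0 <= x <= 1 else 2 - x

def irwin_hall_cdf(x):
    """Distribution function F of the sum of two independent U(0,1) variables."""
    return x * x / 2 if x <= 1 else 2 * x - x * x / 2 - 1

def two_tailed_p(x):
    """Two-tailed p value G(x) of the hybrid method."""
    f = irwin_hall_cdf(x)
    return 2 * min(f, 1 - f)

# The percentiles quoted in the text: F(0.224) ≈ .025 and F(0.316) ≈ .05
```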

We will now apply the hybrid method to the example presented in the introduction. The effect size measure in the example is Hedges’ *g*, but the hybrid method can also be applied to an original study and replication in which another effect size measure (e.g., the correlation coefficient) is computed. Figure 1 illustrates the computation of *q*_O and *q*_R for *θ* = 0 (Fig. 1a) and for *θ* = ( \widehat{\theta} ) (Fig. 1b), based on the example presented in the introduction. The steepest distribution in both panels refers to the effect size distribution of the replication, which has the largest sample size. The conditional probability *q*_O for *θ* = 0 (Fig. 1a) equals the area beyond *y*_O (dark gray) divided by the area beyond the critical value *y*_cv: *q*_O = 0.015/0.025 = 0.6. The probability *q*_R equals the one-tailed *p* value (.3/2 = .15) and is indicated by the light gray area. Footnote 4 Summing these two probabilities gives *x* = .75, which is lower than the expected value of the Irwin–Hall distribution, suggesting that the effect size exceeds 0. The null hypothesis of no effect is not rejected, with a two-tailed *p* value of .558 as calculated with *G*(*x*). Shifting *θ* to hybrid’s estimate of 0.103 yields *x* = 1, as depicted in Fig. 1b, with *q*_O = .655 and *q*_R = .345. Estimates of the lower and upper bounds of the 95% confidence interval are obtained by shifting ( \widehat{\theta} ) until *x* equals the 2.5th and 97.5th percentiles, respectively. The confidence interval of the hybrid method for the example ranges from −1.109 to 0.428.

Fig. 1 Effect size distributions of the original study and replication for the example presented in the introduction. Panels a and b refer to the effect size distributions for *θ* = 0 and *θ* = 0.103, respectively. *y*_O and *y*_R denote the observed effect sizes in the original study and replication, and *y*_cv the critical value under H_{0}: *θ* = 0 with *α* = .05. The shaded regions refer to the probabilities of observing effect sizes larger than *y*_R, *y*_O, and *y*_cv; these probabilities determine *q*_O and *q*_R, and their sum is denoted by *x*.
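Putting the pieces together, hybrid's point estimate can be found numerically as the *θ* for which *x* = 1. The Python sketch below is our own illustration with made-up effect sizes and standard errors (not the Hedges' *g* values of the paper's example); it relies on *x* being increasing in *θ* and solves by bisection.

```python
from statistics import NormalDist

def x_stat(theta, y_o, se_o, y_r, se_r, alpha=0.05):
    """x = q_O + q_R at effect size theta (two-tailed test for the original)."""
    y_cv = NormalDist().inv_cdf(1 - alpha / 2) * se_o
    q_o = ((1 - NormalDist(theta, se_o).cdf(y_o))
           / (1 - NormalDist(theta, se_o).cdf(y_cv)))
    q_r = 1 - NormalDist(theta, se_r).cdf(y_r)
    return q_o + q_r

def hybrid_estimate(y_o, se_o, y_r, se_r, lo=-5.0, hi=5.0):
    """Bisection for the theta with x(theta) = 1; x increases in theta."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if x_stat(mid, y_o, se_o, y_r, se_r) < 1:
            lo = mid  # x too small: the estimate lies above mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Confidence-interval bounds follow the same pattern, with the bisection target set to the 2.5th and 97.5th percentiles of the Irwin–Hall distribution instead of *x* = 1.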

The results of applying fixed-effect meta-analysis and the hybrid method to the example are summarized in Table 1. The original study suggests that the effect size is medium and statistically significantly different from zero (first row), but the effect size in the replication is small at best and not statistically significant (second row). Fixed-effect meta-analysis (third row) is usually seen as the best estimator of the true effect size in the population and suggests that the effect size is small to medium (0.270) and statistically significant (*p* = .0375). However, the hybrid’s estimate is small (0.103) and not statistically significant (*p* = .558) (fourth row). Hybrid’s estimate is lower than the estimate of fixed-effect meta-analysis because it corrects for the original study being statistically significant. Hybrid’s estimate is even lower than the estimate of the replication because, when taking the significance of the original study into account, the original study suggests a zero or even negative effect, which pulls the estimate to zero.
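The fixed-effect estimate in the third row of Table 1 is the standard inverse-variance weighted average of the two study estimates. A minimal Python sketch, with illustrative numbers rather than the table's:

```python
from statistics import NormalDist

def fixed_effect(estimates, variances):
    """Inverse-variance weighted fixed-effect estimate, its standard error,
    and a two-tailed z-test p value."""
    weights = [1 / v for v in variances]
    est = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    se = (1 / sum(weights)) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(est / se)))
    return est, se, p

# Hypothetical original (0.50, variance .06) and replication (0.10, variance .02):
est, se, p = fixed_effect([0.5, 0.1], [0.06, 0.02])
```

Because the weights are inverse variances, the larger replication dominates; unlike hybrid, nothing here corrects for the original study having been selected for significance.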

Van Aert et al. (2016) showed that not only the lower bound of a 95% confidence interval, but also the effect sizes estimated by *p*-uniform can become highly negative if the effect size is estimated on the basis of a single study and its *p* value is close to the alpha level. Footnote 5 The effect size estimates can be highly negative because conditional probabilities such as *q*_O are not sensitive to changes in *θ* when the (unconditional) *p* value is close to alpha. Applying *p*-uniform to a single study in which a one-tailed test is conducted with *α* = .05 yields an effect size estimate equal to zero if the *p* value is .025, a positive estimate if the *p* value is smaller than .025, a negative estimate if the *p* value is larger than .025, and a highly negative estimate if the *p* value is close to .05. Van Aert et al. (2016) recommended setting the effect size estimate equal to zero if the mean of the primary studies’ *p* values is larger than half the *α* level, because *p*-uniform’s effect size estimate will then be below zero. Setting the effect size to 0 is analogous to testing a one-tailed null hypothesis in which the observed effect size is in the opposite direction from the one expected. Computing a test statistic and *p* value is redundant in such a situation, because the test statistic will be negative and the one-tailed *p* value will be above .5.

The hybrid method can also yield highly negative effect size estimates because, like *p*-uniform, it uses a conditional probability for the original study’s effect size. In line with the proposal in van Aert et al. (2016), we developed two alternative hybrid methods, hybrid0 and hybridR, to avoid highly negative estimates. The hybrid0 method is a direct application of van Aert et al.’s recommendation for *p*-uniform: the effect size estimate is set to 0 if the studies’ combined evidence points to a negative effect. For hybrid0, this translates to setting the effect size estimate equal to 0 if *x* > 1 under the null hypothesis, and equal to hybrid’s estimate otherwise. Consequently, hybrid0 will, in contrast to hybrid, never yield an effect size estimate below zero. Applied to the example, hybrid0 equals hybrid’s estimate because *x* = 0.75 under the null hypothesis.

The other alternative hybrid method, hybridR (where the R refers to *replication*), addresses the problem of highly negative estimates in a different way. The estimate of hybridR is equal to hybrid’s estimate if the original study’s two-tailed *p* value is smaller than .025, and is equal to the effect size estimate of the replication if the original study’s two-tailed *p* value is larger than .025. A two-tailed *p* value of .025 in the original study is used as the cutoff because larger values result in a negative effect size estimate, which is in line with neither the theoretical expectation nor the observed effect size in the original study. Hence, if the original study’s just statistically significant effect size (i.e., .025 < *p* < .05) points to a negative effect, the evidence of the original study is discarded and only the results of the replication are interpreted. The estimate of hybridR (and also of hybrid) is not restricted to be in the same direction as the original study, as is the case for hybrid0. The results of applying hybridR to the example are presented in the last row of Table 1. HybridR only uses the observed effect size in the replication—because the *p* value in the original study, .03, exceeds .025—and hence yields the same results as the replication study, as reported in the second row.
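The decision rules for the two variants reduce to a few lines of code. In this Python sketch the inputs (hybrid's estimate, *x* under the null, the replication's estimate, and the original study's two-tailed *p* value) are assumed to have been computed already; the function names are ours.

```python
def hybrid0_estimate(hybrid_est, x_at_null):
    """hybrid0: set the estimate to 0 when x > 1 under H0 (combined
    evidence points to a negative effect); otherwise keep hybrid's estimate."""
    return 0.0 if x_at_null > 1 else hybrid_est

def hybridR_estimate(hybrid_est, replication_est, p_original):
    """hybridR: keep hybrid's estimate when the original study's two-tailed
    p value is below .025; otherwise use the replication's estimate alone."""
    return hybrid_est if p_original < 0.025 else replication_est

# In the worked example, x = 0.75 under H0 and the original p value is .03,
# so hybrid0 keeps hybrid's estimate while hybridR falls back on the replication.
```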

Since all of the discussed methods may yield different results, it is important to examine their statistical properties. The next section describes the performance of the methods evaluated using an analytical approximation of these methods’ results.

## The Poldrack Lab at Stanford

The Poldrack Lab is based in the Department of Psychology at Stanford University. Our lab uses the tools of cognitive neuroscience to understand how decision making, executive control, and learning and memory are implemented in the human brain.

We also develop neuroinformatics tools and resources to help researchers make better sense of data. Projects we are involved with include:

- developing new tools to enhance reproducibility of neuroscience research
- the new data analysis/sharing platform developed by the Center for Reproducible Neuroscience
- a knowledge base for cognitive neuroscience
- a data sharing project for raw fMRI data
- a data sharing project for statistical maps
- a meta-analysis project using text mining
- a new standard for organization of brain imaging data

Studies in our laboratory focus on healthy individuals, but we collaborate with other groups who are interested in studying neuropsychiatric disorders, including schizophrenia, bipolar disorder, ADHD, and drug addiction.

Our work is generously funded by the National Institutes of Health, National Science Foundation, Office of Naval Research, and James S. McDonnell Foundation.

If you are interested in graduate study in the Poldrack lab, you can find more information here.

## Data Accessibility Statement

This study was preregistered with the Open Science Framework (OSF) at https://osf.io/fkmx7/. The preregistration adheres to the disclosure requirements of OSF. The preregistration and all data in the manuscript are available at this link.

### Appendix

**Sample data-sharing request template**

- Project title
- Principal investigator and institutional information
- Date
- Requester/institutional affiliation
- Data requested. Be as specific as possible (e.g., which interview questions or self-report scales?)
- Purpose of request. What will the data be used for? Describe the study aims, hypotheses, and what role the data will play.
- Data storage and dissemination. Where and how will the data be stored? Who will have access? Will the data be shared with any collaborators? How do you intend to publish the data?
- Protecting participant confidentiality. How will risks to participant confidentiality be managed? Will any special precautions be taken?


The terms “replicability” and “reproducibility” have generated much definitional debate. In the current paper, we use the broad term “reproducibility” following Goodman et al. (2016) and consider replication studies as a specific, narrower topic (analogous to “results reproducibility” in Goodman’s terminology).

The status and nature of psychology as a science has been the focus of entire literatures (e.g., Fanelli, 2010; Ferguson, 2015; Hatfield, 2002; Lilienfeld, 2012; Meehl, 1978, 1986; Pérez-Álvarez, 2018). For the purpose of this article, we assume psychology is an empirical science of some kind(s) and attempt to synthesize perspectives of psychology as natural science and psychology as a human science.

*Grounded theory* is a qualitative data analysis technique in which researchers begin with no a priori hypotheses about the construct of interest, and instead use induction to generate theory directly from the data collected. For a description of grounded theory methodology, see for example Charmaz and Henwood (2017).

## Over half of psychology studies fail reproducibility test

Largest replication study to date casts doubt on many published positive results.

Don’t trust everything you read in the psychology literature. In fact, two thirds of it should probably be distrusted.

In the biggest project of its kind, Brian Nosek, a social psychologist and head of the Center for Open Science in Charlottesville, Virginia, and 269 co-authors repeated work reported in 98 original papers from three psychology journals, to see if they independently came up with the same results.

The studies they took on ranged from whether expressing insecurities perpetuates them to differences in how children and adults respond to fear stimuli, to effective ways to teach arithmetic.

According to the replicators' qualitative assessments, as previously reported by *Nature*, only 39 of the 100 replication attempts were successful. (There were 100 completed replication attempts on the 98 papers, as in two cases replication efforts were duplicated by separate teams.) But whether a replication attempt is considered successful is not straightforward. Today in *Science*, the team report the multiple different measures they used to answer this question [1].

The 39% figure derives from the team's subjective assessments of success or failure (see graphic, 'Reliability test'). Another method assessed whether a statistically significant effect could be found, and produced an even bleaker result. Whereas 97% of the original studies found a significant effect, only 36% of replication studies found significant results. The team also found that the average size of the effects found in the replicated studies was only half that reported in the original studies.

There is no way of knowing whether any individual paper is true or false from this work, says Nosek. Either the original or the replication work could be flawed, or crucial differences between the two might be unappreciated. Overall, however, the project points to widespread publication of work that does not stand up to scrutiny.

Although Nosek is quick to say that most resources should be funnelled towards new research, he suggests that a mere 3% of scientific funding devoted to replication could make a big difference. The current amount, he says, is near-zero.

The work is part of the Reproducibility Project, launched in 2011 amid high-profile reports of fraud and faulty statistical analysis that led to an identity crisis in psychology.

John Ioannidis, an epidemiologist at Stanford University in California, says that the true replication-failure rate could exceed 80%, even higher than Nosek's study suggests. This is because the Reproducibility Project targeted work in highly respected journals, the original scientists worked closely with the replicators, and replicating teams generally opted for papers employing relatively easy methods — all things that should have made replication easier.

But, he adds, “We can really use it to improve the situation rather than just lament the situation. The mere fact that that collaboration happened at such a large scale suggests that scientists are willing to move in the direction of improving.”

The work published in *Science* is different from previous papers on replication because the team actually replicated such a large swathe of experiments, says Andrew Gelman, a statistician at Columbia University in New York. In the past, some researchers dismissed indications of widespread problems because they involved small replication efforts or were based on statistical simulations.

But they will have a harder time shrugging off the latest study, says Gelman. “This is empirical evidence, not a theoretical argument. The value of this project is that hopefully people will be less confident about their claims.”

The point, says Nosek, is not to critique individual papers but to gauge just how much bias drives publication in psychology. For instance, boring but accurate studies may never get published, or researchers may achieve intriguing results less by documenting true effects than by hitting the statistical jackpot: finding a significant result by sheer luck or trying various analytical methods until something pans out.

Nosek believes that other scientific fields are likely to have much in common with psychology. One analysis found that only 6 of 53 high-profile papers in cancer biology could be reproduced [2], and a related reproducibility project in cancer biology is currently under way. The incentives to find results worthy of high-profile publications are very strong in all fields, and can spur people to lose objectivity. “If this occurs on a broad scale, then the published literature may be more beautiful than reality,” says Nosek.

The results published today should spark a broader debate about optimal scientific practice and publishing, says Betsy Levy Paluck, a social psychologist at Princeton University in New Jersey. “It says we don't know the balance between innovation and replication.”

The fact that the study was published in a prestigious journal will encourage further scholarship, she says, and shows that now “replication is being promoted as a responsible and interesting line of enquiry”.

## Meta-Analysis Rationale and Procedures

As in any scientific field, social psychology makes progress by judging the evidence that has accumulated. Consequently, literature reviews of studies can be extremely influential, particularly when meta-analysis is used to review them. In the past three decades, the scholarly community has embraced the position that reviewing is itself a scientific method with identifiable steps that should be followed to be most accurate and valid.

At the outset, an analyst carefully defines the variables at the center of the phenomenon and considers the history of the research problem and of typical studies in the literature. Usually, the research problem will be defined as a relation between two variables, such as the influence of an independent variable on a dependent variable. For example, a review might consider the extent to which women use a more relationship-oriented leadership style compared with men. Typically, the analyst will also consider what circumstances may change the relation in question. For example, an analyst might predict that women will lead in a style that is more relationship-oriented than men and that this tendency will be especially present when studies examine leadership roles that are communal in nature (e.g., nurse supervisor, elementary principal).

In the next step, analysts must take great care to decide which studies belong in the meta-analysis, because any conclusions the meta-analysis might reach are limited by the methods of the studies in the sample. As a rule, meta-analyses profit by focusing on the studies that use stronger methods, although which particular methods are “stronger” might vary from area to area. Whereas laboratory-based research (e.g., social perception, persuasion) tends to value internal validity more than external validity, field-based research (e.g., leadership style, political attitudes) tends to reverse these values.

Ideally, a meta-analysis will locate every study ever conducted on a subject. Yet, for some topics, the task can be quite daunting because of the sheer number of studies available. As merely one example, in their 1978 meta-analysis, Robert Rosenthal and Donald B. Rubin reported on 345 studies of the experimenter expectancy effect. It is important to locate as many studies as possible that might be suitable for inclusion using as many techniques as possible (e.g., computer and Internet searches, e-mails to active researchers, consulting reference lists, manual searching of related journals). If there are too many studies to include all, the analyst might randomly sample from the studies or, more commonly, narrow the focus to a meaningful subliterature.

Once the sample of studies is in hand, each study is coded for relevant dimensions that might have affected the study outcomes. To permit reliability statistics, two or more coders must do this coding. In some cases, an analyst might ask experts to judge methods used in the studies on particular dimensions (e.g., the extent to which a measure of leadership style is relationship-oriented). In other cases, an analyst might ask people with no training for their views about aspects of the reviewed studies (e.g., the extent to which leadership roles were communal).
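
The agreement between two coders can be quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal illustration using hypothetical codes for a communal-role judgment; the data and category labels are invented for the example.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: agreement between two coders, corrected for chance."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement by chance, from each coder's marginal frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical: two coders judging whether each of 10 leadership roles is communal
a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no", "no"]
b = ["yes", "yes", "no", "yes", "yes", "no", "yes", "no", "no", "no"]
kappa = cohens_kappa(a, b)
```

Values near 1 indicate strong agreement beyond chance; low values signal that the coding scheme or coder training needs revision before analysis proceeds.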

To be included in a meta-analysis, a study must offer some minimal quantitative information that addresses the relation between the variables (e.g., means and standard deviations for the compared groups, F-tests, t-tests). Standing alone, these statistical tests would reveal little about the phenomenon.

Once the tests are expressed in a single standardized metric, the effect size, the picture typically clarifies dramatically. The most common effect sizes are d (the standardized mean difference between two groups) and r (the correlation coefficient gauging the association between two variables). Each effect size receives a positive or negative sign to indicate the direction of the effect. As an example, a 1990 meta-analysis that Blair T. Johnson and Alice H. Eagly conducted to examine gender differences in leadership style defined effect sizes in such a way that positive signs were stereotypic (e.g., women more relationship-oriented) and negative signs were counterstereotypic (e.g., men more relationship-oriented). Typically, d is used for comparisons of two groups or groupings (e.g., gender differences in leadership style) and r for continuous variables (e.g., self-esteem and attractiveness).
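
The two metrics can be computed and interconverted from routinely reported summary statistics. Below is a minimal sketch: d from group means and standard deviations via the pooled SD, and the standard d-to-r conversion for equal group sizes. The study values are hypothetical.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference d, using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def d_to_r(d):
    """Convert d to a point-biserial r (assumes roughly equal group sizes)."""
    return d / math.sqrt(d**2 + 4)

# Hypothetical study: women vs. men on a relationship-orientation scale,
# signed so that a positive d is in the stereotypic direction
d = cohens_d(4.2, 1.0, 30, 3.8, 1.0, 30)
r = d_to_r(d)
```

Keeping the sign convention consistent across all studies is essential, since the direction of each effect carries the substantive meaning.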

Then, the reviewer analyzes the effect sizes, first examining the mean effect size to evaluate its magnitude, direction, and significance. More advanced analyses examine whether differing study methods change, or moderate, the magnitude of the effect sizes. In all of these analyses, sophisticated statistics help show whether the studies’ effect sizes consistently agree with the general tendencies. Still other techniques help reveal which particular studies’ findings differed most widely from the others, or examine the plausibility of a publication bias in the literature. Inspection for publication bias can be especially important when skepticism exists about whether the phenomenon under investigation is genuine. In such cases, published studies might be more likely to find a pattern than would unpublished studies. For example, many doubt the existence of so-called psi effects, which refer to “mind reading.” Any review of studies testing for the existence of psi would have to be sensitive to the possibility that journals may tend to accept confirmations of the phenomenon more than disconfirmations of it.
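
The core of this analysis, computing a weighted mean effect and checking whether the studies agree with it, can be sketched with a fixed-effect (inverse-variance) model and Cochran's Q heterogeneity statistic. This is one common approach among several; the effect sizes and variances below are hypothetical.

```python
import math

def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted mean effect, its standard error, and
    Cochran's Q (large Q relative to k - 1 suggests heterogeneity)."""
    weights = [1 / v for v in variances]
    mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))
    return mean, se, q

effects = [0.30, 0.45, 0.12, 0.50]    # ds from four hypothetical studies
variances = [0.04, 0.05, 0.03, 0.06]  # their sampling variances
mean, se, q = fixed_effect_meta(effects, variances)
```

When Q is large relative to its degrees of freedom (k − 1), the analyst typically moves to moderator analyses or a random-effects model rather than reporting the single mean.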

Various strategies are available to detect the presence of publication bias. As an example, Rosenthal and Rubin’s fail-safe N estimates the number of unretrieved studies averaging null results that would have to exist to reduce a significant mean effect size to nonsignificance. If the number is large, it is intuitively implausible that publication bias accounts for the result. Other, more sophisticated techniques permit reviewers to infer what effect size values non-included studies might take and how the inclusion of such values might affect the mean effect size. The detection of publication bias is especially important when the goal of the meta-analytic review is to examine the statistical significance or the simple magnitude of a phenomenon. Publication bias is a far less pressing concern when the goal of the review is instead to examine how study dimensions explain when the studies’ effect sizes are larger or smaller or when they reverse in their signs. Indeed, the mere presence of wide variation in the magnitude of effect sizes often suggests a lack of publication bias.
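
Rosenthal's fail-safe N has a simple closed form based on the studies' combined z scores. A minimal sketch, using hypothetical z scores and the conventional one-tailed p = .05 criterion (z = 1.645):

```python
def fail_safe_n(z_scores, z_crit=1.645):
    """Rosenthal's fail-safe N: how many unretrieved studies averaging
    z = 0 would reduce the combined result to nonsignificance."""
    k = len(z_scores)
    z_sum = sum(z_scores)
    return (z_sum**2 / z_crit**2) - k

zs = [2.1, 1.8, 2.5, 1.2, 2.9]  # z scores from five hypothetical studies
n = fail_safe_n(zs)             # roughly 36 hidden null studies needed
```

A common rule of thumb (5k + 10) is then used to judge whether the resulting number is large enough to consider the finding robust to the "file drawer."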

Interpretation and presentation of the meta-analytic findings is the final step of the process. One consideration is the magnitude of the mean effect sizes in the review. In 1969, Jacob Cohen informally analyzed the magnitude of effects commonly yielded by psychological research and offered guidelines for judging effect size magnitude. Table 1 shows these standards for d, r, and r²; the latter statistic indicates the extent to which one variable explains variation in the other. To illustrate, a small effect size (d = 0.20) is the difference in height between 15- and 16-year-old girls, a medium effect (d = 0.50) is the difference in intelligence scores between clerical and semiskilled workers, and a large effect (d = 0.80) is the difference in intelligence scores between college professors and college freshmen. It is important to recognize that quantitative magnitude is only one way to interpret effect size.

Even very small mean effect sizes can be of great import for practical or applied contexts. In a close race for political office, for example, even a mass media campaign with a small effect size could reverse the outcome.

Ideally, meta-analyses advance knowledge about a phenomenon not only by showing the size of the typical effect but also by showing when the studies get larger or smaller effects, or by showing when effects reverse in direction. At their best, meta-analyses test theories about the phenomenon. For example, Johnson and Eagly’s meta-analysis of gender differences in leadership style showed, consistent with their social-role theory hypothesis, that women had more relationship-oriented styles than men did, especially when the leadership role was communal in nature.

Meta-analyses provide an empirical history of past research and suggest promising directions for future research. As a consequence of a carefully conducted meta-analysis, primary-level studies can be designed with the complete literature in mind and therefore have a better chance of contributing new knowledge. In this way, science can advance the most efficiently to produce new knowledge.

## Reproducibility Project

The **Reproducibility Project: Psychology** was a crowdsourced collaboration of 270 contributing authors to repeat 100 published experimental and correlational psychological studies. [1] The project was led by the Center for Open Science and its co-founder, Brian Nosek, who started it in November 2011; the results were published in August 2015. Reproducibility is the ability to produce a copy or duplicate; in this case, it is the ability to replicate the results of the original studies. The project illustrated the growing problem of failed reproducibility in social science, and it has started a movement that has spread through the scientific world, with expanded testing of the reproducibility of published works. [2]

## The Advantages of Meta-Analysis

Meta-analysis is an excellent way of simplifying the complexity of research. A single research team can reasonably only output so much data in a given time, but meta-analysis gives researchers access to more data than any one team could produce in a lifetime and allows them to condense it in useful ways. Advances in computational power and database software have made the process even easier.

For rare medical conditions, meta-analysis allows researchers to collect data from further afield than would be possible for one research group. This allows them to conduct meaningful statistical analyses when a small local sample would have told them nothing about the disease.

When professionals working in parallel can upload their results and access all known data on a topic, there is a built-in quality control. The effects of error or bias in individual studies are kept in check. Meta-analysis also helps avoid unnecessary duplication of research and allows researchers to pool resources and compare methods. As papers can often take many months to be physically published, instant computer records ensure that other researchers are always aware of the latest work and results in the field.

A meta-study casts a much wider net than a traditional literature review, and is excellent for highlighting correlations and links between studies that may not be readily apparent, while guarding against the compiler subconsciously inferring correlations that do not exist. Perhaps best of all, meta-studies are economical and allow research funds to be diverted elsewhere.