Response to Rossiter on Replication
Ray Hubbard reviews the reliability of a result's replicability
By Ray Hubbard
Drake University, Des Moines, Iowa
John Rossiter claims that the p value in an original study already contains the information needed about its likely replicability, so that "unless the p value is marginal [he provides no definition of this term] there is no need to replicate." He then cites his 2003 JBR article as justification for this claim. That article's citations make clear that Rossiter relied heavily on Nickerson’s (2000) Psychological Methods paper (generally excellent, I agree), especially the subsection "Belief That a Small p Is Evidence That the Results Are Replicable" (pp. 256-7). In this subsection, Nickerson cites several works showing this belief to be widespread among psychologists (so Rossiter has lots of company). Some of the studies listed by Nickerson reveal a popular tendency to misinterpret the complement of the p value as a direct index of replicability. One, by Oakes (1986, p. 80), offered empirical evidence on this issue. Specifically, Oakes found that 60% of a sample of British academic psychologists believed that an experimental outcome that is significant at the 0.01 level has a 0.99 probability of being statistically significant if the study were replicated. (This finding was reported in Hubbard and Armstrong, 1994, p. 240.) More recent empirical evidence from a survey of German psychologists (Gigerenzer, Krauss and Vitouch, 2004, p. 395) indicates a substantial and continuing tendency to treat this 1-p criterion as the probability of a replication "success." The percentage doing so was reported for three groups: students who had passed at least one statistics course in which significance testing was taught (41%), faculty not teaching statistics (49%), and, most disturbing of all, faculty teaching statistics (37%). But the view that the 1-p criterion gives the probability of a replication "success" is incorrect.
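To see why the 1-p reading fails, here is a minimal sketch of one common back-of-the-envelope calculation. It rests on assumptions that are not part of the argument above: the true effect is taken to equal the effect observed in the original study, the replication uses the same sample size, and the test statistic is approximately normal. The function name `replication_probability` is hypothetical.

```python
from statistics import NormalDist


def replication_probability(p_original: float, alpha: float) -> float:
    """Chance that a same-sized replication is significant at `alpha`.

    Assumption (for illustration only): the true effect equals the effect
    observed in the original study, and the test statistic is normal.
    """
    std = NormalDist()
    z_obs = std.inv_cdf(1 - p_original / 2)  # z implied by the original two-sided p
    z_crit = std.inv_cdf(1 - alpha / 2)      # two-sided critical value
    # Under the assumption, the replication's z statistic is N(z_obs, 1).
    upper_tail = 1 - std.cdf(z_crit - z_obs)
    lower_tail = std.cdf(-z_crit - z_obs)
    return upper_tail + lower_tail


if __name__ == "__main__":
    print(round(replication_probability(0.01, alpha=0.01), 2))
    print(round(replication_probability(0.01, alpha=0.05), 2))
```

Under these assumptions, an original result with p = 0.01 has roughly a 0.50 chance of reaching the 0.01 level again and roughly a 0.73 chance of reaching the 0.05 level, nowhere near 0.99. If the original estimate overstates the true effect, the chances are lower still.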
Moreover, Nickerson (2000, p. 256) cites nine additional articles in this subsection which emphasize "that a small p value does not guarantee replicability of experimental results." Rossiter ignores these. He chooses, instead, to subscribe to Nickerson’s (p. 256) conjecture that "a bet on replicability of a result that yielded p < .001 would be a safer bet [than a bet] on the replicability of a result that yielded p < .05." But why would we want to engage in "betting games" about the replicability of a result? As one scenario, how much would you be willing to bet on the merits of the p value as an index of the replicability of a result if the original study reports a tiny p value caused by a trivial effect size and an extremely large sample size? According to Rossiter, this tiny p value found in the initial research would be seen as evidence all but guaranteeing a "successful" replication, and thus "there is no need to replicate." Not so. If a replication were indeed conducted and found an essentially similar, trivial effect size, but used a considerably smaller ("typical"?) sample size, the result would not replicate, i.e., p > .05. Paradoxically, this so-called "failure" to replicate ("insignificant" p value) is, in fact, a success (same effect sizes). The p value in the original study is highly statistically significant only because it is very sensitive to sample size. The p value is a very poor demarcation criterion for determining replication successes/failures.
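The scenario above can be illustrated with a small calculation. This is only a sketch: the effect size (0.02 standard deviations) and the two sample sizes are hypothetical numbers chosen for illustration, and the helper `two_sample_p_value` is a simplified z test that assumes equal group sizes and known unit variance.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p value of a z test for a standardized mean difference.

    Treats `effect_size` as the observed difference in standard-deviation
    units, with equal group sizes and known unit variance (simplifying
    assumptions made for illustration).
    """
    standard_error = sqrt(2 / n_per_group)
    z = effect_size / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same trivial effect observed in both studies.
TRIVIAL_EFFECT = 0.02
p_original = two_sample_p_value(TRIVIAL_EFFECT, n_per_group=100_000)
p_replication = two_sample_p_value(TRIVIAL_EFFECT, n_per_group=200)

print(f"original:    p = {p_original:.2g}")
print(f"replication: p = {p_replication:.2g}")
```

With these made-up inputs the original study is significant far beyond the .001 level, while the replication, which observes exactly the same effect, yields p of about 0.84. The two p values differ only because the sample sizes differ.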
What we need are more reliable measures of a result’s replicability (like overlapping confidence intervals around both studies’ point estimates) over the spectrum of empirical conditions generally encountered by researchers. Unfortunately, the p value does not "convey a well-understood and sensible message for the vast majority of problems to which it is applied" (Berger and Sellke, 1987, p. 135), including its use as a measure of replication "success."
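As a rough sketch of the interval comparison mentioned above, the following computes a 95% confidence interval around each study's point estimate and checks whether the two intervals overlap. The effect sizes and sample sizes are hypothetical, the helper names are illustrative, and the normal-approximation interval assumes equal group sizes and unit variance. Overlap is treated here as an informal indicator of agreement, not as a formal test.

```python
from math import sqrt
from statistics import NormalDist


def confidence_interval(effect_size, n_per_group, level=0.95):
    """Normal-approximation CI for a standardized mean difference."""
    standard_error = sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z_crit * standard_error
    return effect_size - margin, effect_size + margin


def intervals_overlap(first, second):
    """True when two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical studies: similar trivial effects, very different sample sizes.
original_ci = confidence_interval(0.02, n_per_group=100_000)
replication_ci = confidence_interval(0.03, n_per_group=200)

print("original 95% CI:   ", tuple(round(x, 3) for x in original_ci))
print("replication 95% CI:", tuple(round(x, 3) for x in replication_ci))
print("overlap:", intervals_overlap(original_ci, replication_ci))
```

In this made-up case the intervals overlap, which points to consistent estimates even though only the larger study would be declared statistically significant. The replication's interval is also much wider, which makes visible how little a small study can say about a tiny effect.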
The fact of the matter is there is no formal warrant for using p values as measures of the replicability of results. Indeed, Hubbard and Lindsay (manuscript in preparation) argue that even when the original and replication studies both yield statistically significant results in the same direction (the usual definition of a replication "success"), this does not necessarily imply that a successful replication has been obtained. This is because using p values in this manner can thwart our ability to see the extent to which results truly generalize.
It is the systematic replication and extension of the results of previous studies, rather than p values from individual ones, that promotes cumulative knowledge development. Ironically, Sir Ronald Fisher (1966, p. 13), who popularized the use of p values, in fact placed little stock in statistically significant results from single studies: "we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon." Indeed, Fisher was a major advocate of the practice of replication, as his daughter and biographer, Joan Fisher Box (1978, p. 142), makes clear: "Fisher had reason to emphasize, as a first principle of experimentation, the function of appropriate replication in providing an estimate of error." And Fisher Box (1978, p. 142) credits Fisher with coining the expression "replication": "the method adopted was replication, as Fisher called it; by his naming of what was already a common experimental practice, he called attention to its functional importance in experimentation." Fisher (1966, p. 102) championed in particular the importance of replication with extension research: "we may, by deliberately varying in each case some of the conditions of the experiment, achieve a wider inductive basis for our conclusions, without in any degree impairing their precision." Sadly, Fisher’s message about the crucial role of replication research has gotten drowned out by the hegemony of p values and their widespread misinterpretation. Steiger (1990, p. 176) knows the antidote: "An ounce of replication is worth a ton of inferential statistics."
Some final, brief responses to Rossiter:
- Rossiter notes that "Evanschitzky et al. ignore the generalizability of the original study’s findings." However, one cannot generalize on the basis of a single study. All you can do is speculate. Science demands more than the latter.
- Rossiter claims "Editors’ widespread insistence on replication with nonstudents improves [respondent heterogeneity]." Where is the documentation for this claim? On the other hand, we do know that editors and reviewers are biased against the publication of replications (see Hubbard and Armstrong, 1994 and Hubbard and Vetter, 1996, among others, for references demonstrating this).
- Rossiter states "Single stimulus studies where the single stimulus differs in the replication has to be the main reason for ‘failure’ to replicate." Again, where is the evidence supporting such an assertion?
- Rossiter postulates "the likelihood that only the poorer studies are chosen for replication in the first place." Ditto!
- Rossiter continues "Scott Armstrong and his colleagues [which includes me] don’t really understand what ‘replication’ means and they continue to tell their horror story about the lack of replications and make totally misleading implications." First, we do not think weight should be given to ad hominem arguments. Second, our definitions of replications (and extensions) are in the public domain. How does Rossiter define "replication"? Third, this same inappropriate comment evidently applies with equal conviction to Brown and Coney (1976) and Zinkhan et al. (1990), whose estimates of published replication research in marketing were 2.8% and 4.9%, respectively. These estimates are remarkably similar to those of Hubbard and Armstrong (1994), 2.4%, and Hubbard and Vetter (1996), 2.9%. Of particular importance, the 95% confidence intervals of these four studies overlapped, suggesting that they are all estimating the same population parameter. In contrast, the 95% CI for the Evanschitzky et al. (2007) study did not overlap with the average of the four predecessor works, thus signaling a meaningful decrease in the percentage of published replication research in marketing.
References
Berger, J.O. and Sellke, T. (1987), "Testing a Point Null Hypothesis: The Irreconcilability of p Values and Evidence," Journal of the American Statistical Association, 82, 112-139.
Brown, S.W. and Coney, K.A. (1976), "Building a Replication Tradition in Marketing," in Marketing 1776-1976 and Beyond (K.L. Bernhardt, ed.) 622-625. AMA, Chicago.
Evanschitzky, H., Baumgarth, C., Hubbard, R., and Armstrong, J.S. (2007), "Replication Research’s Disturbing Trend," Journal of Business Research, 60, 411-415.
Fisher, R.A. (1966), The Design of Experiments. 8th ed. Edinburgh: Oliver and Boyd.
Fisher Box, J. (1978), R.A. Fisher: The Life of a Scientist. New York: Wiley.
Gigerenzer, G., Krauss, S., and Vitouch, O. (2004), "The Null Ritual: What You Always Wanted to Know About Significance Testing But Were Afraid to Ask," in The Sage Handbook of Quantitative Methodology for the Social Sciences (D. Kaplan, ed.) 391-408. Thousand Oaks, CA: Sage Publications.
Hubbard, R. and Armstrong, J.S. (1994), "Replications and Extensions in Marketing: Rarely Published But Quite Contrary," International Journal of Research in Marketing, 11, 233-248.
Hubbard, R. and Vetter, D. (1996), "An Empirical Comparison of Published Replication Research in Accounting, Economics, Finance, Management, and Marketing," Journal of Business Research, 35, 153-164.
Nickerson, R. (2000), "Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy," Psychological Methods, 5, 241-301.
Oakes, M. (1986), Statistical Inference: A Commentary for the Social and Behavioral Sciences. New York: Wiley.
Steiger, J.H. (1990), "Structural Model Evaluation and Modification: An Interval Estimation Approach," Multivariate Behavioral Research, 25, 173-180.
Zinkhan, G.M., Jones, M.Y., Gardial, S., and Cox, K.K. (1990), "Methods of Knowledge Development in Marketing and Macromarketing," Journal of Macromarketing, 10, 3-17.