





VOLUME 30, NUMBER 5 May 1999

SHARED PERSPECTIVE

Significance tests and 'results-blindness'

By Joseph J. Locascio, PhD
Massachusetts Institute of Technology, Cambridge,
and Massachusetts General Hospital, Boston

Limitations and inappropriate uses of null hypothesis statistical significance testing (NHST) in behavioral research have been widely cited. Critics recommend alternative data analysis approaches and even outright "banning" of NHST from professional journals. I agree with most criticisms, but would stop short of supporting a ban.

In my opinion, this controversy misplaces the locus of the problem somewhat. I believe part of the difficulty with the current use of NHST is the exaggerated practical implications that have come to be attached to its results. The debate is implicitly fueled by the excessive weight given to whether a study's primary results are statistically significant in determining whether they get reported in the literature. A study with nonsignificant findings is often considered a "failure," not meriting being written up or submitted for publication. If it is submitted, the nonsignificance diminishes the paper's chances of being accepted.

There are negative consequences to this practice. The very definition of NHST posits that if a null hypothesis (Ho) is true (or effects negligible), on average, about five percent of all studies testing essentially this same Ho will, by chance, find p <= .05. These studies will be virtually the only ones with publication potential for reasons noted above.
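The five percent figure can be checked with a small simulation. The following is a minimal sketch, not part of the original column: it assumes normally distributed data with a known standard deviation of 1, a two-sided z-test, 30 observations per study and 20,000 studies. All function names and parameter values are illustrative choices.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(sample, sigma=1.0):
    """Two-sided z-test p value for Ho: population mean = 0, with known sigma."""
    z = fmean(sample) * len(sample) ** 0.5 / sigma
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def simulate_null_studies(n_studies=20000, n_per_study=30, seed=1999):
    """Run many independent studies in which Ho is exactly true (mean = 0)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_study)]
        p_values.append(two_sided_p(sample))
    return p_values


p_values = simulate_null_studies()
# Share of studies that reach p <= .05 even though no effect exists.
significant_share = sum(p <= 0.05 for p in p_values) / len(p_values)
print(f"share of null studies with p <= .05: {significant_share:.3f}")
```

Under these assumptions the printed share should land close to .05. If only those studies were written up, every published test of this Ho would report a rejection.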

As a result, the scientific community is presented with biased information. A true Ho will appear to be ubiquitously rejected (with reported results in the same direction when tests are one-tailed or implicitly so, i.e., significance in a nonintuitive direction is assumed chance error). Additional problems are: significant but trivial effects are overemphasized; for false Ho, "nonsignificant" results showing alternatives as more likely are unreported; databases for secondary analysis become biased too.
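The bias in the published record when Ho is false can also be illustrated by simulation. This sketch is an illustration under stated assumptions rather than an analysis from the column: the true mean of 0.2, the known standard deviation of 1, the 25 observations per study, and the rule that only studies with p <= .05 are "published" are all hypothetical choices.

```python
import random
from statistics import NormalDist, fmean

TRUE_EFFECT = 0.2   # assumed small true population mean (Ho: mean = 0 is false)
N_PER_STUDY = 25    # assumed sample size, giving low statistical power
N_STUDIES = 20000   # assumed number of independent studies

rng = random.Random(30)
all_estimates = []
published_estimates = []
for _ in range(N_STUDIES):
    sample = [rng.gauss(TRUE_EFFECT, 1.0) for _ in range(N_PER_STUDY)]
    estimate = fmean(sample)
    # Two-sided z-test of Ho: mean = 0, with known sigma = 1.
    z = estimate * N_PER_STUDY ** 0.5
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    all_estimates.append(estimate)
    if p <= 0.05:  # the hypothetical significance filter on publication
        published_estimates.append(estimate)

mean_all = fmean(all_estimates)
mean_published = fmean(published_estimates)
publication_rate = len(published_estimates) / N_STUDIES
print(f"mean estimate, all studies:       {mean_all:.3f}")
print(f"mean estimate, published studies: {mean_published:.3f}")
print(f"share of studies published:       {publication_rate:.3f}")
```

In this setup the average over all studies stays near the assumed true effect, while the average over the "published" subset is inflated, because a study of this size can only reach p <= .05 by observing an estimate well above the true value. A secondary analysis that draws only on the published subset would inherit that inflation.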

In evaluating whether to report a study in a scientific journal, I feel greatest weight should be given to the importance of the questions being asked (conveyed in a report's Introduction) and the soundness of the methods employed to answer them (communicated under Methods, which should also justify statistical techniques). If these criteria are met, the study's results are potentially useful--regardless of whether or not "effects" were found, hypotheses confirmed, or statistical "significance" achieved (unless nonsignificance is linked to method flaws). The absence of an effect is no less a valid finding than the presence of one, if the study has sound methods (including adequate statistical power). Some exceptions notwithstanding, I believe the decision to report a study should generally be made "results-blind" (Results and Discussion sections of reports must still be otherwise reviewed).

I agree with those who advocate modifying teaching practices and textbooks to show the limitations of NHST and the usefulness of alternative techniques. However, I think a blanket ban on NHST is an over-correction that well-intentioned people would not feel necessary if the practical consequences of "statistical significance" were not so extreme to begin with (a publication precondition).

Correctly employed and accurately interpreted, NHST can have a limited use as one of many research tools. But it should not have veto power over whether studies are reported at all. Journals should make explicit in their instructions to authors that statistical significance/nonsignificance of findings per se is not a factor in publication decisions. Nor is the presence/absence of effects, however indexed. (An initial increase in journal submissions will result, but eventually elevated methodological standards should have a self-censoring effect.)

Authors should also be instructed to report important nonsignificant results in the same detail they do significant effects, so everything can be quantitatively evaluated. Literature reviews and meta-analyses would then provide inductions and theoretical interpretations based on an accumulated unbiased body of published "statistically significant" and "nonsignificant" findings.

Implicit in what I have said is the notion that the "test" of significance provides a fine-grained, not categorical, result: a p value, not a decision for or against Ho. This p should be taken for what it is, a continuous estimate of a probability (that of the obtained result, assuming Ho, not that of Ho). A rigid demarcation at 0.0500000.... should not be enforced that determines whether a population effect exists or not and whether the findings will be censored from public knowledge. Attaching the word "significant" (or "reliable") only to results associated with p <= .05 is an arbitrary convention that, to me, is useful, at best, as a convenience in communication. Successful studies are not necessarily those reporting p <= .05. They are studies that address important and relevant questions with sound methods. And what they find, or don't find, are their reportable findings.


Joseph J. Locascio, PhD, is a statistician/lecturer in the department of brain and cognitive sciences at the Massachusetts Institute of Technology, Cambridge, and a biostatistician in the neurology department at Massachusetts General Hospital, Boston.



© 1999 American Psychological Association
