Hal Pashler, a psychologist at UCSD, tweeted my post about the “end of history” study, and this led me to his interesting paper, “Is the Replicability Crisis Overblown?” (with Christine Harris). Like all papers whose title is a rhetorical question, it comes down in favor of “no.”

Among other things, Pashler and Harris are concerned about the widespread practice of “conceptual replication,” in which, rather than reproducing an existing experiment, you try to find a similar effect in an adjacent domain. What happens when you don’t find anything?

Rarely, it seems to us, would the investigators themselves believe they have learned much of anything. We conjecture that the typical response of an investigator in this (not uncommon) situation is to think something like “I should have tried an experiment closer to the original procedure—my mistake.” Whereas the investigator may conclude that the underlying effect is not as robust or generalizable as had been hoped, he or she is not likely to question the veracity of the original report. As with direct replication failures, the likelihood of being able to publish a conceptual replication failure in a journal is very low. But here, the failure will likely generate no gossip—there is nothing interesting enough to talk about here. The upshot, then, is that a great many failures of conceptual replication attempts can take place without triggering any general skepticism of the phenomenon at issue.

The solutions are not very sexy but are pretty clear — create publication venues for negative results and direct replication, and give researchers real credit for them. Gary Marcus has a good roundup in his New Yorker blog of other structural changes that might lower the error rate of lab science. Marcus concludes:

In the long run, science is self-correcting. Ptolemy’s epicycles were replaced by Copernicus’s heliocentric system. The theory that stomach ulcers were caused by spicy foods has been replaced by the discovery that many ulcers are caused by a bacterium. A dogma that primates never grew new neurons held sway for forty years, based on relatively little evidence, but was finally chucked recently when new scientists addressed older questions with better methods that had newly become available.

but Pashler and Harris are not so sure:

Is there evidence that this sort of slow correction process is actually happening? Using Google Scholar we searched <“failure to replicate”, psychology> and checked the first 40 articles among the search returns that reported a nonreplication. The median time between the original target article and the replication attempt was 4 years, with only 10% of the replication attempts occurring at lags longer than 10 years (n = 4). This suggests that when replication efforts are made (which, as already discussed, happens infrequently), they generally target very recent research. We see no sign that long-lag corrections are taking place.

It cannot be doubted that there are plenty of published results in the mathematical literature that are wrong. But the ones that go uncorrected are the ones that no one cares about.

It could be that the self-correction process is most intense, and thus most effective, in areas of science which are most interesting, and most important, and have the highest stakes, even as errors are allowed to persist elsewhere. That’s the optimistic view, at any rate.

I think your statement about math is only true if you interpret “cares about” in a very specific way that non-mathematicians will not understand and that will perhaps not be immediately clear to mathematicians, either.

I think the Yamabe problem is typical. Trudinger discovered that Yamabe had erred because he wanted to reuse the techniques. When people want to use the result as a black box, they often vet the proof, but surely not as carefully as someone who needs to understand to generalize. And there are a lot of celebrated results that are dead ends, that people use neither as black boxes nor as sources of techniques. Is it useful to say that they “are the ones that no one cares about”?

Similarly, in science “most interesting,” “most important,” and “most fashionable” are rather different. Ioannidis claims that “The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.” That is potentially compatible with hot fields doing a better job of reaching true conclusions in the end, but I am not sanguine.

I believe the replicability crisis is definitely not overblown. Here’s how I came to this conclusion:

I am a mathematician who became interested in statistics about fifteen years ago. Since there was a shortage of statisticians at my university then, I quickly found myself teaching graduate statistics courses. It was quite different from my experience teaching graduate math courses in topics I wanted to learn – I found that a lot of textbooks for graduate statistics courses were quite cookbook, so I needed to do a lot of searching to find the “why’s” I naturally looked for. But my understanding (and hence the quality of what I taught) did improve slowly. I also found that many textbooks taught things that were misleading or just plain wrong.

Teaching graduate statistics courses also resulted in my being asked to serve on more Ph.D. committees outside math and being asked questions related to statistics by faculty and graduate students in various departments. I found more and more cases of published papers that were doing things that were questionable or just plain wrong, and often faced the difficult task of trying to explain tactfully to colleagues and graduate students why what they had been taught was wrong or misleading. I don’t believe I encountered any cases of willful deceit; I think of it as like the old kids’ game of “telephone,” where you sit in a circle, one person whispers a sentence into the next person’s ear, and so on around the circle, until the last person says out loud what they heard, and it’s usually hilariously different from what the first person said. Similarly, someone misunderstands something statistical and passes on the misunderstanding to their students or colleagues, and the next person passes it on with even more misunderstanding, until the misperceptions become codified as, “This is the way to do it.”

In addition, a few years ago I started attending a biology discussion group, where we read current research papers in that particular field of biology, and discuss them, with one person (typically a student taking the associated course for credit) tasked with presenting highlights of the paper for discussion. A surprising number of the papers contain questionable statistical practices or conclusions that are too strong for the analysis. Fortunately, the participants care about good science, and are receptive to my pointing out questionable practices.

Consequently, when I retired about three years ago, I started a website, Common Mistakes in Statistics: Spotting and Avoiding Them (http://www.ma.utexas.edu/users/mks/statmistakes/StatisticsMistakes.html). I invite anyone interested in this subject to visit it. (The design is a little primitive, but I hope the intellectual content is adequate. I have purposely tried to keep the math to a minimum, in the hope of making it accessible to as wide an audience as possible.) One particular problem that leads to a lot of false positives is multiple inference (see http://www.ma.utexas.edu/users/mks/statmistakes/multipleinference.html). It has received increasing attention in (at least some areas of) biology over the last several years, but nowhere near enough attention in the social sciences.
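The multiple-inference problem is easy to see in a quick simulation. The sketch below (my illustration, not from the linked site; the numbers 20 tests and α = 0.05 are just conventional choices) uses the fact that under a true null hypothesis, p-values are uniformly distributed on (0, 1), so running many tests almost guarantees some “significant” result by chance; a Bonferroni correction is one standard way to rein this in.

```python
import random

random.seed(0)

def any_significant(m, alpha):
    # Under the null, each p-value is Uniform(0,1);
    # "reject" whenever any of the m p-values falls below alpha.
    return any(random.random() < alpha for _ in range(m))

m, alpha, trials = 20, 0.05, 10_000

# Chance of at least one false positive across 20 tests, no correction.
fwer_naive = sum(any_significant(m, alpha) for _ in range(trials)) / trials

# Bonferroni correction: test each hypothesis at alpha/m instead.
fwer_bonf = sum(any_significant(m, alpha / m) for _ in range(trials)) / trials

print(f"naive FWER:      {fwer_naive:.3f}")   # theory: 1 - 0.95**20, about 0.64
print(f"Bonferroni FWER: {fwer_bonf:.3f}")    # theory: roughly 0.05
```

So with 20 uncorrected tests, a paper has roughly a two-in-three chance of finding something “significant” even when every null hypothesis is true, which is exactly the kind of false positive the website discusses.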

I also started teaching a short course, with the same name as the website, once a year in my university’s Summer Statistics Institute (there are notes on my home page, http://www.ma.utexas.edu/users/mks/), and more recently started a blog (http://www.ma.utexas.edu/blogs/mks/), which I hope to get more entries on soon. I hope that some of you who are interested in this subject will find these resources useful.

For more on this subject, see Andrew Gelman’s January 13 blog post, Preregistration of Studies and Mock Reports, http://andrewgelman.com/2013/01/preregistration-of-studies-and-mock-reports-2/.