Presidential Column
Rigor Without Rigor Mortis: The APS Board Discusses Research Integrity
Please excuse this further sidetrack from the road we were on in my previous columns. Two months ago, the column I had planned was displaced by a response to the considerable attention that various media paid to a social psychologist’s faking of data and the attendant questions about whether psychology was especially susceptible to cheating. The implication seemed to be that many, if not most, of the most striking results in psychology might be bogus.[i] I argued in my previous column that there is nothing special about psychology when it comes to fraud, that meta-analyses suggest that fraud is rare (about 2% of researchers admit to it), and that tools intrinsic to the practices of science, such as replication, help root out “false positives,” or Type I errors (concluding that some effect is present when in fact it is not), and produce a science we can believe in.
But can we do better, at least in the sense of encouraging practices that allow science to function more efficiently and effectively? The APS Board took up this and related questions at our retreat in early December. The discussion was animated and (in my opinion) very productive, so we decided that this “Boardologue” should be shared with the broader APS community. Here’s the plan we came up with: I would write a column outlining a few of the issues and possible recommendations and then we would begin spilling ideas over to the APS website by having board members share their perspectives. The intended third step is an open forum, refereed for relevance, redundancy, and respect for our community. So here goes step one.[ii]
More than two decades ago, I was one of the people invited to help celebrate the 25th anniversary of the University of Minnesota Center for Research in Learning, Perception and Cognition. The speakers were invited to speculate on how the field of learning might or might not change 25 years in the future. The only thing I remember about my own talk was the tongue-in-cheek prediction that in 25 years counter-balancing would still be a good idea. The audience laughed (probably politely), but later on, a graduate student from my lab, David Thau, told me that after the laughter died down, the graduate student next to him turned and asked, “What’s counter-balancing?”
Well, I still think counter-balancing to control for order effects is a good idea and should be used when the study design permits it. Furthermore, the fact that it may be inconvenient to do so doesn’t strike me as a good excuse for not counter-balancing. Yes, you may have to cut and paste parts of your questionnaire six times when it seems like one order would do, but I think it’s worth it. First, if you find no order effects, you’re on your way to a more robust pattern of results. Second, if you do find order effects, you may open a new line of inquiry, perhaps regarding some sort of priming effect. If you don’t counter-balance but obtain statistically significant results anyway, you won’t know whether you have lucked into the one question order that can produce the result of interest. So the issue is less about “false positives” than it is about a false sense of security surrounding the generality of the results and your interpretation of them.
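To make the “six times” concrete, here is a minimal sketch (mine, not from the column; the section labels are hypothetical) that enumerates the 3! = 6 possible orders of three questionnaire sections and rotates participants through them so each order is used equally often:

```python
# A minimal sketch of full counter-balancing for three questionnaire sections.
# The sections and assignment rule are illustrative assumptions, not a standard.
from itertools import permutations

sections = ["A", "B", "C"]             # hypothetical questionnaire sections
orders = list(permutations(sections))  # all 3! = 6 counterbalanced orders


def order_for_participant(participant_id: int) -> tuple:
    """Assign orders in rotation so each order appears equally often."""
    return orders[participant_id % len(orders)]


for pid in range(6):
    print(pid, order_for_participant(pid))
```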
Let me now turn to other suggestions from my wish list.
1) Counter-balancing (see above).
2) More on methods and procedures. At a time when journals seem to be pushing for streamlined everything, including methods sections, there is a danger that potentially relevant procedural details will be missing. If we know (or think we know) that a messy versus neat experiment room or the presence of an American flag can affect participants’ performance, it seems odd to skimp on details just because the factors are not of current interest. There are tons of studies on priming effects, but we seem to be unperturbed about writing that experimental probes were part of a larger set of tasks that (we assume) are not relevant to present concerns. Given that supplemental materials can be placed online, why not insist on providing the details and letting the entire scientific community judge their relevance?
3) In an earlier column, I suggested that attention to experimenter expectancy effects seems to have fallen out of fashion. Why not require that authors report whether or not the experimenter was blind to the hypotheses?[iii]
4) As was noted last month, Barbara Spellman, the editor of Perspectives on Psychological Science, and others are working to develop an archive of attempts to replicate experimental phenomena.[iv] Why not require authors, again in supplementary materials, to describe any related studies they have conducted for the same hypothesis but have chosen not to publish?[v] (I would make an exception for studies that have blatantly flawed designs.)
5) Another rule with lots of exceptions [vi] might be to include the actual data in supplementary materials. Some journals, such as Judgment and Decision Making, already have this rule.
Well, I’m going to stop here because I don’t want to consciously or unconsciously plagiarize other board members. My tentative bottom line is that we could add a touch more rigor to our empirical efforts and that it may be feasible to do so by some slight shifts in publication policies.
But we don’t want rigor mortis.
Some well-established areas of research may be like Phase III clinical trials, in which the methods and measures are settled issues and the only concern is with assessing effect size. Other areas, however, may rely on open-ended tasks in which the dependent variable cannot and typically should not be specified in advance. For example, to analyze people’s sortings of (pictures of) different species only in terms of taxonomic relationships would leave researchers blind to alternative organizational schemes (such as sorting according to the habitats where species are found). In her dissertation studies, my former student Sara Unsworth [vii] got a great deal of mileage out of asking rural Wisconsin Native American and European American adults to tell her about “their last encounter with deer.”
This sort of work raises different challenges with respect to rigor, as typically it just isn’t feasible to specify a coding scheme in advance. I’m not sure what we know about the science of developing coding schemes, and our standards for establishing inter-rater reliability, in my opinion, remain underdeveloped.[viii]
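For context, the workhorse index here is chance-corrected agreement, and the conventional “acceptable reliability” cutoffs applied to it are exactly the kind of standard that can feel arbitrary. Below is a minimal sketch (my illustration, not Medin’s; the codes and data are hypothetical) of Cohen’s kappa for two raters applying a binary code to the same responses:

```python
# Cohen's kappa: observed agreement corrected for agreement expected by chance.
# Raters and codes below are invented purely for illustration.
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[c] * counts2[c] for c in set(rater1) | set(rater2)) / n**2
    return (observed - expected) / (1 - expected)


r1 = ["present", "present", "absent", "absent", "present", "absent"]
r2 = ["present", "absent",  "absent", "absent", "present", "present"]
print(f"kappa = {cohens_kappa(r1, r2):.2f}")  # raw agreement .67, kappa .33
```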
I guess this is all part of what makes our field so exciting. We have a large advantage over other sciences in that our focus on human cognition and behavior naturally includes researchers and the psychology of their practices. We are intrinsically part of that which we study, and that is why rigor without rigor mortis not only advances our science but is part of it as well.
All of the Board members participated in the December discussion. Here are representative comments from a few:
Popularity Shouldn’t Define Scientific Significance—Lisa Feldman Barrett
Technology Could Help—Susan A. Gelman
Universal Rules Could Be Problematic—Roberta L. Klatzky
Impact Factors Have Too Much Influence—Morris Moscovitch
We Need to Work on the Bigger Questions—Gün R. Semin
Replication Will Expose Cheaters—Joseph E. Steinmetz
Popularity Shouldn’t Define Scientific Significance
1) Recently, there has been a premium on “innovation,” “transformation,” and “paradigm-changing” research. This is important, of course, but it overlooks the importance of “normal” science, in the Kuhnian sense. Grant applications now go unfunded merely because they are deemed incremental. Not everything has to be paradigm shifting to be valuable.
2) There seems to be a blurring of boundaries between popular and scientific impact. Until recently, most scientists did not care whether or not their work was communicated to the public. This was a problem of course, but now the pendulum seems to have swung in the opposite direction: Sometimes it appears as if we care too much, and the science suffers for it.
Scientists now have competing goals. One is to publish work that is newsworthy (e.g., to be mentioned in the New York Times science section). A second is to publish work that is theoretically important and makes a significant contribution to the scientific question at hand. These are not necessarily the same, and so should not be confused. But they often are. Findings in papers are often hyped in a way that is more appropriate in a press release than in a scientific paper. Students now cite popular books (which are, at best, a secondary source) as evidence of some finding or effect, instead of citing the scientific papers. Often papers are triaged (in Science for sure, and some even claim this is happening in Psychological Science) because they are not newsworthy or splashy even though they are quite scientifically important.
Often, when we try to communicate things to the public (e.g., calling freezing behavior “fear” and calling the acquisition of freezing to a tone via classical conditioning “fear learning”), this filters back into the science itself in a way that is not helpful (e.g., the belief that “fear” has a unified biological cause).
3) The public still does not have a good grounding in the value of science and science education. Hence, they believe that there should be applied value in research that delivers right away. They often don’t understand that a theory is not a speculation or a hypothesis — it is a scientific explanation that is well established with data — or they confuse an effect with a theory.
4) Many psychology students no longer receive education in philosophy of science, and this limits the scope and validity of their theory-building attempts.
– Lisa Feldman Barrett
Technology Could Help
In the interest of encouraging replication and promoting transparency in evaluating methods, I suggest that each published paper include a video of the experimental protocol (faithfully reproducing the context, stimuli, spatial layout, experimenter intonation, gaze, pacing, feedback, etc.). This would essentially serve the purpose of what current methods sections are intended to do (permit others to replicate one’s research) but would use current technology to capture much more detail and nuance than is possible with a brief verbal description. This small step would potentially have several benefits: (a) replication attempts would be more uniform, and the effects of slight procedural variations would be easier to measure; (b) methodological flaws in items or procedure would be more apparent; (c) unconscious cuing of participants may be detectable; and (d) researchers may be encouraged to be more accountable in ensuring that procedural details are thoughtfully considered in the design phase of the research and uniformly followed during data collection. There are serious issues to be addressed regarding how to maintain realistic fidelity without introducing IRB concerns re confidentiality, but I think these issues are solvable.
– Susan A. Gelman
Universal Rules Could Be Problematic
I’m all in favor of rigor and view my own work as high on the appropriate scales, whatever they may be. That said, I think that attempts to capture best practices by a set of rules are almost certainly doomed to fail, given the diverse nature of psychological science. Psychophysical experiments, for example, have been published with an N on the order of 2, possibly with only the authors (who obviously know the hypotheses) being willing to undertake the tedious hours of data collection with a repetitive task. That may not be the norm, but it illustrates why restrictions shouldn’t be expected to apply universally. My own work often uses instruments that can measure the positions and forces people exert over time, with the possibility of dependent variables exploding accordingly. If I discover that a variable affects jerk (the third derivative of position) rather than acceleration (the second derivative), am I prohibited from publishing?
– Roberta L. Klatzky
Impact Factors Have Too Much Influence
There are three main criteria by which we judge scientific work: rigor, importance in the sense that it makes a significant empirical and theoretical contribution, and general interest. It is right to focus on the first of these criteria because it essentially is the only one to which a set of rules or procedures can be applied — but it is the one that causes the least trouble. Fraud or failures to replicate do not arise because the studies were lacking in rigor, at least not insofar as a panel of experts could judge. Many of the suggestions regarding practices that would facilitate judgment of scientific rigor are good ones, such as publishing raw data (though we already have a system in place that requires us to make raw data available on request). However, allocating journal space or cyberspace to indicate failures to replicate adds noise to a system (how are we to distinguish poorly executed studies from proper ones?), and requiring a statement from authors as to whether the successful study was accompanied by many nonsuccessful ones would seem to invite evasion, if not mendacity.
The more difficult problem concerns the other two criteria, since there is a strong subjective element to both. In order to deal with this subjectivity, the scientific community has tried to introduce a measure of objectivity. Citations and their derivatives, such as the h-index and impact factors, have assumed a measure of importance out of all proportion to their usefulness, so that rather than merely taking the pulse of scientific discoveries, they are used to prescribe a scientific regimen.
It is easy to see how we’ve arrived at this state of affairs. Citations, which are meaningful indices only after an article has been published, have been subverted to determine the fate of an article before publication. Here’s how it works. Journal editors and publishers used citation counts as a way of determining the impact an article published in a given journal has on the field and derived impact factors based on that. Once this was in place, articles were judged not only on their own merit, but on the impact factor of the journal in which the article was published. Because of competition among journals to keep impact factor high, articles came to be judged not only on the basis of the first two criteria — rigor and importance — but also on the basis of the third — general interest, which has little scientific merit aside from drawing public attention to the article. As an analogy, consider a criminal trial in which the jury is instructed to take into account the effect their verdict would have on public opinion before rendering a decision. This mindset is reflected in a journal style in which all or part of the Method section, where rigor is judged, is relegated to the back of the paper and, more recently, to a supplementary section that is available only online and for which a separate search is required. In addition, to entice high-impact scientists to contribute to high-impact publications, reviews had to be rapid and turnaround times short, both militating against careful scrutiny of the publication. We quickly went from using citations as an imperfect measure of a paper’s impact to having them determine ahead of time what kinds of papers will be published.
Most scientists can tell which way the wind blows, and if some are obtuse, tenure committees, granting agencies, and government ministries will make sure their senses are sharpened. Promotion and funding for individuals, departments, and universities (see the examples of the UK and France) are based increasingly on these measures. Knowing that they are judged by these “objective” measures, many scientists, myself included, have succumbed to the lure of publishing short, eye-catching papers that will get them into high-impact journals, rather than submitting a paper with an extended series of experiments. We have seen this trend in our own flagship journal, Psychological Science. Our boast of having over 2,000 submissions a year reflects not only the quality of the journal, which is high, but also the fact that its impact is high and its articles are short. One or two experiments, rather than a series of them, will get you in.
When I was a post-doc, an eminent psychologist who sat on the scientific review panel of the Canadian equivalent of NIH told me that in the 1950s and 1960s, a publication in Science or Nature was given no more credit than a book chapter and far less than a publication in a specialty, archival journal. The reason was that it was difficult even then to know on what basis the article was accepted for publication in Science or Nature, and given how short it was, it was difficult to judge the rigor of its methods. I doubt we can return to that time, but we can downplay the importance we attach, not to citations, because they occur after the fact, but to journal impact factors. To increase rigor, we can return to requiring a series of experiments on a topic before we accept it for publication, even in a journal like Psychological Science.
– Morris Moscovitch
We Need to Work on the Bigger Questions
The prolific production of papers based on fabricated data by someone who was assumed to be a very respectable member of our community, and their publication in very “respectable” journals, has been a major source of reflection.
I shall take this opportunity to draw attention to an issue that provides a possible account for how this extraordinary event flourished undetected. It is the theoretical as well as “phenomenal” permissiveness that our science and some of our prestigious publication outlets encourage. The absence of a true paradigm in the Kuhnian sense, the absence of truly integrative theory, and the absence of a problem that requires collective attention and research are undoubtedly among the contributory factors that allowed this type of misconduct to pass undetected for such a long time. The fractioning of the quest for knowledge into sound bites is becoming the criterion by which quality and significance are judged, and our graduate programs are becoming increasingly sophisticated in training the next generation with these goals in mind. This means that we have to reflect and work upon the bigger questions that capture the imagination of many who compete for the answer for the answer’s sake. It means that we have to train the next generation to identify big questions, teach them to separate the big ones from the seduction of sound bites, and teach them to work in teams.
The recent revelation of misconduct, the full magnitude of which we shall only hear closer to spring of this year, is also diagnostic of what we value and why we confer high accolades in our profession, since the culprit in question had accumulated all possible honors in his field of practice and beyond. The shift from the individual to the team, a process that is in the making, will also contribute to a rethinking of the distribution of rewards as well as of the administrative and organizational structures we have to adopt in order to bring about these changes that are essential for our science to progress and reduce the hiccups we occasionally experience.
– Gün R. Semin
Replication Will Expose Cheaters
I believe three points should be considered in this discussion:
1) Cheating and scientific misconduct sadly happen in all fields of science and take many forms, from the outright forging of data to not reporting all of the data that have been collected. Psychological science is not different in this regard, and we need to come to terms with the fact that there are dishonest people in our field.
2) Replication, a distinguishing feature of science, ultimately ferrets out cheaters — it just takes time. While it is important that we take steps as a field, when possible, to prevent scientific fraud (perhaps in the way data are handled and reported), I hope the field does not substitute regulation for replication in its attempt to legislate against this bad behavior. Replication remains our chief tool for eventually exposing cheaters.
3) The overwhelming majority of scientists in our field are honest and diligent, and these honest people are our ultimate tool for countering cheating — they sense when something isn’t right, and as long as our institutions maintain an open and non-intimidating atmosphere, our honest colleagues will expose the cheaters. This happened in the case that triggered this discussion.
– Joseph E. Steinmetz
Footnotes
[i] Given that our field is an empirical science, I’ll just note that this (dubious) claim can be tested.
[ii] The recommendations listed at the end of the recent Simmons, Nelson, and Simonsohn (2011) paper also constitute good material for discussion. For example, they suggest that authors should be required to decide the rule for terminating data collection before data collection begins and report the rule in their article. I can see the value of this principle in certain areas of research, but it may not be so practical in other ones. For example, in the cultural research conducted in my lab, our informal rule is something like “let’s run a few pilot participants to see how variable the data are going to be and then interview enough informants so that we can detect fairly large differences.”
[iii] Of course, there are many situations where blindness or double blindness is not feasible. My aim is just to increase the practice when it can be done.
[iv] I should have added that Harold Pashler and Barbara Spellman are collaborating in this effort, coordinating what started out as two independent projects.
[v] A postdoctoral fellow in my lab, Sonya Sachdeva, told me about attending a talk where at some point the speaker mentioned that “it took me ten studies to finally produce this effect.”
[vi] A case in point involves rich data sets (e.g., video observations) that might be analyzed in multiple ways for different purposes or to ask different questions. Here, authors should probably be given some reasonable amount of time to explore their own data before making them publicly available.
[vii] Sara is now an Assistant Professor at San Diego State University.
[viii] For example, “acceptable reliability” standards strike me as a bit arbitrary. I wonder, for example, if some variation on signal detection theory might be applied to adjust for inter-rater differences in criteria for saying some code is present.
Comments
TOO MUCH EMPHASIS ON NUMBERS AS OPPOSED TO QUALITY
1. I am disturbed by the shift to number of papers and away from whether the work represents a contribution to our data base. The move to short notes facilitates this. I worry when a recent PhD or postdoctoral fellow who is a candidate for a job has more than 10 papers and beats out someone with one or two very interesting and deep papers. It is hard to imagine how the candidate had enough time to do enough significant research.
2. Anyone in charge of a lab who puts their name on a paper should be sure to study the data. This does take time but is a fundamental of good science.
3. New and theoretically important results should be replicated by the lab. It is a good idea to have someone who is naive to the hypothesis and result do this. An example:
Gelman, R. (1982). Accessing one-to-one correspondence: Still another paper about conservation. British Journal of Psychology, 73, 209-220.
4. Work with young children can require a series of attempts to develop a working method. These rejected efforts should be described; otherwise, the risk of a failure of others to replicate looms.
5. Critiques that challenge a result without including the original condition are inappropriate. Authors should be able to provide details of their method, including a videotape.
6. One should actually read the papers that are cited.
7. Take time to read papers published before 2000.
Lisa: You are absolutely right. The efforts to be flashy and newsworthy, in my view, are contributing to an environment where science (systematic building of knowledge) takes a back seat. I sometimes wonder if we as a field are inadvertently encouraging this in our understandable efforts to gain publicity for the good work that psychologists do.
I fully agree that replication is a critical step in establishing the validity of scientific findings and ferreting out errors of all sorts. Unfortunately, it is difficult (if not impossible) to get research grants to support replications, replications do not count towards tenure or any other form of professional recognition, and replications are rarely publishable. Thus, this avenue for confirming the validity of interesting findings has been pretty much closed by various aspects of our professional culture.
Erroneous or fraudulent medical cures are continually proposed in biology, and erroneous or fraudulent cold fusions or perpetual-motion machines in physics, but they do not contaminate those scientific fields because they are investigated and disconfirmed by other scientists. Although published studies in psychology should include the details of design and analysis relevant to the internal validity and to the generality of the particular study at hand, the way to address the problem of research integrity is not to impose rigid publication regulations and demand endless assurances from investigators. Rather, the solution is for the field to reward the proposing and documenting of highly significant ideas and discoveries. If the discovery is good enough, other psychologists will want to follow up. If I find that a core dimension of personality is a significant predictor of long life, others will want to know when and why. It will not matter how many author instructions and checklists there were in the journal.
I enjoyed reading all the comments by members of the Board and found much agreement with all of them to varying extents. What follows is a bulleted list of reactions.
• It is not only coursework on philosophy of science that has been lost. I see a constant battle to maintain graduate coursework on research methods. As a young colleague said to me recently, “My students will learn methods in my lab.” That leaves a large burden on the lab PI to teach the next generation about things such as the need to do counter-balancing and the need to NOT do such things as selective reporting and adjusting the study in the middle.
• Science, Nature, and PNAS papers are golden tokens in our field, but the short form leads to a dearth of critical details and selective reporting that is very dangerous. Even the supplemental materials section does not eliminate the ability of a researcher to leave out inconvenient truths if they so choose. These journals also get lots of attention in blogs and other web-based media and spread in a viral fashion conclusions that would not be merited given a closer look at the data.
• I like to tell my graduate students that the most important thing psychology has done for giving psychology away is to invent the introductory psychology class. We had been doing that for years before George Miller said that is what we should do. As someone who often teaches introductory psychology, I am concerned that what sells introductory books is often the flashy and sexy findings that have not been given the scrutiny of replication.
• Related to my last bullet, in some ways psychological science has become way too popular and marketable, so findings go to the web and to print media very quickly, and often to commerce, before they are proven. I like to remind people of the ‘discovery’ of easy cold fusion some time ago. Before it was shown to be not an intentional hoax but just sloppy science, the state of Utah had allocated millions of dollars to the implementation of this quick, clean, and simple energy solution. I see very similar things happening today in areas such as cognitive and brain training, which is claimed to fix just about everything from autism to schizophrenia.
• Intentionally faked results are, I think, not that common, but sloppy science in the pursuit of funding, fame, and job security is much too easy to get away with. The new website associated with Perspectives is a great first step.
1. I agree that replication is an important way to assure honest findings in the long run.
2. I also agree that making data available to others is a good idea but I have found that often authors will not do it. There needs to be some type of enforcement mechanism.
3. I always let grad students do the data analysis but I look at it closely to make sure nothing seems out of whack.
4. I always decide on the number of subjects in advance.
5. Here is a shocker for you: the obsession with the hypothetico-deductive method of doing research is a major cause of both bad theory building and making up H’s after the fact. Let me stress that this is not really the fault of authors. Editors openly tell authors to do this–I have seen it happen. The fundamental flaw here is that science building is basically inductive; people should not be forced to make up theories and then make deductions–it’s very phony. Good theories take many years to develop based on lots of studies and integration of findings. Many findings in science are unexpected. My goal setting theory with Latham took 25 years to develop; under modern editorial standards, it never could have been created at all. It is ok to have hypotheses but they should not have to be deduced from non-existent (made-up) theories. The whole approach to theory building in psychology (and probably related fields) has been perverted.
Scientific fraud is of course a bad thing, but it is going to happen no matter what safeguards we put into place. The stakes are simply too high. In fact, there’s no major discipline that doesn’t have its cases of fraud — medical science is certainly replete with examples. What’s important to remember is that science is a self-correcting enterprise. Sooner or later, fraudulent findings will be undermined by failures to replicate. We should keep our eyes and ears open, of course, but rather than tearing out our hair when the inevitable frauds occur, we should congratulate ourselves for making matters right. A few fraudsters do not undermine the good work that the vast majority of us do.
The problem may be beyond help, but it stems in part from a trend I have argued against for many years: a growing emphasis on the number of publications and consistency of topic. It has reached insane levels when these are the only kinds of evidence accounting for evaluations of faculty and students alike. We need a major, massive culture change in graduate education in psychology: to make breadth of psychological knowledge among students (and their faculty) share evaluative importance with specialization, to emphasize data, if properly replicated, on which theories can eventually be built, and to insist on time-consuming attempts (including counter-balancing!) to build into research plans (OK: “research designs”) safeguards against procedurally biased, unreplicable results. Moreover, student research (and at least some faculty research) should be encouraged to follow Ed Locke’s 5th point. Not every study needs to, or should, conclude with a new theoretical position! I’d be more inclined to accept a journal submission that concludes only with a set of stimulating questions suggested but not answered by the research being reported.
I’m glad that the Board is talking about these issues (and hope you all had a chance to read the special section in the January issue of Perspectives).
Meanwhile, our website, www.psychfiledrawer.org, has hundreds of views but very few additions to the list of failures to replicate. WHY?
It takes only a few minutes. People groan about their failures to replicate OTHER people’s research all the time. Are we all afraid to put our name to those results?
Various people have suggested to me that no one would spend time posting stuff on the web that wouldn’t benefit themselves. My answer: Have you ever used Wikipedia? Yes, we need to change our culture.
Replication is essential, but is not appropriately rewarded, nor published enough. Can we change this?
When findings conflict, we need authors to do as Latham and Erez did under the guidance of Locke (Latham, G. P., Erez, M., & Locke, E. A. (1988). Resolving scientific disputes by the joint design of crucial experiments by the antagonists. Journal of Applied Psychology Monographs, 73, 753-772).
A number of commentators, including Doug Medin, recommend increased attempts at replication. If a published study in fact represents a Type I error, replication will indeed help, since there will be only about a 5% chance that the finding will replicate. But if there is in fact an effect or relation (the usual case), replication per se does not work well. If there is an effect or relation, a Type I error cannot occur; only Type II errors can occur. Studies by Jacob Cohen and Gerd Gigerenzer have shown that average statistical power in psychological literatures is approximately .50 and has not increased over time. So if the first study obtains a significant finding, the probability of replication is only 50%. Likewise, if the first study gets a nonsignificant result, the probability that this finding will replicate is also .50. So replication alone is not a solution. But multiple replication attempts are essential to cumulative knowledge, because replications are necessary to allow a meta-analysis of the question to be conducted. Therefore, websites like the one set up by Barbara Spellman (editor of the APS Perspectives journal) that allow posting of any and all attempted replication studies are essential. As many have noted, journals do not publish replications, and without replications meta-analyses are not possible. And without meta-analyses cumulative knowledge is not possible.
This comment expands on my previous comment on the role of replication. Some commentators have recommended that individual published articles should include at least one or two replication studies of their key hypotheses. The idea is that if an article confirms an hypothesis and then replicates this confirmation in two additional studies, we can be fairly sure the conclusion is not a Type I error. This is true, but there is a problem with this model in practice: Many such published articles report successful replications that objectively have a low prior probability of occurring even when the hypothesis tested is correct (and a relationship does exist). For example, if statistical power in the first study (based on the effect size observed and the sample size) is .50 (a representative value; see my earlier comment), then the probability of replication (the probability of getting a significant result) in the two subsequent replication studies is (.50)(.50) = .25. So when we look at articles containing three studies of this sort, we should see significant results in the two replication studies in only about 25% of such articles. My impression from reading Psychological Science is that this figure is actually 100% or very near 100%. This strongly suggests something is seriously awry. In the latest issue of the APS Perspectives journal, the article by Bertamini and Munafo comments on this phenomenon: “Indeed, there are far more statistically significant findings present in most literatures than we would expect even if the effects reported are indeed real (Ioannidis, 2011).” (p. 70) We are not talking about Type I errors here. We are talking about an unnatural and very suspicious absence of Type II errors, which statistical logic says should be there but are not. I have commented at length on this because I do not see that any of the other comments have addressed this problem.
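As a concrete illustration of the arithmetic in the two comments above (my sketch, not part of the original comments), the expected rate of all-significant study packages under a given level of statistical power can be computed directly:

```python
# A minimal sketch, assuming independent studies of a real effect, each run
# with the same statistical power: P(all k studies significant) = power**k.

def p_all_significant(power: float, k: int) -> float:
    """Probability that k independent studies of a true effect all reach significance."""
    return power ** k


for power in (0.50, 0.80):
    for k in (2, 3):
        print(f"power = {power:.2f}, {k} studies: "
              f"P(all significant) = {p_all_significant(power, k):.2f}")

# At power = .50, two follow-up replications should both succeed only ~25% of
# the time, and a full three-study package only ~12% of the time, which is why
# near-universal success across published packages looks suspicious.
```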
Replication is surely crucial to keeping research production within ‘ethical’ routes. Yet replication cannot be demanded only of scientists. It should be promoted by journals and institutions, by associations and policy-making bodies, and ultimately by lab leaders (within- and between-lab replications). Researchers and scholars in different disciplines are motivated to publish original and innovative research regardless of the detail of the methodological description. In other words, the entire scientific production system should turn its back on a ridiculous “scoop disease.”
The more that we turn universities into competitive industries, the more we will encourage such practices. It is entirely predictable.
Replications are often considered a ‘gold standard’ – yet one must worry about replications that
(a) merely report capitalization on chance again and again
Example: When N=10, significant effects (correlations, mean differences, etc.) will be huge no matter how many times you replicate (see the sketch after this list).
(b) merely repeat bias again and again
Example: A well-accepted measure of a construct might generate consistently different results than other legitimate measures of the construct, no matter how many times you replicate.
Meta-analyses of replicated effects can address some of these problems, but generally, preventing problems within studies is preferred (e.g., better to conduct large-N studies than to correct for small-N studies post hoc).
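To make point (a) above concrete, here is a minimal sketch (mine, not Oswald’s; it assumes SciPy is available) of why every significant correlation in a small-N study is necessarily huge: it computes the smallest |r| that can reach p < .05, two-tailed, for a given sample size.

```python
# Smallest correlation that reaches p < .05 (two-tailed) for a given N,
# using r = t / sqrt(t^2 + df) with df = N - 2.
from scipy import stats


def min_significant_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is statistically significant for sample size n."""
    df = n - 2
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return t_crit / (t_crit ** 2 + df) ** 0.5


for n in (10, 30, 100):
    print(f"N = {n:3d}: any significant r must exceed {min_significant_r(n):.2f}")

# With N = 10 the threshold is about .63, so every "significant" correlation
# is necessarily huge, replicated or not.
```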
Best,
Fred
–
Fred Oswald
Associate Professor
Department of Psychology, Rice University
http://www.owlnet.rice.edu/~foswald
My comments as a retired Fellow of both APS and APA are going to be markedly different, and my guess is that APS officers and members are not going to like them or will be uncomfortable about them. Nevertheless, I would be shirking my responsibility as a psychologist and as a patriotic citizen (“my country, please do right and no wrong”) if I were to ignore this invitation by the APS to submit my comments.
With the exception of fraudulent research that could lead to deadly or injurious consequences, and I can imagine that seldom being the case in the field of psychology, the APS is worrying about a relatively minuscule and inconsequential problem (too much psychological research, in my opinion, is inconsequential) and certainly has the capability of trying not only to counteract bad PR but also to take steps to clean its own house regarding that problem.
There are two kinds of wrongdoing, illegal and legal; the first being behavior that falls below the bottom line of the law; the second being behavior that falls below a higher standard, the bottom line of ethics. I would guess that most if not all fraudulent research in the field of psychology is of the legal kind of wrongdoing.
I’m going to draw your attention now to a ruinous and sometimes deadly source of wrongdoing of both kinds. The direct source, manifesting itself daily, is the corpocracy, America’s own worst enemy and a scourge of millions of human beings around the globe.
The corpocracy is the Devil’s marriage between powerful corporate interests and the three subservient branches of America’s government. The corpocracy is directly responsible for America having, among advanced nations, the worst socioeconomic conditions and for being, strictly for self-serving purposes, the most murderous nation overall since WWII. Much of the corpocracy’s wrongdoing is legal because the corpocracy has made it legal.
I have studied the corpocracy for more than a decade; have recently written a book about it; and am doing everything I can to motivate Americans to launch their “two-fisted democracy power,” figuratively speaking, against the corpocracy. I know how it operates and how powerful and consequential it is in all spheres of human life, be it personal, cultural, economic, political, or environmental.
In my research I have noted an unmistakable and alarming trend in most fields of science and the professions, including the research and practice of psychology. These fields and professions are becoming increasingly “corporatized”; that is, they are becoming more and more compromised and neutralized by government and corporate funding. They have in effect become an unwitting, unknowing, or self-denying ally of the corpocracy. Public universities, for example, have become sources of corporate-tainted research and teaching. Psychologists, look around and at yourselves. Do you spot, for example, any corporate-tainted research or teaching? Do you see here and there a corporate-endowed professorial chair? Is any of your work funded by the government or corporations, and are the nature and outcomes of your work unduly influenced by your patrons?
I will close by citing the most egregious example of corporatized psychology I can think of on the spur of the moment; by urging one or more psychology researchers to do what I do not have time left to do (I’m 76 and darned near died from a heart attack last December so I have to moderate my energy and choose my activities judiciously); by giving you two references; and by asking you one more question.
The example, without further commentary is the role of psychology and psychologists in interrogation of individuals suspected of being terrorists and in the overall field of military research and training.
My exhortation is that out there in APS land one or more professors and graduate students ought to undertake a study of the corpocracy’s compromising of our field.
Thirdly, if you should want to look into the background behind my comments, you could ask your library to obtain a copy of the book, The Devil’s Marriage: Break Up the Corpocracy or Leave Democracy in the Lurch, so that you can read it; and you could read what I have to say on my website, http://www.uschamberofdemocracy.com.
Lastly, I ask you this final question: What are you doing in your capacity as a psychologist for or against the corpocracy?
Our current university/academic culture has something to do with problems related to dishonest research. A decline in collegiality and even just acquaintance has occurred over the years of my career. Faculty live in all directions at distances from the universities, so that they may scarcely know each other. Many departments are large and fragmented, and faculty members have little knowledge of what their colleagues are doing. Faculty colleagues do not often enough discuss their work with others, and they even less often read what others have written. Promotion and tenure decisions are made too often by simply counting publications and adding up the grant dollars, not by reading, thinking about, and evaluating the work involved. I keep wondering how all these notorious cases of fraud could have gone on so long without any of the perpetrators’ colleagues being at all suspicious, let alone their department chairs.
Fred Oswald has a good point about repeated measures causing a serial increase in error rate; this is why it is important, when performing either replicate studies or meta-analytic studies, that the family-wise error rate be controlled (e.g., by using a Bonferroni correction). This will limit the effects of false-positive results.
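A quick numerical sketch of the family-wise error logic (my illustration, not Locwin’s; the numbers of tests are hypothetical):

```python
# Family-wise error rate (FWER) for m independent tests, with and without the
# Bonferroni correction. Values of alpha and m are illustrative assumptions.
alpha = 0.05          # per-test significance level
m = 10                # number of tests in the family (e.g., multiple analyses)

fwer_uncorrected = 1 - (1 - alpha) ** m           # chance of >= 1 false positive
bonferroni_alpha = alpha / m                      # corrected per-test threshold
fwer_corrected = 1 - (1 - bonferroni_alpha) ** m  # now bounded near alpha

print(f"Uncorrected FWER for {m} tests: {fwer_uncorrected:.2f}")   # ~0.40
print(f"Bonferroni per-test alpha:      {bonferroni_alpha:.4f}")   # 0.005
print(f"Corrected FWER:                 {fwer_corrected:.3f}")     # ~0.049
```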
Medin brings up a good point in his principal article about Phase III clinical trials and effect sizes. In these trials, conditions are controlled as tightly as possible (double-blinded, placebo-controlled, with a pre-selected patient population). The results largely derive from the receipt of placebo and treatment. Within the broader psychological literature, there are fewer controls on many of the extrinsic factors which all contribute to the introduction of variation in the data. This confluence of “life factors” is what makes psychological research interesting, but also at times questionable.
If every psychological experiment could be performed with the rigor and elegance of an electron confined in a magnetic field to measure quantum effects (and repeated 10,000 times over a period of three days), we might not have to field these issues. However, people don’t work that way; it is the human condition we study, not nuclear phenomena.
Ben Locwin
Joseph,
I agree that replication is extremely important. Until we as a discipline begin to reward replication, it is not likely to happen on a large scale. I think our priorities are not where they need to be. In undergraduate research methods and statistics courses I talk about the importance of replication; however, out there in the real academic world, there is no incentive to replicate previous studies.
Thanks to Doug Medin & the Board members for initiating this open discussion.
I agree with Fred Toates’ comment; we are, in some ways, reaping what we have sown.
I also want to add that the pace of science has become disturbingly fast. Data acquisition, analysis, and publication are now accelerated to create the shortest timeframe — from conception to dissemination — ever seen. Is such fast science a good idea? Is more really better?
I think of research output like reproductive output, following r/K selection theory: in a nutshell, some follow an r-strategy (produce many publications, invest little in each) and some tend to follow a k-strategy (produce few, invest heavily).
(for more) http://en.wikipedia.org/wiki/R/K_selection_theory
These strategies are strongly influenced by the environment in which they are executed. The K-strategy, which seems to be what most are advocating for in this commentary — a careful practice of science where quality is emphasized — needs a stable environment in which to thrive. Academia has become an unstable and unpredictable environment, particularly with university budget challenges and when support for psychological science is tied to political ideology (certainly the case in my line of work — sexuality research).
How can we create a more stable environment in which careful science can thrive?
Back to writing yet another grant where I plead the case for relevance of the science of sexuality (beyond making flashy headlines).
Cheers,
Meredith
—-
Meredith Chivers
Assistant Professor, Queen’s National Scholar
Queen’s University
Kingston, ON Canada
I agree with your statement that popularity should not define scientific significance. I also agree that psychology students rarely receive education in philosophy of science and that this can limit the scope and validity of their theory construction skills. Teachers of psychology might address this problem by integrating this kind of material into their teaching of psychology courses. It is especially important for students to understand how hypotheses serve to coordinate theory with empirical evidence.
This is a very interesting and timely Board forum, and it elicited some great comments. APS is to be commended for: a) having such a diverse and knowledgeable Board to review the topic; b) having put the Board members’ individual ideas in writing and then allowing the comments that were presented; and c) having allowed all of us to see the balance that is emerging in how to deal with this issue. It appears as if APS wants this issue to be dealt with in a corrective but positive manner. I really like the idea of citing prior pilot work. What an excellent way to say, “Hey, I did this twice, even with a small N, and got the same results or different ones, as that is noteworthy too.”
In the age of the Internet and EndNote, is not the scotomisation of older papers the first scientific fraud? Alas, we can often read papers and even review articles that obviously lack an adequate literature review.
Michel Cabanac
Department of Neuroscience and Psychiatry
Laval University
1. It is elementary that replication is the only way of ensuring that one is doing the experiment the right way, or of discovering that previous attempts were incorrect due to a flaw. So replication is scientifically essential – not just important as a screen for fraud. Successful replications can be footnoted; unsuccessful ones should be explained in detail.
2. Many of the other points made above concern scientific quality, not overt fraud. These points indicate serious lapses in reviewing and editing. The bar has to be high for the research to be worth publishing. Journals and individuals who publish poor quality work need correction; one hopes at least that their citation rates go down. Everyone should try their best to see if the conclusions really follow from the premises and the results; Lisa’s example is just one of many ways the logic can fail.
3. Because the frontier of science is unknown, publishable results most often contain surprises. So reviewers and editors have to judge papers by the quality of the work and the importance of the question, not by what they expect to happen. Researchers have to judge their own results through the same lens. As a long-term (20+ years) editor, I suspect these are the two greatest problems of all – how to avoid preconceptions and see Nature for what it is, and how to verify the logic of the experiment. I think these are way, way more important than fraud, nasty as the latter is.
In his November 3, 2011 article “Fraud Case Seen as a Red Flag for Psychology Research,” New York Times science writer Benedict Carey mentioned Daryl Bem’s article on extrasensory perception in the March, 2011 issue of the Journal of Personality and Social Psychology. He stated that “In cases like these, the authors being challenged are often reluctant to share their raw data.” In Bem’s case, he was mistaken. Bem’s raw data are available to anyone requesting them as are the source codes for the computer programs that ran the experiments, the databases for analyzing the data, and instruction manuals for those who wish to try replicating the results. Carey also mentioned criticism of Bem’s statistical procedures without noting that a rejoinder to that criticism was published in the October, 2011 issue of that same journal.
In my blog post (http://www.bestthinking.com/thinkers/science/social_sciences/psychology/michael-smithson?tab=blog) on the Stapel case, I raised several of the points that have been made in this forum thus far. These pertained to the opportunities, means and motives for cheating in research that are enabled and supported by the current scientific culture in psychology (and not unique to psychology). My main recommendations boiled down to developing a publication culture that is not biased towards publishing only statistically significant results, that encourages genuinely independent replication, that treats well-conducted “failed” replications identically to well-conducted “successful” ones, and does not privilege “replications” from the same authors or lab of the original study. Replications, meta-analyses and Bayesian methods are not problem-free but they would make obvious cases of fraud easier to detect relatively early on.
In my post I also highlighted the addiction psychology has to hypothesis-testing. I claimed that a deleterious consequence of this addiction is an insistence that every study be conclusive and/or decisive: Did the null hypothesis get rejected? My suggestion for a remedy is a publication culture that does not insist that each and every study have a conclusive punch-line but instead emphasizes the importance of a cumulative science built on multiple independent replications, careful parameter estimation and multiple tests of candidate theories. I observed that this adds to the rationale to calls for a shift to Bayesian methods in psychology.
But now I’d like to draw out another inference, related to points raised by Professors Moscovitch and Semin. The current hypothesis-driven culture in psychology also is experiment-driven. Experiments and hypotheses are our “sound bites.” They dominate our journals at the expense of both careful descriptive studies and exploratory investigations of heretofore unexamined or poorly understood phenomena. Both kinds of studies often tackle topics where there isn’t yet enough theoretical development to form meaningful hypotheses, let alone design experiments to test them. Because grants and publication outlets for such studies are hard to come by, I suspect that the discipline as a whole suffers from a tacit intellectual dishonesty whereby many of us attempt to persuade ourselves and others that we possess greater prior understanding of psychological phenomena than really is the case. While this kind of dishonesty isn’t cheating per se, it still is a stain on our integrity and produces premature theorizing and experimentation.
I think the problem in our field is much more serious than the (hopefully) very rare cases of downright fraud. Recent papers that review or document psychological scientists’ actual practice suggest that data finagling or — more subtly — extremely motivated data treatment occurs very often (e.g., John, Loewenstein, & Prelec, in press). I understand this as a cluster of misunderstandings regarding the nature of science in which the data are taken to be an embellishment or decoration of one’s elegant ideas. In my young career I have already had what I thought were very good ideas, but unfortunately most of them proved to be simply wrong. My colleagues and I did find a few actual phenomena on the way, and maybe these will also prove to be somewhat interesting. So, unfortunately, the solution comes down to the old unsophisticated principle of being honest, especially to yourself. Rigorous scientific education won’t hurt none, though.
Here is a good one: How is it that we can use measuring tools that assume each unit interval is equal to every other, when we are measuring subjective phenomena?
My background is in physics, and I find the way psychologists do science often perplexing. For example, every original psychology research paper is supposed to have an hypothesis and then an experiment to prove or disprove it. No physicist does research this way. I bet most of them will even show a blank face when asked about what their hypothesis is. Physics is probably the most successful science ever, and research is done by theorists, by experimentalists, and sometimes by people who are both. There is no requirement that everybody has to do experiments. And papers are written that describe experimental or theoretical results – there is no need to have a hypothesis. And, also interesting, there is plenty of interest in looking at old findings, not only new findings.
Make it easier to publish replications. My students and I have done many replications and have discovered that journals are rarely interested in publishing such work, especially, of course, when the replication fails to produce the appropriate finding. How many unpublished negative results does it take to refute one published result? We almost never know how many failures there have been.
I agree that scientific dishonesty could only be reduced through some kind of extension of the formal requirements for manuscript submission. This could include the following:
1. A requirement to register the research hypotheses in a central database prior to data collection.
2. The submission of a scanned form containing a brief description of the measure used (e.g. the experimental procedure or questionnaire content) and the dates of data gathering sessions signed by each participant. (If the study targets sensitive issues and the participants may wish to remain anonymous, either the confidential treatment of this form would have to be ensured, or some other proof of when the data collection took place could be provided.)
3. The raw data should be submitted with the manuscript.
Of course, it is sad if requirements such as the above need to be introduced, especially as the additional tasks would place extra burdens on researchers already spending much of their valuable time doing administrative work for research proposals and the like. However, in many other areas of life, bureaucratic measures have been introduced as a response to the behaviour of a dishonest minority. I see no reason why science should be an exception.
A lack of scientific, and thus personal, integrity is in no way unique to psychology, or at least to the social sciences. Science is a vehicle for pursuing personal goals of various kinds: money and power, self-extension and fulfillment, leaving a notch in history; but also the well-being of individuals and groups in many ways. This is not “good” or “bad”; it is essential to science, which needs a perpetuating momentum. HOWEVER, this is the case in any scientific discipline you can think of: it is “universal.” Again, this does not mean that it is “good,” but that we have to expect “fraud and scandals” in any place where people can aspire to and achieve personal goals.
One of the people behind ‘http://www.psychfiledrawer.org’ stated “People groan about their failures to replicate OTHER people’s research all the time. Are we all afraid to put our name to those results? ”
Good point. I ran labs in psychology for many years, and we often chose research where we tended to get ‘significant’ results, so we usually found a right visual field advantage for words, that mnemonics helped memory, and so on.
However, another constraint was to cover the different areas of psychology research.
Rarely could we find a ‘risky shift’ or ‘group polarisation’ effect. Rarely could we find a difference between males and females in the use of maps.
Of course, we may not have set up the research well, or the participants (usually psychology students) may have been in some way different.
It would be interesting to see the outcomes of research using large numbers of participants e.g. at http://www.psych.uni.edu/psychexps/
Here is my prescription. Much of this echoes what others have said:
1. Just as it is true that “all generalizations are false”, it is also true that “all categorical rules will be regretted”. Requiring reports of certain statistics, videos of experimental procedures, and the like seems like overkill. Doug Medin mentioned that Judgment and Decision Making (JDM), which I edit, requires publication of data. That is not quite true. Some authors have good reasons for not providing data, and I do not insist that they comply. (And this is stated at http://journal.sjdm.org.)
2. JDM also publishes “attempted replications of surprising results”. Perhaps “surprising” is not the best term, but some results seem to go against my Bayesian prior, and those are the ones I have in mind. Sometimes these attempts succeed and sometimes they fail.
3. Authors should be evaluated by citations of their own articles (with attention to whether the citations are positive or negative), not by the journals they publish in. For new articles that have not been cited, promotion and search committees might just read them and come to their own opinion.
4. Editors should be especially suspicious of articles when it appears that the significant results arose from trying lots of different analyses until one of them worked. There is no rule I can think of to identify such tinkering. But one practice I follow is to ask authors to report what seems to me to be the single best statistical test of the effect of interest.
5. Journals should not brag about news coverage. To me it is a negative. It makes the journal seem more interested in the sorts of non-intuitive findings that usually turn out to be wrong than in advancing knowledge.
6. There must be other ways to communicate knowledge to the public aside from getting coverage of new, sexy results. Many of the most important things that psychology has to say to the public have been known for a century.
Of course there are problems in psychology as a science.
Part of the problem is that psychologists are not trained to understand cause and effect. Fitting data does not constitute proof of a cause. Analyses of Variance are often interpreted as if they provide evidence for a cause, yet ANOVA simply expresses an additive model for effects; it does not specify a cause for those effects. ANOVA is a fit to data, and it is often misinterpreted. I attach a message sent to our Vision Journal Club regarding ANOVA.
Your comments are welcome. (Since subscripts cannot be displayed here, they are written with an underscore, e.g., Y_i for Y with subscript i.)
ABOUT THE “INTERACTION”
The performer of statistical analyses assumes and applies a mathematical model to the data on which the analysis is based. This comes as a shock to some psychologists. Once at a faculty meeting at McMaster I got the response “What do you mean a mathematical model? We are not doing mathematical modeling.” I asked, “Well, what do you think you are doing?” The response, “We are computing statistics.”
There are some really big misconceptions among many social scientists, psychologists included, that need to be addressed. The Analysis of Variance (ANOVA), together with the idea of an interaction, is a major statistical tool that is often misunderstood.
First, the ANOVA is a mathematical model that is supposed to “fit” the data. In the simplest case the model describes the effect of measurement error on observations. If the observed value is a continuous variable Y, then the underlying constant but unknown effect µ (Greek mu) and the continuous error e are modeled, following Gauss (1821), as
Y = µ + e. (1)
The assumption is that, whatever the unknown underlying cause, the actual measurement is compromised by fickle measurement error. Furthermore, this error simply adds to the constant effect µ. If it is also assumed that the error has a mean value of zero (why would it be anything else?), then, happily, the average value of the observations, that is, the expected value, will be
E(Y) = µ + E(e) (2)
= µ + 0
= µ.
Problems arose when there was more than one underlying variable to be measured: how to determine whether there were actually differences among effects that depend on the different underlying variables, differences due to the variables and not just to the error. In agriculture especially, one wanted to determine whether there were differences in wheat yield for several fertilizers.
Two models developed. First, one might hypothesize that each fertilizer had its own unique effect. Then for fertilizer i (i = 1, 2, …, I) we could maintain the basic idea with
Y_i = µ_i + e. (3)
If you do not see how an equation works, put some numerical values into it. In this case, suppose I = 4 and µ_1 = 5, µ_2 = 10, µ_3 = 1, and µ_4 = 4. Then suppose the error comes from a “normal” distribution with, say, mean 0 and variance 1. Pick an error value at random and add it to µ_1, pick another at random for µ_2, another for µ_3, and another for µ_4, and there you have your data and an illustration of the mathematical model.
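To make that illustration concrete, here is a minimal simulation sketch. The use of Python with NumPy, and the random seed, are my additions, not part of the original message; the µ values are the hypothetical ones above.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true effects for the four fertilizers, as in the example above.
mu = np.array([5.0, 10.0, 1.0, 4.0])

# Add Normal(0, 1) measurement error to each mu_i to get one observation per fertilizer.
e = rng.normal(loc=0.0, scale=1.0, size=mu.size)
y = mu + e

print(y)  # four simulated wheat yields, one per fertilizer: data generated by model (3)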
The error does not depend on the fertilizer, only on making the measurement, say of the amount of wheat grown. The same situation can be thought of equivalently as deviations from an overall mean µ, with individual effects α_i. This leads to a different interpretation of the measurement of effects.
Y_i = µ + α_i + e. (4)
Now the effects are seen to be deviations from an overall mean, µ. These deviations must have a mean value of zero (this can be proved). [Given the values above, what are the values of α_i?] If we subtract the overall mean of the observations from each group mean Y_i, we estimate the unknown effect α_i. Since the α_i must add to zero, some will be positive and some negative.
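A quick numerical check of the reparameterization in equation (4), using the same hypothetical values as above (this sketch also answers the bracketed question):
import numpy as np

mu = np.array([5.0, 10.0, 1.0, 4.0])  # the hypothetical fertilizer effects from above
grand_mean = mu.mean()                # the overall mean, here 5.0
alpha = mu - grand_mean               # the deviations alpha_i: [0., 5., -4., -1.]

print(grand_mean, alpha, alpha.sum())  # the alpha_i sum to zero, as required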
But what if there are two separate causes of effect, different fertilizers and different seeds? Let’s just add the second variable to the basic idea. Now,
Y_ij = µ + α_i + β_j + e, (5)
where α_i is the effect due to fertilizer i and β_j is the effect due to seed j (j = 1, 2, …, J).
In order to determine the unknown values of α_i and β_j, a special experimental design is required. Suppose we figure that out. Now, having run the proper experiment, we want to know whether the α_i and β_j effect changes in our measured values. This is often the role of the Analysis of Variance. We can do the calculations, get F statistics, and wallow in “significance.” Get happy. Have a beer.
After a few beers we begin to wonder if this mathematical model is really applicable to the values that we measure. We can always do the algebra, the calculations, but if the model is not correct what are we doing? How can we determine if the mathematical model is correct?
Suppose the model is not correct. Suppose that there is another variable due to the joint combination of fertilizers and seeds, of α_i and β_j. Perhaps these things multiply together; perhaps they act in joint opposition to each other. Perhaps this, perhaps that. We don’t know, but we want a test of the simple additive model, equation (5). We do it by assuming there is something else afoot and adding to the model another term that measures the deviation of the new model from the simple additive-effects model. Namely,
Y_ij = µ + α_i + β_j + γ_ij + e. (6)
For each combination of variables, each combination of fertilizer and seed, there is a component γ_ij that affects the measurement in addition to the assumed effects due to α and β.
The model is still additive, but now we need to determine which equation fits the data better, equation (5) or equation (6). Again the Analysis of Variance leaps to the fore. Let’s determine whether the γ_ij are equal to zero. If so, equation (5) is correct; if not, equation (5) is incorrect. That is what a test for “interaction” is all about. The γ_ij are the effects due to the interaction of variables α and β.
In ANOVA terms, if the interaction is “significant,” you reject equation (5). You reject the additivity of causes. You essentially reject the model as a description of your data. You cannot then discuss the data as if the original model applied to it. The model, equation (5), does not apply to the data.
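For what it is worth, here is one way to run exactly this comparison of equation (5) against equation (6). This is only a sketch: it uses Python with pandas and statsmodels, and the balanced fertilizer-by-seed data set is made up; none of the names or numbers come from the original message.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)

# Hypothetical balanced design: 4 fertilizers x 3 seeds, 5 plots per cell,
# with data generated from the purely additive model (5).
fert = np.repeat(np.arange(4), 3 * 5)
seed = np.tile(np.repeat(np.arange(3), 5), 4)
y = 5.0 + 1.5 * fert + 0.5 * seed + rng.normal(0.0, 1.0, fert.size)
df = pd.DataFrame({"fertilizer": fert, "seed": seed, "y": y})

# Equation (5): additive effects only.  Equation (6): adds the interaction terms gamma_ij.
additive = smf.ols("y ~ C(fertilizer) + C(seed)", data=df).fit()
with_interaction = smf.ols("y ~ C(fertilizer) * C(seed)", data=df).fit()

# The F test compares the nested models: a small p-value says the gamma_ij are not
# all zero, i.e., the purely additive model (5) does not describe the data.
print(anova_lm(additive, with_interaction))
Because these data were generated additively, the interaction test should usually come out non-significant; generate them with a fertilizer-by-seed product term instead and it usually will not.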
Another point: we are talking about deviations, about subtracting this from that. Subtraction is fine for real variables; the axioms defining the real number system include an axiom for subtraction. However, subtraction does not exist for either probabilities or proportions; there is no axiom of subtraction in probability theory. The models described here all require subtraction, so these models do not apply to probabilities. The Analysis of Variance, based on real-valued random variables, does not apply to either probabilities or proportions.
Last, suppose we started off wrong in the first place. Suppose the “true” model of effects due to the causes is
Y_ij = µ · α_i · β_j · γ_ij · e. (7)
Oops, everything multiplies together. There is no additivity at all. If you take the logarithm, you get additivity again:
ln(Y_ij) = ln(µ) + ln(α_i) + ln(β_j) + ln(γ_ij) + ln(e). (8)
All the same questions arise again, as well as the question of whether the error e is a multiplier, as in equation (7), or just an additive effect, so that the better model might be
Y_ij = µ · α_i · β_j · γ_ij + e. (9)
Which is the better tool for understanding the effects? Again, a test of the two different models would determine the better mathematical model, but it would not determine a cause for any of the effects. Have a beer and think about your interaction with it.
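A sketch of the log-transform point in equations (7) and (8), again with made-up numbers and the same hypothetical tools (Python, pandas, statsmodels) as in the previous snippet:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)

# Hypothetical data generated multiplicatively, as in equation (7),
# so that ln(Y) is additive, as in equation (8).
fert = np.repeat(np.arange(4), 3 * 5)
seed = np.tile(np.repeat(np.arange(3), 5), 4)
y = 5.0 * (1.0 + 0.3 * fert) * (1.0 + 0.2 * seed) * rng.lognormal(0.0, 0.1, fert.size)
df = pd.DataFrame({"fertilizer": fert, "seed": seed, "y": y, "log_y": np.log(y)})

# Fit the additive and the interaction model on the raw scale and on the log scale.
for response in ("y", "log_y"):
    additive = smf.ols(f"{response} ~ C(fertilizer) + C(seed)", data=df).fit()
    full = smf.ols(f"{response} ~ C(fertilizer) * C(seed)", data=df).fit()
    print(response)
    print(anova_lm(additive, full))
Comparing the two scales can suggest which one is more nearly additive, but, as the message above stresses, neither comparison identifies a cause for any of the effects.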
________________________________________________________________________
[NOTE: In 1837 Poisson provided a model for discrete, not continuous, variables. In his model the variability in measurements is caused by the mechanism generating the data, not by additive measurement error.]
Following Susan Gelman’s comment, I also think psychological science publishing could take better advantage of new media technologies.
In many ways, dissemination of research is still built around the idea of actual paper objects distributed through institutional and individual subscriptions.
As an alternative, imagine online articles (whether “online first” or online only) that link to supplementary files with more detailed descriptions of procedures, including photographs and videos of the lab and procedure, examples of the stimuli that were used, copies of all scales/questionnaires (even those that are already well known, as long as copyright isn’t violated), descriptions of pilot work or unpublished work that didn’t work out as hoped, and (in some cases) raw data for verification and eventual meta-analysis.
These articles (articles 2.0?) would be rich texts that take advantage of current multi-media capabilities and, through hyperlinking, connect with citations and with less conventional sources such as blogs and webpages.
The amount of detail provided would aid efforts at replication and would also reveal the story behind the studies being described. A lot of research has encountered false starts, changes in direction, and an evolution in methodology, analysis and interpretation. Imagine how much readers could learn about the research by not only reading the final, clean paper, but also seeing the paths that the researchers took to get there.
Michael Seto
I did find that getting rid of erroneous papers is pretty much impossible, even when everybody agrees behind closed doors that the paper is wrong. My particular example is the Atkinson and Shiffrin paper of 1968; the details of my investigation can be found here: http://www.webmedcentral.com/article_view/1021 and here: http://shortmem.com/Short%20Term%20Memory.nsf/dx/12202009014412PMEGTPYQ.htm . I tried to publish my critique in regular psychology journals, but everybody agreed that the A&S paper was wrong, so my description of why it was wrong was not considered novel. Meanwhile, the A&S paper continues to get a hundred new citations every year, none of which explicitly states that the paper was wrong! Articles in fields outside of memory psychology keep referring to it as well, as if the paper were correct. There seems to be no way to cut off this particular piece of misinformation…
I wish to address a couple of practical problems regarding research and science today.
Firstly, I find that a huge problem is publication bias: so many studies with “negative” results are not published and are very difficult to get hold of. When other researchers want to conduct meta-analyses or reviews and choose among the pool of published studies on a specific subject, they may get biased results, as happened when the WHO evaluated the effectiveness of complementary and alternative medicine (CAM) and found it favorable. This was later criticized by Ernst (and others) for not taking into consideration studies that had found negative results on CAM and studies that had not used RCTs with placebo controls. There is a need for a more comprehensive system in which all studies, including those that did not favor the hypothesis, are published in a database so they can be peer-reviewed, receive recognition, and be easier to access later. This seems to be a problem in many areas when conducting meta-analyses, so a better system should be proposed (or, if one already exists, a way found to implement it).
Secondly, I think there should be a much greater emphasis on stating the quality of studies when they are published, and on making the different quality-rating systems easier for the general public to understand (and not only for highly educated researchers). I know different rating systems are used, but they still do not seem to be directed at the general public, journalists, and the media. I think this should be much clearer.
That the FDA and similar organizations are not fully government funded, but have to survive partly through sponsorships (some from drug companies themselves), also raises big problems concerning which types of studies are conducted and which are published. Irving Kirsch has illustrated this problem very well in his book “The Emperor’s New Drugs: Exploding the Antidepressant Myth,” discussing the effect (or lack thereof) of antidepressant medications and showing how drug companies have tried to conceal the apparently large placebo effect.
And lastly, there should be more focus on clinical significance, and not only statistical significance in studies where this is relevant.
Madeleine Dalsklev
Thank you for hosting this conversation — I think it’s critically important. Regardless of whether it’s right or fair that psychology is suddenly attracting all this negative attention for the misbehavior of two people, I expect it to continue as long as the press smells blood in the water. That’s a perfect reason to get the house in order.
My two cents: replication and the ability to publish negative findings are essential building blocks of solid science. Psychology has neither. I understand why this is so, historically and practically, but I think it is a severe liability in selling the field as a true science. It is common to claim that science is self-correcting because fluke findings will inevitably be exposed as such. In psychology this is simply not true.
I have a Ph.D. in cognitive psychology, but essentially left the field immediately after graduate school, in part because of the existential angst these issues inspired (I’m not joking). I now work as a data analyst for a research group in an allied field and am pursuing an MS in statistics; the upside is that I get to think about these issues as part of my job, and not as a distraction that will prevent me from keeping my job (which would have been the case had I tried to get tenure anywhere). I’m still a proud APS member, though; I love this society’s embrace of change and its willingness to confront the major issues of the day head-on. I look forward to seeing what comes of this discussion.
Replication of research is important, but not so much when it is done by the original researcher(s). Indeed, what is to stop a charlatan from inventing a whole series of experiments? Replications should come from other, independent labs. We should be able to receive funding for such studies, and the journals in which the original research was published should always consider publishing these replications, whatever their outcome. Note that these measures will not expose the charlatan, but they will limit the immediate impact of scientific fraud and, in the process, make psychology more resistant to spurious findings.
The next stage of our science will reward generating data through good studies as more important than generating interesting papers that explain a few samples. Others have commented on the value of replication and of collecting the data associated with studies.
Instead, we should be encouraging researchers to run good studies in response to common questions. Rather than sending out requests for published and unpublished studies when conducting meta-analyses, theorists should be able to go to a database and access a series of studies.
Imagine if APS published a list of 10 research questions each spring. Across the nation, faculty and students could prepare for their coming research projects by choosing to study one of these questions through conceptual replication. At the end of the project, the collected data, along with a brief APA-style report focused on methods, could be submitted to a web portal. Additionally, participants could be invited to complete other online measures, such as personality, attitude, or demographic questionnaires, thereby providing even richer data. Under an ideal system, the highest-quality projects would receive awards and stipends.
We publish too many papers that too few people read. Instead we should be generating more behaviorally focused data that broadly measures human existence and let many people analyze the data.
When warning against the specialization of science, Karl Popper (1958) wrote, “For me, both philosophy and science lose all their attraction … when they become specialisms and cease to see, and to wonder at, the riddles of the world.” However, whereas in 1958 these specializations were driven by research questions, today they are driven primarily by the publication market. When researchers are asked why they are doing this or that experiment, the answer is quite often “because it is easy to publish and will yield a high impact factor.”
To counteract the impact of the bibliometric approach, the German Research Foundation now asks grant applicants to submit only their five most important articles. There has been an outcry from reviewers who claim that only the bibliometric approach is objective and fair. However, from the point of view of measurement theory, bibliometrics is mostly “measurement by fiat,” that is, it provides no meaningful metric.
As president of the University of Regensburg (2001–2009), I often encountered documents from search committees that consisted mainly of computations (Hirsch index, citation indices, etc.) resulting in a ranking of candidates without any argument about the content and inherent quality of the candidates’ research. In one case, a reviewer based his negative judgment on the fact that “the candidate is not intelligent enough to cite himself.”
This situation in science reminds me of the derivatives market – before the bubble burst. Science is about “understanding the world in which we live” (Popper) and not about getting published and cited – that is fashion.
In the case of important findings, editors should ask for an independent replication before publishing any so-called breakthrough.