The Integrity of Psychological Research: Uncovering Statistical Reporting Inconsistencies

Accurate reporting in psychological science is vital for ensuring reliable results. Are there statistical inconsistencies in scientific articles?  

In this episode, APS’s Özge Gürcanlı Fischer Baum speaks with Michele Nuijten from Tilburg University to examine how overlooked errors in statistical reporting can undermine the credibility of research findings. Together, they discuss Nuijten’s research published in Advances in Methods and Practices in Psychological Science and explore practical strategies to enhance the quality of psychological research.

Send us your thoughts and questions at [email protected] 

Unedited transcript

[00:00:10.160] – APS’s Özge Gürcanlı Fischer Baum 

Statistical reporting is a core part of writing scholarly articles. Many conclusions in scientific reports rely on null hypothesis significance testing, making accurate reporting essential for robust findings. What if there are inconsistencies in academic journals? How would that affect our field? If we cannot trust the numbers reported, the reliability of the conclusions is at stake. I am Özge Gürcanlı Fischer Baum with the Association for Psychological Science. Today, I have the pleasure of talking to Michele Nuijten from Tilburg University. Michele recently published an article on statistical reporting inconsistencies in APS’s journal Advances in Methods and Practices in Psychological Science. Join us as we explore the impact of these inconsistencies and discuss potential solutions to enhance the credibility of psychological research. Michele, welcome to Under the Cortex.

[00:01:09.810] – Michele Nuijten 

Thank you very much. I’m honored to be here. 

[00:01:12.520] – APS’s Özge Gürcanlı Fischer Baum 

Please tell us about yourself first. What type of psychologist are you? 

[00:01:17.420] – Michele Nuijten 

I guess you could classify me as a methodologist. I have a background in psychological methods; that is what I studied. But right now, I think I would call myself a meta-scientist, so someone who researches research. I’m really focusing on trying to detect problems in the way that we do science in psychology and related fields, and, if I uncover any issues, on thinking about pragmatic solutions to make sure that we can move forward in our field in a better and more solid way.

[00:01:49.360] – APS’s Özge Gürcanlı Fischer Baum 

Yeah, that is fantastic. How did you first get interested in becoming a meta-researcher and studying statistical inconsistencies?

[00:01:59.980] – Michele Nuijten 

It’s actually quite some time ago now. I was still in my master’s program, which is over 10 years ago now, I think. I was a master’s student in a very interesting time, around 2011, 2012, which is often marked as the start of what is called the replication crisis in psychology. We had the massive fraud case of Diederik Stapel, who, coincidentally, was a researcher at the university where I’m working now. There was also an article that seemingly proved that we could look into the future, and there was the article showing that if you have so much flexibility in data analysis, you can inflate your Type I error rate to 50%. A lot of these things were happening. Around that time, I also got interested in reporting inconsistencies, mainly out of a more technical interest, I guess. People around me were working on these inconsistencies. Together with a friend, Sacha Epskamp, we thought, well, this seems like a problem that you can automate. Maybe we can write a program to help us detect problems in articles.

[00:03:06.220] – APS’s Özge Gürcanlı Fischer Baum 

Yeah, I will come to that. But it is interesting that we are contemporaries. During that replication crisis, I was in grad school as well; I was a senior graduate student then. We also had some replication problems in my field, on the developmental side. I totally hear you. I’m glad it created a research program for you. Now we have a tool called statcheck. Can you tell our listeners what it is?

[00:03:34.480] – Michele Nuijten 

I think the easiest way to explain what statcheck is, is to compare it to a spell checker for statistics. Instead of finding typos in your words, it finds typos in your statistical results. What it does is take an article, search through the text for, as you mentioned, null hypothesis significance tests, so effectively tests with p-values, and try to use the numbers to recalculate that p-value. An equivalent would be if you wrote down 2 plus 4 equals 5; when you read that, you know something is off, like these numbers don’t match up. Statcheck does a similar thing. It searches for these test results, which often consist of a test statistic, degrees of freedom, and a p-value. It uses two of these numbers, the test statistic and the degrees of freedom, to recalculate the p-value and then sees whether these numbers match or not.
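
To make the recomputation described here concrete, below is a minimal R sketch of the general idea: recompute a two-tailed p-value from a reported t statistic and its degrees of freedom, then compare it with the reported p-value while allowing for rounding. The numbers are made up for illustration, and this is not statcheck’s actual implementation.

```r
# Illustrative consistency check for a reported two-tailed t-test result
# (hypothetical numbers, not statcheck's own code)
reported_t  <- 2.20   # reported test statistic
reported_df <- 28     # reported degrees of freedom
reported_p  <- 0.04   # reported p-value

# Recompute the two-tailed p-value from the test statistic and df
computed_p <- 2 * pt(abs(reported_t), df = reported_df, lower.tail = FALSE)

# Allow for rounding: the reported p has two decimals, so any computed p
# that rounds to the same value counts as consistent
consistent <- round(computed_p, 2) == reported_p

cat("Recomputed p:", round(computed_p, 4), "- consistent:", consistent, "\n")
```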

[00:04:29.850] – APS’s Özge Gürcanlı Fischer Baum 

Yeah. This is an important tool, because what I read from your study is that before statcheck, your earlier work showed that about 50% of articles with statistical results contained at least one p-value that didn’t match what the test statistic and the degrees of freedom indicated it should be. How did you first notice there were problems in statistical reporting? 

[00:04:56.250] – Michele Nuijten 

This was actually a project that some of my colleagues were working on at the time, back when I was still a student. My colleagues Marjan Bakker and Jelte Wicherts, who, incidentally, are still my colleagues now at a different university, noticed that one high-profile paper contained such an inconsistency. They noticed it just by looking at the numbers: hey, something doesn’t add up here. When they went through this particular paper, they thought, well, if such a high-profile paper published in a high-quality journal already has some of these visible errors in it, how often does this occur in the general literature? They actually went through the painstaking process of checking, I think, over a thousand p-values by hand to see how often this occurred. When my friend and I saw that this was such a painstaking and, ironically, error-prone process, you can imagine, if you have to do this by hand, we thought, well, this seems like a thing that you can automate. That’s how we got on this path of looking at reporting inconsistencies in statistics. 

[00:06:04.350] – APS’s Özge Gürcanlı Fischer Baum 

Yeah, let’s talk about how it works then. I have this tool. Could you describe it for our listeners? What are the steps? We go to this website, and then what happens? 

[00:06:16.110] – Michele Nuijten 

Well, from the user’s side, it’s very straightforward. You go to the web application, which is called statcheck.io. There’s literally one button you can click on: upload your paper. You upload a paper in Word, HTML, or PDF format. Nothing gets saved in the back end. It only gets scanned by statcheck, and you get back a nice table with all the results that statcheck was able to find and a list of whether or not each one was flagged as consistent. 

[00:06:49.010] – APS’s Özge Gürcanlı Fischer Baum 

In your work, you must have seen a lot of examples. What are some of the worst examples that you saw? 

[00:06:57.330] – Michele Nuijten 

Yeah, that’s a difficult question, because statcheck, in a way, is not an AI or anything. It merely looks at numbers and says, well, these numbers don’t appear to add up. What it does is recalculate the p-value, because we had to choose a number to recalculate, and given the enormous focus on p-values in our field, that seemed like the most logical choice. But just as with my earlier example, if you say 2 plus 4 equals 5, the 5 could be incorrect, but the 2 or the 4 could also be incorrect. You don’t know. This also means that sometimes statcheck flags an inconsistency where the reported p-value is smaller than 0.001 but the recomputed p-value is 0.80 or something, which might look like a blatant error and a really dramatic difference. But it could be the case that there is a typo in the test statistic. For instance, if you write down that your t-value is 1.5 but you meant 10.5, it’s only a typo, but it seems as if it would have a huge influence on your results. Those types of inconsistencies look dramatic but might not be. On the other side, you also have types of errors that might at first glance seem inconsequential but might have big consequences. 

[00:08:21.260] – Michele Nuijten 

For instance, we have a lot of focus on this p-must-be-smaller-than-0.05 criterion to decide whether something is statistically significant. I do sometimes come across cases where the reported p-value is smaller than 0.05, and if I recompute it, it’s 0.06. In absolute terms, this is a very small difference, and you could argue that the statistical evidence that this p-value represents does not differ much. But I think it could signal a bigger underlying problem: that people might round down p-values in order to increase their chances of getting published. This is something that statcheck cannot tell you. It only flags that these numbers don’t seem to add up. It’s very hard to pinpoint what exactly the reason is. But with these particular types of inconsistencies, I do get a little bit suspicious, like, what might be going on here? 
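
As a small illustration of the kind of flag described here, the R sketch below marks a result as a decision-level inconsistency when the reported and recomputed p-values fall on opposite sides of the .05 threshold. The numbers are hypothetical and this is only a sketch of the idea, not statcheck’s code.

```r
# Hypothetical example: reported "significant", recomputed just above .05
alpha      <- 0.05
reported_p <- 0.049   # hypothetical reported p-value
computed_p <- 0.060   # hypothetical recomputed p-value

# A plain inconsistency: the numbers don't match after rounding
inconsistent <- round(computed_p, 3) != reported_p

# A decision-level inconsistency: the mismatch changes the significance call
decision_inconsistent <- inconsistent &&
  ((reported_p < alpha) != (computed_p < alpha))

cat("Inconsistent:", inconsistent,
    "- changes the significance decision:", decision_inconsistent, "\n")
```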

[00:09:17.050] – APS’s Özge Gürcanlı Fischer Baum 

Yeah. One of the other things I noticed in your report is that there are a few articles in which up to 100% of the reported results are inconsistent. How do you think that happens? 

[00:09:36.440] – Michele Nuijten 

Yeah, this signals a problem, or not really a problem, but a difficulty in reporting the prevalence of inconsistencies, because at what level do you display this? The problem is that different articles report a different number of p-values. Sometimes articles report only one p-value; if they report it incorrectly, they have a 100% inconsistency rate. But it could also be the case that people report 100 p-values and 10 of them are wrong. The inconsistency rate would be different, but which one is worse? In an absolute sense, one has more errors than the other, but you can also argue, well, if you report a lot of p-values, it’s easier to make at least one mistake. It’s hard to come up with a summarizing statistic that fairly reflects what is going on. 

[00:10:30.580] – APS’s Özge Gürcanlı Fischer Baum 

Why did you decide to try to fix it in our field? 

[00:10:36.920] – Michele Nuijten 

It seems like such low-hanging fruit. I mean, it’s an issue that technically could be spotted by anyone. It’s just in the paper; it’s right there. Peer reviewers could spot it, but it turns out that they don’t, which makes a lot of sense as well, because we are all very busy, and we’re often not that trained in statistics, especially not in spotting inconsistencies with the naked eye. But I do think these types of errors or inconsistencies are important because, as you also mentioned at the start, if you cannot trust the numbers that a conclusion is based on, how can you trust that the conclusion is correct at all? This is a type of reproducibility, I would call it: if I have the same data and I do the same analysis, I should get the same results. If I spot an inconsistency in a paper, I can already tell you that that result is not reproducible; I cannot get to an inconsistent result based on your raw data. It’s then very hard to judge to what extent the data or the conclusions are trustworthy. There are quite a lot of issues going on right now in psychology, things that have been flagged as potential problems. 

[00:11:53.130] – Michele Nuijten 

I think this seems like one of the easiest things that we can solve. If we have a spell checker like this and we can just quickly run our manuscript through it before we submit it, we save ourselves, the readers, the editors, and everyone involved a lot of pain, because if we manage to get these errors out beforehand, we don’t have to get into this annoying world of issuing corrections or just leaving the errors in there. 

[00:12:20.130] – APS’s Özge Gürcanlı Fischer Baum 

I’m really glad you said it is like spell check because I wrote down grammar check for statistical reports. This is what the tool does. Now, do you think journal editors use it or are they allowed to use it? 

[00:12:34.750] – Michele Nuijten 

It’s completely free and everyone is allowed to use it; I would encourage everyone to use it. Underneath, it’s an R package. You can use the R package if you have research intentions, if you want to scan larger sets of articles. But if you just want to scan a single paper, go to the web app and run it through. It’s free, and within a second you have your results. I would definitely encourage editors to use it. There are a few that do, for instance Psychological Science, if we’re talking about APS journals. I’m not sure, but I think maybe AMPPS also mentioned something about it. I don’t have a curated list; people or editors that start using statcheck usually don’t notify me. By the way, if you are an editor interested in using it, feel free to notify me or ask for help. I’m more than happy to assist in any way I can. But I think it’s a great use of the tool. 
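
For listeners who want to try the R package route mentioned here, usage looks roughly like the sketch below. The function names reflect the statcheck package’s documented interface, but arguments and output columns may differ across versions, so check the current documentation; the file name is a placeholder.

```r
# Install and load the statcheck R package (one-time install)
install.packages("statcheck")
library(statcheck)

# Check results reported as plain text in APA style
statcheck("The effect was significant, t(28) = 2.20, p = .03")

# Or scan a whole article; "my_paper.pdf" is a placeholder file name
checkPDF("my_paper.pdf")
```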

[00:13:31.670] – APS’s Özge Gürcanlı Fischer Baum 

Yeah, that’s exactly why I’m asking, to encourage people to use it. It is not AI. I’m glad you clarified that point. It is just a check. If we do grammar checks or spelling checks for text, we should be able to do it for numbers. This is a great tool that everybody can use, and it is free. Let’s take a step back. What was the process of making statcheck like? How long did it take? What was your team like? 

[00:14:03.470] – Michele Nuijten 

Well, I don’t think there will ever be an ending to this. It’s an ongoing project; I’ve been working on this for 10 years now. But I think the initial framework was set up in… Well, I think, as it goes with tools like this, the first version is usually done within a day or within an hour by someone. In this case, Sacha Epskamp was the person who developed the first version of statcheck. After that, I ran with it for the next 10 years to develop it further. There have been many, many updates, mainly behind the scenes. I learned a lot about software development in the process. I learned about unit testing. I learned about best practices on how to use GitHub and branches and all these terms that were new to me. That was a lot of fun to do. Over the years, I’ve had many people contributing interesting ideas or writing code for me. But mainly, I’ve kept it quite close, because sometimes tools like this that point out mistakes feel a bit tricky. I very much want to present statcheck as something that can help improve everybody’s work, as something that you can use yourself. 

[00:15:23.240] – Michele Nuijten 

Sometimes people don’t always see it like that. I’m a bit afraid to give it away, to have more people develop on it, because I’m afraid that maybe mistakes will be introduced. This is a very big pitfall of mine. I really need to learn to let go and invite more people to work on it, especially because I think that many people will be a lot better at it than I am. But this is one of the things I’ve been struggling with a little bit. 

[00:15:52.630] – APS’s Özge Gürcanlı Fischer Baum 

Yeah, but it’s your baby. 

[00:15:54.580] – Michele Nuijten 

I know, yeah. 

[00:15:55.780] – APS’s Özge Gürcanlı Fischer Baum 

You want it to work better and better every single day. No, this is a great resource for everyone involved in our field. Thank you for all the hard work you put into that. Michele, is there anything else that you would like to share with our listeners? 

[00:16:16.370] – Michele Nuijten 

I think, more in general, just about improving practices in our field, because what I really like about statcheck and about the type of projects I usually take on is that I try to focus on things that are pragmatic, that are small steps toward a better science. It can sometimes feel a bit overwhelming. The good news is there are so many initiatives to improve our field, and I can imagine that, especially if you’re an early-career researcher, you don’t know where to start. I think that with small tools like statcheck, and many other initiatives are similar, you can just cherry-pick your favorite. Try one, see what happens. I think Christina Bergmann calls this the buffet approach. You have this entire table full of open science practices, but you cannot eat them all at once. Just take some samples, try some stuff out, see what works for you and your paper, and in that way get involved with the new developments. 

[00:17:26.150] – APS’s Özge Gürcanlı Fischer Baum 

Yeah. Thank you very much, Michele. This was a pleasure. Thank you so much for joining Under the Cortex. 

[00:17:33.090] – Michele Nuijten 

Thank you for having me. 

[00:17:34.790] – APS’s Özge Gürcanlı Fischer Baum 

This is Özge Gürcanlı Fischer Baum with APS, and I have been speaking with Michele Nuijten from Tilburg University. If you want to know more about this research, visit psychologicalscience.org. 

