How ‘AI’ Could Lead to a Rise in Research Slop
GenAI may make statistical abuse much easier to implement.
Nominal News is an economics newsletter written by a PhD Economist that translates the latest economic research into clear, policy‑relevant insights on current issues.
At Nominal News, we often talk about economics research that establishes causality: for example, how immigrants affect innovation, or how free trade agreements affect people’s incomes and life expectancy. This causality is typically established using statistical methods, most often linear regression analysis.
Linear regression is an extremely powerful tool – nearly every existing economic policy can be linked to a research paper that uses linear regression to evaluate it. However, this tool can also be very easily abused to make it seem like there is ‘causality’ when there isn’t. The rise of genAI can make it much easier to manipulate data and present false research.
Linear Regression and Statistical Significance
Linear regression is a statistical method that describes the relationship between an outcome variable and one or more explanatory variables. For example: how does educational attainment affect income; how do vaccines affect survival; how does adding one road lane change congestion?
To properly conduct a linear regression analysis, a researcher should follow these steps:
Stipulate a hypothesis – e.g. a vaccine prevents the disease;
Establish the “null” hypothesis – e.g. the vaccine does not prevent the disease;
Collect data;
Undertake a linear regression analysis;
Conclude by rejecting or failing to reject the null.
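Steps 3 to 5 can be sketched in a few lines of Python. This is a minimal illustration, not a production analysis: the trial data are made up (1 = caught the disease, 0 = did not), `ols_binary` is a hypothetical helper written for this example rather than a library function, and the p-value uses a large-sample normal approximation for a two-sided test of "the slope is zero". With a single 0/1 regressor, the OLS slope is simply the difference in disease rates between the vaccinated and unvaccinated groups.

```python
import math


def ols_binary(outcome, treated):
    """Regress `outcome` on a constant and a 0/1 `treated` dummy by OLS.

    Returns (slope, standard_error, two_sided_p_value). The p-value uses a
    normal approximation to the t distribution, which is only reasonable for
    large samples (an assumption of this sketch).
    """
    n = len(outcome)
    x_mean = sum(treated) / n
    y_mean = sum(outcome) / n
    sxx = sum((x - x_mean) ** 2 for x in treated)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(treated, outcome))
    slope = sxy / sxx  # estimated effect of vaccination on the disease rate
    intercept = y_mean - slope * x_mean
    # Residual variance with n - 2 degrees of freedom (intercept + slope).
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(treated, outcome))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_stat = slope / se
    # Two-sided p-value: chance of a |t| at least this large if the null is true.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, se, p_value


# Step 3 (hypothetical data): 1 = caught the disease, 0 = did not.
vaccinated = [1] * 20 + [0] * 980
unvaccinated = [1] * 80 + [0] * 920
outcome = vaccinated + unvaccinated
treated = [1] * len(vaccinated) + [0] * len(unvaccinated)

# Steps 4 and 5: run the regression, then compare the p-value with 5%.
slope, se, p_value = ols_binary(outcome, treated)
ALPHA = 0.05
decision = "reject the null" if p_value < ALPHA else "fail to reject the null"
print(f"effect = {slope:.3f}, p-value = {p_value:.2g} -> {decision}")
```

Here the estimated effect is a 6 percentage point drop in the disease rate, and the p-value is far below 5%, so the sketch rejects the null. In practice, a statistics package would report the same slope along with an exact t-based p-value.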
The final step is what determines whether we treat an outcome as ‘causal’ or not: if we ‘reject the null’, there is evidence for the alternative hypothesis, i.e. evidence that the vaccine does prevent the disease (provided the study design, such as a randomized trial, rules out other explanations).
How Do We Reject/Fail to Reject the Null
So how exactly do we decide whether to reject the null hypothesis? We start from the assumption that the null hypothesis is true – in our example, that the vaccine does not prevent the disease. Once we collect the data, we perform the linear regression analysis, which tells us how likely we would be to observe data at least as extreme as ours IF the null hypothesis were true. This probability is called a “p-value”.
In our example, suppose that in the data collected, few of the vaccine recipients got the disease, while many of the non-vaccinated individuals did. Then, if the null hypothesis were true (i.e. the vaccine does not prevent the disease), the probability of collecting such data would be very low, and we would get a low p-value.
The decision to reject or fail to reject the null hypothesis depends on this p-value. If the p-value is lower than 5%, we reject the null; if it is higher, we fail to reject it. It is worth pointing out that there is no mathematical reason for this 5% cutoff; it is simply a scientific convention. A p-value of 5% tells us that, if the null were true, there would be a 5% probability of observing data at least as extreme as the data we actually collected.
Since a 5% chance is considered unlikely, we ‘reject the null’, i.e. we abandon our starting assumption. In the vaccine example, we would reject the hypothesis that the vaccine does not prevent the disease. Rejecting the null is often referred to as a ‘statistically significant’ result. It is important to note, however, that when there is truly no relationship, around 5% of tests will still wrongly show statistical significance. That is, 5% of the time we will have collected data that make it look like there is a ‘causal relationship’ (a statistically significant one) between the variables in question when, in fact, there is no such relationship.
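In symbols, using a standard textbook convention rather than notation from this article: let T be the test statistic (for example, the estimated regression coefficient divided by its standard error), t_obs its value in our data, and H_0 the null hypothesis. For a two-sided test, the p-value and the decision rule are:

```latex
p = \Pr\bigl(|T| \ge |t_{\mathrm{obs}}| \,\big|\, H_0\bigr),
\qquad
\text{reject } H_0 \text{ if } p < \alpha, \quad \alpha = 0.05 .
```

The threshold α = 0.05 is the conventional 5% cutoff, not a mathematical necessity.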
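That 5% false-positive rate can be checked by simulation. The Python sketch below generates random data for many hypothetical trials of a vaccine that, by construction, does nothing, and counts how often a two-sided test still gives p < 0.05. The helper names and parameters (group size, 10% disease risk, seed) are illustrative assumptions. For simplicity, the sketch uses a large-sample pooled two-proportion z-test in place of the regression. With one 0/1 regressor, this test gives nearly the same answer as testing the regression slope.

```python
import math
import random


def two_proportion_p_value(sick_a, n_a, sick_b, n_b):
    """Two-sided p-value for a difference in disease rates (pooled z-test)."""
    pooled = (sick_a + sick_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # nobody (or everybody) got sick: no evidence either way
        return 1.0
    z = (sick_a / n_a - sick_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_studies=2000, n_per_group=300, risk=0.1,
                        alpha=0.05, seed=1):
    """Simulate many studies of a vaccine that truly does nothing and return
    the share that nonetheless come out 'statistically significant'."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_studies):
        # Under the null, both groups face exactly the same disease risk.
        sick_vax = sum(rng.random() < risk for _ in range(n_per_group))
        sick_unvax = sum(rng.random() < risk for _ in range(n_per_group))
        p = two_proportion_p_value(sick_vax, n_per_group, sick_unvax, n_per_group)
        if p < alpha:
            significant += 1
    return significant / n_studies


rate = false_positive_rate()
print(f"Share of 'significant' studies with no real effect: {rate:.3f}")
```

The printed share should land close to 0.05. Changing `seed` moves the exact number slightly but not the pattern: roughly one in twenty no-effect studies looks ‘significant’ by chance alone.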
‘P-Hacking’
This 5% cutoff is often crucial for researchers, since they generally cannot claim to have found a result (for example, that the vaccine reduces disease) unless the p-value is below 5%. Naturally, this creates an incentive to push one’s p-value below 5%, potentially by manipulating the data. Any such practice is colloquially referred to as ‘p-hacking’.
The most obvious way to “p-hack” is simply to delete or alter collected data so that the p-value falls below 5%. This is a clear case of falsifying data, and it does happen: a recent high-profile scandal at Harvard involved exactly this type of data manipulation. However, there are also more subtle ways to ‘p-hack’.
Iterating on the Collected Data
One way to ‘p-hack’ is to look at subsets of the collected data. For example, focusing only on certain demographic groups within the data might yield a p-value of less than 5%, which would then allow the researcher to claim a statistically significant result.
In our case, suppose the full vaccine data did not produce a p-value below 5%, but looking only at people aged 18-29 did. Researchers might then claim that the vaccine works for people aged 18-29.
But here is the problem: if we look at all possible demographic sub-groups, of which there can be hundreds, we are almost certain to find one with a p-value of 5% or less. The researcher will present this finding and claim a statistically significant result. Yet finding such a sub-group is nearly guaranteed: by the design of the 5% cutoff, even when the vaccine has no effect at all, each sub-group we test has about a 5% chance of producing data that we will incorrectly conclude is statistically significant!
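A quick calculation shows why. Suppose the vaccine truly does nothing in every sub-group, each test therefore has a 5% false-positive chance, and, as a simplifying assumption, the tests are independent. Then the chance that at least one of k sub-groups comes out ‘significant’ is 1 - 0.95^k. Real demographic sub-groups overlap, so their tests are not fully independent and this is only an approximation. The Python sketch below computes that probability and also runs a hypothetical sub-group search on randomly generated data. All function names and parameters are illustrative, not from any library.

```python
import math
import random


def prob_any_false_positive(k, alpha=0.05):
    """Chance that at least one of k independent tests of true null
    hypotheses has p < alpha."""
    return 1 - (1 - alpha) ** k


def two_proportion_p_value(sick_a, n_a, sick_b, n_b):
    """Two-sided pooled z-test p-value for a difference in disease rates."""
    pooled = (sick_a + sick_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (sick_a / n_a - sick_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def search_subgroups(n_subgroups=100, n_per_arm=200, risk=0.1,
                     alpha=0.05, seed=7):
    """Hypothetical sub-group search: the vaccine has no effect in ANY of the
    made-up sub-groups, yet we test each one and keep the 'significant' ones."""
    rng = random.Random(seed)
    hits = []
    for group in range(n_subgroups):
        sick_vax = sum(rng.random() < risk for _ in range(n_per_arm))
        sick_unvax = sum(rng.random() < risk for _ in range(n_per_arm))
        p = two_proportion_p_value(sick_vax, n_per_arm, sick_unvax, n_per_arm)
        if p < alpha:
            hits.append((group, p))
    return hits


print(round(prob_any_false_positive(20), 2))   # 0.64
print(round(prob_any_false_positive(100), 3))  # 0.994
hits = search_subgroups()
print(f"{len(hits)} of 100 no-effect sub-groups look 'significant'")
```

With 20 sub-groups, the odds of at least one false ‘discovery’ are already about 64%. With 100 sub-groups, they exceed 99%. Any of the `hits` could be written up as a finding, even though no effect exists in the simulated data.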
Why ‘AI’ May Make P-Hacking Easier
GenAI may make such p-hacking much easier. Because genAI can perform data analysis and write computer code, it makes it far easier to search through many subsets of data for one that clears the 5% p-value threshold. Such a result can then be presented as a novel research finding rather than as ‘p-hacking’.
Of course, everything described here was possible before genAI. But the ease of doing it with genAI may significantly increase the amount of research that is p-hacked.
How To Detect P-Hacking
Detecting p-hacking is difficult. Let’s start with what a good study should do:
Explain why a hypothesis may make sense in the first place: for example, if a vaccine should work differently on people of different ages, what prior research or theory supports this idea?
Pre-register the hypothesis: before running an experiment or collecting data, record the hypothesis publicly (for example, on the Open Science Framework).
Make data available: all of the study’s data and analysis should be publicly available and reviewable.
These three elements substantially increase confidence that the research results are legitimate. Based on these points, some p-hacking red flags are:
Statistically significant outcomes that do not seem intuitive or lack theory backing the hypothesis;
Outcomes that appear to be confined to an oddly specific sub-group.
In a famous example, John Bohannon deliberately created a p-hacked research study titled ‘Chocolate with High Cocoa Content as a Weight Loss Accelerator’, which claimed that eating chocolate led to statistically significant weight loss. Media outlets around the world reported the finding without questioning it or noticing its easy-to-spot methodological flaws (the study had only 15 participants!). Bohannon later revealed that he had designed the fake study to show how easily false conclusions can spread.
The Incentives of P-Hacking
P-hacking occurs because of the pressures researchers face in a ‘publish-or-perish’ world. To get published, researchers often need to show statistically significant results, as ‘null’ (statistically insignificant) results are treated as ‘uninteresting’, even though they shouldn’t be: the absence of a statistically significant result is just as important for science.
Until academic incentives are reformed to discourage ‘p-hacking’, it will continue to occur, and erroneous research will be picked up by the media. As readers, we can stay vigilant by being a bit more skeptical of headlines when something does not seem right.