Beyond p-values

We know the p-value often steals the spotlight in the research world, yet it is one of the most misunderstood and misused concepts in statistics. In recent months, I have been reading articles about what p-values truly represent, how they are meant to be used, and why current practices often fall short and mislead the scientific community. More importantly, I wanted to understand what a healthier practice for reporting scientific data looks like. Here is a summary of what I took away from those articles:

At its core, a p-value is a measure that helps us judge whether the results of an experiment are statistically significant. Specifically, it tells us the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis is true. The null hypothesis typically states that there is no effect or no difference in the experiment. For example, if we are testing different types of fertiliser, the null hypothesis might be that there is no difference in crop yield between the fertiliser types.

Imagine we run an experiment comparing two fertilisers and find that Fertiliser A yields 50 kg per acre on average while Fertiliser B yields 45 kg per acre. After running a statistical test, we obtain a p-value of 0.04. This means that if the null hypothesis were true, there would be only a 4% chance of observing a difference this large (or larger). Because 0.04 is below the conventional threshold of 0.05, we might conclude that Fertiliser A is more effective.

But here's the catch: this interpretation oversimplifies and often misleads. A p-value does not tell us the probability that the null hypothesis is true or false. Nor does it indicate the size or practical significance of the effect. It merely helps assess whether the observed effect could plausibly be due to random chance. This distinction is subtle but critical.
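To make this concrete, here is a minimal sketch of how such a p-value can be computed by simulation. The plot-level yield numbers are invented for illustration, and a permutation test stands in for whatever test an actual analysis might use: under the null hypothesis the group labels are interchangeable, so we shuffle them and count how often a difference at least as extreme as the observed one arises by chance.

```python
import random

random.seed(42)

# Hypothetical per-plot yields in kg/acre; these numbers are made up
# for illustration, chosen so the group means are roughly 50 and 45.
yields_a = [52, 48, 51, 55, 47, 50, 53, 49, 46, 54]
yields_b = [44, 47, 42, 46, 48, 45, 43, 49, 41, 46]

observed_diff = sum(yields_a) / len(yields_a) - sum(yields_b) / len(yields_b)

# Permutation test: shuffle the pooled labels many times and count how
# often a difference at least as extreme as the observed one appears.
pooled = yields_a + yields_b
n_a = len(yields_a)
n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
    if abs(diff) >= abs(observed_diff):
        count += 1

p_value = count / n_perm
print(f"observed difference: {observed_diff:.1f} kg/acre, p ~ {p_value:.4f}")
```

With groups this cleanly separated, almost no shuffled arrangement produces as large a gap, so the p-value comes out very small.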

The 0.05 trap. One of the main issues with p-values is the arbitrary threshold of 0.05 for statistical significance. This cutoff promotes a binary mindset in which results are deemed either significant or not, ignoring the nuances of the data. For instance, a p-value of 0.049 is considered significant while a p-value of 0.051 is not, even though the two convey nearly identical evidence. This rigid threshold can cause important findings to be overlooked, or trivial findings to be overemphasized, based solely on their p-value.

Even worse, reliance on p-values encourages questionable research practices, commonly known as p-hacking. Researchers may tweak data collection or analysis methods until they achieve a p-value below 0.05, thereby "discovering" significant results that may not be truly meaningful. This practice undermines the reliability of the scientific literature and contributes to the replication crisis, in which many published findings cannot be reproduced in subsequent studies. Who is to blame? The sole reliance on p-values is certainly part of the story.
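A quick simulation shows why testing many things and reporting only the "significant" ones is so dangerous. In the sketch below, all 20 hypothetical outcomes are pure noise (the null hypothesis is true for every one of them), yet with a 0.05 threshold the chance of at least one false positive across the batch is about 64%. The z-test helper is a stdlib-only normal approximation, not any particular published method:

```python
import math
import random

random.seed(7)

def z_test_p(x, y):
    """Two-sided p-value for a two-sample z-test (normal approximation)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # P(|Z| >= |z|) for a standard normal, via the error function.
    return 1 - math.erf(abs(z) / math.sqrt(2))

# Simulate 20 independent "outcomes" where both groups come from the
# same distribution, so every "significant" result is a false positive.
n_tests = 20
false_positives = 0
for _ in range(n_tests):
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if z_test_p(a, b) < 0.05:
        false_positives += 1

# With 20 tests at alpha = 0.05, the chance of at least one false
# positive is 1 - 0.95**20, roughly 0.64.
print(false_positives, round(1 - 0.95 ** 20, 2))
```

A researcher who runs 20 such comparisons and writes up only the ones that cross the 0.05 line is, in effect, reporting noise.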

So, what do we do? How can we improve the way we report and interpret scientific research? One key approach is to complement p-values with other statistical measures that provide more context and information. Effect sizes and confidence intervals are two such measures that can significantly enhance the quality of reporting.

Effect size quantifies the magnitude of the difference or relationship observed in a study. For example, knowing that Fertiliser A increases yield by 5 kg per acre on average provides concrete information about its effectiveness, which a p-value alone cannot offer.

Confidence intervals, on the other hand, give a range of values within which the true effect is likely to lie, providing insight into the precision and uncertainty of the estimate. A narrow interval indicates high precision, while a wide interval suggests more uncertainty. Let's revisit our fertiliser example. Suppose the effect size is an increase of 5 kg per acre, with a 95% confidence interval of 2 to 8 kg per acre. This interval tells us the data are consistent with a true effect anywhere between 2 and 8 kg per acre. Even if the p-value were slightly above 0.05, the interval would still convey valuable information about the potential benefit of the fertiliser.
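The yield numbers below are invented, but they show how an effect size and a 95% confidence interval can be computed directly from two samples. For simplicity this sketch uses a normal approximation (z = 1.96); for samples this small, a t critical value would be more rigorous:

```python
import math

# Hypothetical plot-level yields in kg/acre, made up for illustration.
yields_a = [52, 48, 51, 55, 47, 50, 53, 49, 46, 54]
yields_b = [44, 47, 42, 46, 48, 45, 43, 49, 41, 46]

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    """Sample variance (n - 1 denominator)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Effect size: the raw difference in mean yield between the fertilisers.
effect = mean(yields_a) - mean(yields_b)

# Standard error of the difference in means, then a 95% CI via the
# normal approximation (effect +/- 1.96 * SE).
se = math.sqrt(var(yields_a) / len(yields_a) + var(yields_b) / len(yields_b))
low, high = effect - 1.96 * se, effect + 1.96 * se
print(f"effect: {effect:.1f} kg/acre, 95% CI: ({low:.1f}, {high:.1f})")
```

Reporting the interval alongside the point estimate tells the reader not just that the fertiliser seems to help, but how large the benefit plausibly is and how precisely it has been pinned down.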

So, should we ditch p-values entirely? Absolutely not! They're still useful tools when used correctly. But it's time we stopped treating them as the be-all and end-all of research. Instead, we should complement them with a more nuanced approach: effect sizes, confidence intervals, and an honest interpretation of what our results mean in the real world.

_________

Author: Jamil Chowdhury

Image by: DALL-E 3

12 July 2024

References: