We know the p-value often steals the spotlight in the research world, yet it is one of the most misunderstood and misused concepts in statistics. Over the past few months, I have been reading articles on what p-values truly represent, how they are meant to be used, and why current practices often fall short and mislead the scientific community. More importantly, I wanted to understand what healthier reporting of scientific data looks like. Here is a summary of what I took away from these articles:
At its core, a p-value is a measure that helps us judge whether the results of an experiment are statistically significant. Specifically, it tells us the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis is true. The null hypothesis typically states that there is no effect or no difference in the experiment. For example, if we are testing different types of fertiliser, the null hypothesis might be that fertiliser type makes no difference to crop yield. Imagine we run an experiment comparing two fertilisers and find that Fertiliser A gives an average yield of 50 kg per acre while Fertiliser B gives 45 kg per acre. After running a statistical test, we obtain a p-value of 0.04. This p-value suggests that there is only a 4% chance of observing such a difference (or a more extreme one) if the null hypothesis were true. Because 0.04 is less than the conventional threshold of 0.05, we might conclude that Fertiliser A is more effective. But here's the catch: this interpretation oversimplifies and often misleads. A p-value does not tell us the probability that the null hypothesis is true or false. It also does not indicate the size or practical significance of the effect. It merely helps assess whether the observed difference could plausibly be explained by random chance. This distinction is subtle but critical.
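To make this concrete, here is a minimal sketch in Python of how such a p-value might be computed with a two-sample (Welch's) t-test. The per-plot yield numbers and the choice of test are my own assumptions for illustration, not data from a real experiment, so the printed values will not exactly match the rounded numbers quoted above.

```python
# A minimal sketch: Welch's two-sample t-test on hypothetical per-plot yields.
# The numbers below are invented for illustration, not real data.
import numpy as np
from scipy import stats

yield_a = np.array([54, 46, 53, 47, 55, 45])  # plots given Fertiliser A (kg/acre)
yield_b = np.array([48, 42, 47, 43, 46, 44])  # plots given Fertiliser B (kg/acre)

# Welch's t-test does not assume the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(yield_a, yield_b, equal_var=False)

print(f"Mean A = {yield_a.mean():.1f}, Mean B = {yield_b.mean():.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# The p-value is the probability of a difference at least this extreme
# *if* the null hypothesis (no difference between fertilisers) were true.
```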
The 0.05 Trap: One of the main issues with p-values is the arbitrary threshold of 0.05 for statistical significance. This cutoff promotes a binary mindset where results are deemed either significant or not significant, ignoring the nuances of the data. For instance, a p-value of 0.049 is considered significant, while a p-value of 0.051 is not, despite the two being nearly identical in meaning. This rigid threshold can lead to important findings being overlooked or trivial findings being overemphasized based solely on their p-value.
Even worse, reliance on p-values encourages questionable research practices, commonly known as p-hacking. Researchers may tweak data collection or analysis methods until they achieve a p-value below 0.05, thereby "discovering" significant results that may not be truly meaningful. This practice undermines the reliability of the scientific literature and contributes to the replication crisis, in which many published findings cannot be reproduced in subsequent studies. Who is to blame? Over-reliance on the p-value alone is a big part of the story.
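To see why this is so corrosive, here is a small simulation sketch (my own illustration, not taken from any of the cited papers) of one common form of p-hacking: peeking at the data repeatedly and stopping as soon as p drops below 0.05. Even when there is truly no effect, this inflates the false positive rate well beyond the nominal 5%.

```python
# Simulation sketch: optional stopping ("peeking") inflates false positives.
# Both groups are drawn from the SAME distribution, so any "significant"
# result is a false positive. The simulation sizes are arbitrary choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, max_n, start_n, step = 2000, 100, 10, 5
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(0, 1, max_n)
    b = rng.normal(0, 1, max_n)  # identical population: the null is true
    for n in range(start_n, max_n + 1, step):
        _, p = stats.ttest_ind(a[:n], b[:n])
        if p < 0.05:             # stop and "publish" at the first significant look
            false_positives += 1
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.2%}")
# Typically well above 5%, versus roughly 5% if we had tested once at n = 100.
```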
So, what do we do? How can we improve the way we report and interpret scientific research? One key approach is to complement p-values with other statistical measures that provide more context and information. Effect sizes and confidence intervals are two such measures that can significantly enhance the quality of reporting.
Effect size quantifies the magnitude of the difference or relationship observed in the study. For example, knowing that Fertiliser A increases yield by 5 kg per acre on average provides concrete information about its effectiveness, which a p-value alone cannot offer. Confidence intervals, on the other hand, give a range of values within which the true effect is likely to lie. They provide insight into the precision and uncertainty of the estimate: a narrow confidence interval indicates high precision, while a wide interval suggests more uncertainty. Let's consider our fertiliser example again. Suppose the effect size is an increase of 5 kg per acre with a 95% confidence interval of 2 to 8 kg per acre. This interval tells us that the data are most compatible with a true benefit somewhere between 2 and 8 kg per acre. Even if the p-value were slightly above 0.05, this interval would still convey valuable information about the potential benefits of the fertiliser.
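Staying with the fertiliser example, a minimal sketch of reporting the effect size and its 95% confidence interval alongside the p-value might look like this. The yield numbers are the same invented ones as before, and I assume a Welch-style t-interval for the difference in means, so the printed values will differ from the rounded figures in the text.

```python
# Sketch: report the effect size (mean difference) and its 95% CI, not just p.
import numpy as np
from scipy import stats

yield_a = np.array([54, 46, 53, 47, 55, 45])  # Fertiliser A, kg per acre
yield_b = np.array([48, 42, 47, 43, 46, 44])  # Fertiliser B, kg per acre

diff = yield_a.mean() - yield_b.mean()          # effect size (kg/acre)
va = yield_a.var(ddof=1) / len(yield_a)         # variance of mean A
vb = yield_b.var(ddof=1) / len(yield_b)         # variance of mean B
se = np.sqrt(va + vb)                           # standard error of the difference

# Welch-Satterthwaite degrees of freedom for the t-based interval
df = (va + vb) ** 2 / (va ** 2 / (len(yield_a) - 1) + vb ** 2 / (len(yield_b) - 1))
t_crit = stats.t.ppf(0.975, df)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

_, p = stats.ttest_ind(yield_a, yield_b, equal_var=False)
print(f"Mean difference = {diff:.1f} kg/acre, "
      f"95% CI [{ci_low:.1f}, {ci_high:.1f}], p = {p:.3f}")
# With so few plots the interval is wide: the data are compatible with both
# a small and a large benefit, which is exactly what a p-value alone hides.
```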
So, should we ditch p-values entirely? Absolutely not! They're still useful tools when used correctly. But it's time we stopped treating them as the be-all and end-all of research. We should complement them with a more nuanced approach that includes effect sizes, confidence intervals, and an honest interpretation of what our results mean in the real world.
Here are some common misinterpretations of p-values:
“p is the probability the null hypothesis is true.”
Wrong. A p-value is the probability of getting results as extreme as, or more extreme than, what you observed if the null model and all its assumptions were true.
“p is the probability the result was produced by chance alone.”
Wrong. The calculation already assumes chance alone under the null. It does not tell you how likely “chance alone” produced your data.
“p ≤ 0.05 means the null is false and should be rejected.”
Not guaranteed. A small p can arise from assumption violations, bias, model misspecification, measurement error, or random flukes. It indicates data are less compatible with the null model, not that the null is definitively false.
“p > 0.05 means the null is true and should be accepted.”
Wrong. Large p indicates the data are more compatible with the null model, but you may simply lack precision or power. “Failure to detect” is not “evidence of no effect.”
“Non-significant means no effect was observed.”
Misleading. Unless the point estimate is exactly the null value, an effect was observed; it just was not precise enough to rule out the null. Always show the estimate and its confidence interval.
“Non-significant means the effect size is small.”
Not necessarily. Large effects can look non-significant in small or noisy samples. Precision, not only magnitude, drives p.
“p = 0.05 means the observed result would occur 5% of the time.”
Not quite. It refers to the probability of results as or more extreme than yours under the null, not the probability of your exact result.
“If p ≤ 0.05 and I reject H0, the chance I’m wrong is 5%.”
Incorrect. 5% is a long-run Type I error rate across many repetitions under the null, not the probability your specific decision is wrong. The false positive risk depends on prior plausibility and study power.
“p = 0.05 and p < 0.05 mean the same thing.”
They are not identical. Exact p-values convey more information than thresholded statements. Avoid “borderline” narratives; treat p as a continuum.
“Reporting p as inequalities (p < 0.02, p > 0.05) is fine.”
Poor practice. Report the exact p-value when possible, and avoid “p = 0.000.” If extremely small, state “p < 0.001” and provide the test statistic.
“Statistical significance is a property of the phenomenon.”
No. It is a property of the test result relative to an arbitrary cutoff. Effects exist or not; “significant” is our label.
“Always use two-sided p-values.”
Not always. One-sided tests can be appropriate if a direction is pre-specified and only one direction is meaningful. Decide and justify before seeing data.
“Multiple non-significant studies mean there is no effect.”
Not necessarily. Combined evidence (meta-analysis, e.g., Fisher's method) can be significant even when each small study is not; a minimal sketch of Fisher's method appears after this list.
“Two studies with p on opposite sides of 0.05 are conflicting.”
Maybe, but check comparability: populations, designs, assumptions, and confidence intervals. Differences in power or bias can explain the discrepancy.
“Two studies with similar p-values are in agreement.”
Not guaranteed. Similar p can mask very different effect sizes and precisions. Compare effect estimates and confidence intervals.
“A small p validates the model assumptions.”
No. p-values do not diagnose confounding, selection bias, or model misfit. In observational studies, assumption checks and sensitivity analyses are essential.
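As promised above, here is a minimal sketch of Fisher's method for combining p-values across independent studies, using scipy's combine_pvalues. The three study p-values are hypothetical and chosen purely to illustrate the point that none is individually below 0.05 yet the combined evidence is.

```python
# Sketch: Fisher's method combines p-values from independent studies.
# Three hypothetical small studies, none individually "significant":
from scipy import stats

p_values = [0.08, 0.11, 0.09]  # invented for illustration

chi2_stat, combined_p = stats.combine_pvalues(p_values, method="fisher")
print(f"Combined chi-square = {chi2_stat:.2f}, combined p = {combined_p:.4f}")
# The combined evidence can be considerably stronger than any single study suggests.
```

A proper meta-analysis would of course also pool effect sizes, not just p-values, but the sketch makes the narrower point: several "non-significant" studies are not evidence of no effect.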
Here are some practical recommendations for healthier reporting:
Report the effect size and 95% confidence interval first; give p-values as supporting detail.
Treat p as a compatibility measure with the null model, not a truth machine.
Pre-specify the alpha level and sidedness; avoid post-hoc switching.
Discuss assumptions and potential biases, and run sensitivity analyses (a small example follows this list).
Avoid bright-line thinking; interpret evidence on a spectrum rather than “significant/non-significant.”
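As one small example of the sensitivity-analysis point above, here is a sketch of a percentile-bootstrap 95% confidence interval for the mean yield difference. It reuses the invented numbers from the earlier sketches and avoids the normality assumption behind the t-based interval; the number of resamples is an arbitrary choice.

```python
# Sketch: a percentile-bootstrap 95% CI for the mean yield difference,
# as one simple sensitivity check on the t-based interval used earlier.
import numpy as np

rng = np.random.default_rng(7)
yield_a = np.array([54, 46, 53, 47, 55, 45])
yield_b = np.array([48, 42, 47, 43, 46, 44])

boot_diffs = [
    rng.choice(yield_a, size=len(yield_a), replace=True).mean()
    - rng.choice(yield_b, size=len(yield_b), replace=True).mean()
    for _ in range(10_000)
]

ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"Bootstrap 95% CI for the mean difference: [{ci_low:.1f}, {ci_high:.1f}] kg/acre")
# Comparing this with the t-based interval is a quick check on how much
# the normality assumption is driving the reported uncertainty.
```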
Feel free to use my Shiny app here, which can help you report the effect size and 95% confidence interval along with p-values.
_________
Author: Jamil Chowdhury
Image by: DALL-E 3
12 July 2024
References:
Button, K. S., et al. (2013). "Power failure: why small sample size undermines the reliability of neuroscience." Nature Reviews Neuroscience.
Cohen, J. (1994). "The earth is round (p < .05)." American Psychologist.
Gardner, M. J., & Altman, D. G. (1986). "Confidence intervals rather than P values: estimation rather than hypothesis testing." British Medical Journal.
Greenland, S., et al. (2016). "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European Journal of Epidemiology.
Halsey, L. G., et al. (2015). "The fickle P value generates irreproducible results." Nature Methods.
Nuzzo, R. (2014). "Scientific method: statistical errors." Nature.
Sullivan, G. M., & Feinn, R. (2012). "Using effect size—or why the P value is not enough." Journal of Graduate Medical Education.