P-Values & Statistical Significance: What They Mean

Updated 2026-01-18

Summary: A p-value measures the probability of observing data at least as extreme as yours if the null hypothesis (the treatment does not work) were true; it is not the probability that the treatment works. Statistical significance (p < 0.05) indicates evidence against the null hypothesis, but it does not prove the treatment is effective, important, or will work in future studies. Multiple independent replications reaching the same conclusion provide stronger evidence. Common misconceptions (that p < 0.05 means a 95% probability of effectiveness, that p > 0.05 proves no effect, or that lower p-values indicate larger or more important effects) lead to misinterpretation, so always pair p-values with confidence intervals and effect sizes to assess both statistical precision and practical meaning. A single study with p < 0.05 can be a statistical fluke; trustworthy evidence emerges from multiple studies reaching similar conclusions, large sample sizes producing narrow confidence intervals, and results showing both statistical significance and practical importance.

This research article explains what p-values and statistical significance actually measure, reveals common misconceptions, and teaches you to interpret research results correctly.

Understanding the P-Value

A p-value is a number between 0 and 1 that measures the strength of evidence against a “null hypothesis.”

What Is the Null Hypothesis?

The null hypothesis is the assumption that nothing works—that there is no real difference between groups.

Examples of null hypotheses:

“BPC-157 does not improve healing compared to placebo”
“Peptide X does not increase strength compared to no treatment”
“TB-500 does not reduce inflammation compared to saline”

The null hypothesis is the default assumption. To claim a treatment works, researchers must show strong evidence against this assumption.

What Does a P-Value Actually Measure?

A p-value answers this specific question: “If the null hypothesis is true (the treatment does not work), what is the probability of observing results at least as extreme as what I actually found?”

Important: The p-value does NOT tell you the probability that the null hypothesis is true. It tells you something different: the probability of data at least as extreme as yours, given that the null hypothesis is true.
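One way to make this definition concrete is a permutation test, sketched below. It imitates a world where the null hypothesis is true by randomly reshuffling which participants count as "treatment" and which as "control", then counts how often a shuffled difference is at least as extreme as the observed one. The healing-time numbers and the function name permutation_p_value are hypothetical, chosen for illustration; real studies often use other tests (such as the t-test), which answer the same question under different assumptions.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Under the null hypothesis the group labels are arbitrary, so we shuffle
    them and count how often the shuffled difference is at least as large
    (in absolute value) as the one actually observed.
    """
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_t = len(treatment)

    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_t]) - mean(pooled[n_t:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # The +1 counts the observed labelling itself, so the estimate is never 0.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Hypothetical healing times in days (illustrative data, not from any study).
treated = [10, 11, 12, 10, 11]
placebo = [20, 21, 19, 22, 20]
print(permutation_p_value(treated, placebo))
```

With these clearly separated groups, very few shuffles reproduce a gap as large as the observed one, so the p-value comes out well below 0.05. If both groups held identical values, every shuffle would be "at least as extreme" and the p-value would be 1.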

Understanding P-Value Numbers

P-value interpretation:

P = 0.05 (5%): If the treatment did not work, there is a 5% chance you would see results this extreme just by random variation
P = 0.01 (1%): If the treatment did not work, there is only a 1% chance you would see results this extreme
P = 0.50 (50%): If the treatment did not work, there is a 50% chance you would see results this extreme (weak evidence against null)
P > 0.05: Typically considered “not statistically significant”; the evidence against the null hypothesis is too weak to reject it at the conventional threshold

The P < 0.05 Threshold

Scientists conventionally set p < 0.05 as the cutoff for statistical significance.

This means: if p < 0.05, researchers conclude the result is statistically significant—strong enough evidence against the null hypothesis.

But why 0.05? Historical convention. It is not magical; it is simply a widely agreed-upon standard.

Some fields use:

P < 0.05 (very common)
P < 0.01 (more stringent, stronger evidence)
P < 0.001 (very stringent, very strong evidence)

What Statistical Significance Actually Means

When a result is “statistically significant” (p < 0.05), it means:

“Under the assumption that the null hypothesis is true, results this extreme would occur by random chance less than 5% of the time. Therefore, the data are hard to reconcile with the null hypothesis.”

This does NOT mean:

“The treatment definitely works”
“There is a 95% probability the treatment works”
“The effect is large or important”
“The result will happen again if repeated”

It only means: results like these would be unusual if random variation alone were at work.

Common Misconceptions About P-Values

Researchers, doctors, and journalists frequently misinterpret p-values. Here are the biggest mistakes:

Misconception 1: P < 0.05 Means 95% Probability the Treatment Works

Wrong: “This study shows p = 0.03, so there is a 97% chance the treatment works.”

Correct: P = 0.03 means “if the treatment did not work, there would be only a 3% chance of seeing results this extreme.” This does not tell you the probability the treatment works.

The probability that the treatment works depends on:

How plausible the treatment is theoretically
Whether previous studies support it
How well the study was designed
Reproducibility across multiple studies

A p-value alone cannot determine the probability of effectiveness.
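To see why plausibility matters, here is a rough sketch of the arithmetic under stated assumptions. Suppose (hypothetically) that only 10% of the treatments tested in some field truly work, that studies have 80% power to detect a real effect, and that the threshold is 0.05. Among all results with p < 0.05, the share coming from treatments that truly work is then well below 95%. The prior proportion and the power are assumptions for illustration, not estimates for any real peptide, and the function name share_truly_effective is hypothetical.

```python
def share_truly_effective(prior, power, alpha=0.05):
    """Fraction of p < alpha results that come from truly effective treatments.

    prior: assumed fraction of tested treatments that truly work
    power: chance a study detects a real effect (1 - Type II error rate)
    alpha: significance threshold (the Type I error rate when nothing works)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Hypothetical field: 10% of tested treatments work, studies have 80% power.
print(round(share_truly_effective(0.10, 0.80), 2))  # 0.64
```

In this hypothetical, roughly a third of "significant" findings are false positives even though every study used p < 0.05. A more plausible treatment (a higher prior) or better-powered studies raise the share.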

Misconception 2: P > 0.05 Means the Treatment Does Not Work

Wrong: “P = 0.07, so the treatment does not work.”

Correct: P = 0.07 means the evidence is weak—you cannot confidently reject the null hypothesis. But this does not prove the null hypothesis is true.

The treatment might still work. Possible reasons p > 0.05:

Small sample size (not enough power to detect the effect)
Study was too short to reveal effects
Real effect exists but is small (requires larger studies to detect)

Many treatments have real effects but do not reach p < 0.05 in small studies.

Misconception 3: P-Value Measures Effect Size

Wrong: “P = 0.001 is a much stronger effect than p = 0.05.”

Correct: P-values measure strength of evidence against the null, not the magnitude of the effect.

A tiny effect can have p < 0.001 if tested in a large enough sample. A huge effect can have p = 0.06 in a small sample.

Example:

Study A: 10,000 participants, BPC-157 improved healing by 1%, p = 0.001
Study B: 50 participants, BPC-157 improved healing by 30%, p = 0.08

Study A has a much smaller p-value but the effect is tiny (1%). Study B has a larger p-value but the effect is large (30%).

Study A shows stronger evidence of any effect existing. Study B shows a larger practical effect but weaker statistical evidence.

These are different things.

Misconception 4: Lower P-Value Means Better Research

Wrong: “P = 0.001 is a better study than p = 0.05.”

Correct: A p-value depends on sample size, effect size, and data variability. It does not measure study quality directly.

A poorly designed study with a huge sample might have p < 0.001 for a trivial effect. A well-designed small study might have p = 0.08 for a real, substantial effect.

Study quality depends on randomization, blinding, control of confounders, and absence of bias—not the p-value itself.

Factors That Affect P-Values

Factor 1: Sample Size

Larger samples produce smaller p-values for the same effect size.

Example:

Small study: 30 participants, 15% improvement in treatment group, p = 0.12 (not significant)
Large study: 3,000 participants, 15% improvement in treatment group, p = 0.0001 (highly significant)

Same effect size, but the larger sample gives a much smaller p-value and reaches statistical significance.

This means: large studies can detect small real effects, and small studies might miss large real effects.

Factor 2: Effect Size

Larger effects produce smaller p-values for the same sample size.

Example:

Small effect: 5% improvement in treatment group, p = 0.08 (not significant)
Large effect: 50% improvement in treatment group, p = 0.0001 (highly significant)

Same sample size, but the larger effect gives a smaller p-value.

Factor 3: Variability in Data

More consistent data produces smaller p-values.

If all participants in the treatment group improve by about 10%, and all in the control group stay the same, the difference is clear (small p-value).

If some treatment participants improve by 5%, some by 20%, some by 0%, the pattern is fuzzy (larger p-value even if average improvement is 10%).
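The three factors can be seen together in one formula. Here is a minimal sketch, assuming a simple two-group comparison of means with a known, shared standard deviation (a z-test, which is a normal approximation; small real studies would usually use a t-test). The standard deviations and group sizes below are hypothetical, so the p-values will not match the article's illustrative numbers exactly, but the direction of each effect will.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p-value for a difference between two group means.

    Standard error of the difference = sd * sqrt(2 / n_per_group),
    z = difference / standard error, p = P(|Z| >= |z|) for a standard normal Z.
    """
    se = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2))


# Factor 1: same 15-point improvement, different sample sizes (SD assumed 30).
p_small_study = two_sided_p(15, sd=30, n_per_group=15)
p_large_study = two_sided_p(15, sd=30, n_per_group=1500)

# Factor 2: same sample size, small versus large effect.
p_small_effect = two_sided_p(5, sd=30, n_per_group=50)
p_large_effect = two_sided_p(50, sd=30, n_per_group=50)

# Factor 3: same 10-point average improvement, consistent versus noisy data.
p_consistent = two_sided_p(10, sd=5, n_per_group=30)
p_noisy = two_sided_p(10, sd=25, n_per_group=30)

for label, p in [("small study", p_small_study), ("large study", p_large_study),
                 ("small effect", p_small_effect), ("large effect", p_large_effect),
                 ("consistent data", p_consistent), ("noisy data", p_noisy)]:
    print(f"{label:>15}: p = {p:.2g}")
```

Each pair changes one ingredient of z = difference / standard error: sample size and variability act through the standard error, while effect size acts through the difference.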

P-Values in Context: The Importance of Multiple Studies

A single p-value from a single study is not strong evidence. Here is why:

The Multiple Comparisons Problem

When researchers test many different outcomes, some will reach p < 0.05 by chance alone, even if nothing real is happening.

Example: Imagine testing a peptide’s effect on 20 different measures:

Healing time
Inflammation
Pain
Strength
Range of motion
Recovery speed
(14 other measures)

If the peptide does nothing, you would expect about 1 of the 20 tests to reach p < 0.05 by random chance (5% × 20 = 1).

If a researcher reports only that one significant result and ignores the 19 non-significant ones, a false positive claim appears proven.
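The arithmetic behind this example is short. A sketch, assuming the 20 tests are independent and the peptide truly does nothing (in real studies outcomes such as pain and range of motion are usually correlated, which changes the exact numbers). Each test then has a 5% chance of p < 0.05, so the expected number of false positives is 20 × 0.05 = 1, and the chance of at least one is 1 - 0.95^20. The simulation relies on the fact that, when the null hypothesis is true, a p-value from a continuous test is equally likely to fall anywhere between 0 and 1.

```python
import random

ALPHA = 0.05
N_TESTS = 20

# Exact arithmetic, assuming independent tests and no real effect.
expected_false_positives = N_TESTS * ALPHA          # about 1
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS         # about 0.64


def simulate_at_least_one(n_studies=10_000, seed=1):
    """Fraction of null studies in which at least one of 20 outcomes has p < ALPHA.

    Under a true null, each p-value is drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < ALPHA for _ in range(N_TESTS)):
            hits += 1
    return hits / n_studies


print(expected_false_positives, round(p_at_least_one, 3), simulate_at_least_one())
```

So with 20 outcomes and nothing real going on, a researcher has roughly a two-in-three chance of finding at least one "significant" result to report.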

Why Replication Matters

A single statistically significant result, especially in a small study, may not be real.

The gold standard is multiple independent researchers:

1. Running similar studies

2. Getting similar results

3. All reaching p < 0.05

When this pattern emerges across many studies, you can be much more confident the effect is real and not a statistical fluke.
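A rough sketch of why independent replication is so persuasive, assuming the studies are truly independent, the treatment does nothing, and each uses a two-sided test at 0.05. A false positive in the beneficial direction then has probability 0.025 per study, so the chance that every one of k studies shows a "significant" benefit by luck alone shrinks geometrically. Real studies are rarely fully independent (shared methods, publication bias), so treat these as best-case figures. The function name is hypothetical.

```python
def chance_all_replicate_by_luck(k, alpha=0.05):
    """Chance that k independent null studies all reach p < alpha favouring the treatment.

    With a two-sided test, a false positive in a specific direction has
    probability alpha / 2, and independence lets us multiply across studies.
    """
    return (alpha / 2) ** k


for k in (1, 2, 3):
    print(k, chance_all_replicate_by_luck(k))
```

One fluke in 40 null studies is unremarkable; three independent flukes in a row, all in the same direction, would be extremely unlikely.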

Statistical Significance vs. Practical Significance

Statistical significance (p < 0.05) does not equal practical importance.

Example 1: Statistically Significant but Practically Meaningless

Study: 50,000 people given either BPC-157 or placebo. BPC-157 group healed in an average of 13.9 days, placebo in 14.0 days.

Result: p = 0.001 (highly statistically significant)

Practical significance: A difference of 0.1 day (2.4 hours) is so small it is not clinically meaningful. No one would prescribe a peptide for a 2.4-hour improvement in healing time.

This is statistically significant due to the huge sample size, but practically useless.

Example 2: Not Statistically Significant but Practically Important

Study: 20 people given either BPC-157 or placebo. BPC-157 group healed in 10 days, placebo in 20 days.

Result: p = 0.10 (not statistically significant)

Practical significance: A 50% reduction in healing time is huge. However, the small sample size prevents the result from reaching statistical significance. The effect might be real, but this study is too small to establish it.

This is practically important but not statistically significant.

The Lesson

Always ask two questions:

1. Is the result statistically significant? (p < 0.05?)

2. Is the result practically meaningful? (Is the size of the effect important?)

Both questions matter. Statistical significance without practical significance has little real-world value. Practical significance without statistical significance needs replication in larger studies.
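The two questions can be written as a small checklist function. A sketch, assuming you have decided in advance on the smallest effect that would matter in practice; that threshold (min_important here, a hypothetical name) is a clinical judgment, not something statistics supplies. The 1-day threshold used below is an assumption chosen to illustrate Examples 1 and 2 above.

```python
def two_question_check(p_value, effect, min_important, alpha=0.05):
    """Answer both questions: statistically significant? practically meaningful?

    effect and min_important must use the same units (e.g. days of healing saved).
    """
    significant = p_value < alpha
    meaningful = abs(effect) >= min_important
    if significant and meaningful:
        return "significant and meaningful"
    if significant:
        return "significant but too small to matter"
    if meaningful:
        return "potentially meaningful; replicate in a larger study"
    return "neither significant nor large enough to matter"


# Example 1: 0.1 day faster healing, p = 0.001; Example 2: 10 days faster, p = 0.10.
# The 1-day threshold for a meaningful improvement is a hypothetical choice.
print(two_question_check(0.001, effect=0.1, min_important=1.0))
print(two_question_check(0.10, effect=10, min_important=1.0))
```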

Reading P-Values and Confidence Intervals in Papers

When you read research, you will see p-values reported with confidence intervals (ranges around the estimate).

Understanding Confidence Intervals

A confidence interval is a range of numbers around the main result.

Example: “BPC-157 improved healing by 15% (95% CI: 8–22%)”

This means:

Best estimate: 15% improvement
The true effect is likely between 8% and 22%
If the study were repeated many times using the same methods, about 95% of the confidence intervals calculated this way would contain the true effect

Narrow CI: More precise estimate (good). Example: 15% improvement (95% CI: 13–17%)

Wide CI: Less precise estimate and more uncertainty (suggests a small sample or inconsistent data). Example: 15% improvement (95% CI: 0–40%)
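A short simulation makes the "95%" concrete. A sketch, assuming a hypothetical true improvement of 15 percentage points, a known standard deviation of 10, and 50 participants per simulated study; the intervals use the simple normal-approximation formula estimate ± 1.96 × standard error. The true value never changes; what varies from study to study is the interval, and about 95% of the intervals capture the true value.

```python
import math
import random

TRUE_EFFECT = 15.0   # hypothetical true improvement, in percentage points
SD = 10.0            # assumed known standard deviation of individual results
N = 50               # participants per simulated study
Z_95 = 1.96          # normal critical value for a 95% interval


def ci_coverage(n_studies=10_000, seed=2):
    """Fraction of simulated 95% CIs that contain the fixed true effect."""
    rng = random.Random(seed)
    se = SD / math.sqrt(N)
    covered = 0
    for _ in range(n_studies):
        sample = [rng.gauss(TRUE_EFFECT, SD) for _ in range(N)]
        estimate = sum(sample) / N
        lower, upper = estimate - Z_95 * se, estimate + Z_95 * se
        if lower <= TRUE_EFFECT <= upper:
            covered += 1
    return covered / n_studies


print(ci_coverage())
```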

How to Interpret p-Value and CI Together

Scenario 1: Small p-value + Narrow CI

P < 0.001, Effect = 20% improvement (95% CI: 15–25%)

Interpretation: Strong evidence of a real, moderate-sized effect. The true effect is probably substantial.

Scenario 2: Small p-value + Wide CI

P = 0.02, Effect = 20% improvement (95% CI: 2–38%)

Interpretation: Reasonable evidence of a real effect, but the true effect could be small (2%) or large (38%). There is considerable uncertainty about the actual magnitude.

Scenario 3: Large p-value (not significant) + Wide CI

P = 0.20, Effect = 10% improvement (95% CI: -5–25%)

Interpretation: Weak evidence; the treatment might do slightly worse than nothing (-5%) or might work well (25%). The data cannot tell which. The study is probably too small.

Scenario 4: Large p-value + Narrow CI

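P = 0.30, Effect = 0.5% improvement (95% CI: -0.45–1.45%)

Interpretation: No statistically significant effect, but the narrow CI is still informative: any true effect is likely tiny (somewhere between a 0.45% worsening and a 1.45% improvement). Unlike Scenario 3, this study effectively rules out a large benefit. A CI this narrow typically comes from a large sample.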
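Because a p-value and a 95% CI are computed from the same estimate and standard error, you can roughly check one against the other. A sketch, assuming the CI is symmetric and based on a normal approximation (papers that use t-distributions or ratio scales need a different calculation): the standard error is about the CI width divided by 2 × 1.96, and the p-value follows from estimate / standard error. A 95% CI that excludes zero should correspond to p < 0.05, and one that includes zero to p > 0.05. The function name is hypothetical.

```python
import math


def approx_p_from_ci(estimate, lower, upper, z_crit=1.959964):
    """Approximate two-sided p-value (null = no effect) from a symmetric 95% CI."""
    se = (upper - lower) / (2 * z_crit)
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2))


# Scenario 3: 10% improvement, 95% CI -5% to 25%.
print(round(approx_p_from_ci(10, -5, 25), 2))
```

For Scenario 3 this gives about 0.19, in line with the reported p = 0.20.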

Type I and Type II Errors

Research errors come in two types:

Type I Error (False Positive)

Concluding the treatment works when it actually does not.

Probability of Type I error: This is what the significance threshold (often called “alpha”) controls. Setting the threshold at 0.05 means that, when the treatment truly has no effect, you accept a 5% chance of a Type I error.

Example: Declaring BPC-157 works when it actually does not.
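The claim that the threshold controls the Type I error rate can be checked by simulation. A sketch, assuming many hypothetical studies in which the treatment truly does nothing: both groups are drawn from the same normal distribution with a known standard deviation of 1, and each study is analyzed with a two-sided z-test. About 5% of these no-effect studies should still reach p < 0.05.

```python
import math
import random


def false_positive_rate(n_studies=10_000, n_per_group=20, alpha=0.05, seed=3):
    """Share of simulated no-effect studies that reach p < alpha."""
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of the difference when sd = 1
    false_positives = 0
    for _ in range(n_studies):
        treatment = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(treatment) / n_per_group - sum(control) / n_per_group) / se
        p = math.erfc(abs(z) / math.sqrt(2))
        if p < alpha:
            false_positives += 1
    return false_positives / n_studies


print(false_positive_rate())
```

Changing n_per_group does not move this rate away from 5%: sample size affects the Type II error rate, not the Type I error rate set by the threshold.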

Type II Error (False Negative)

Concluding the treatment does not work when it actually does.

Probability of Type II error: Called “beta” and often planned at 0.20 (20%), which corresponds to 80% power to detect the effect the study was designed for. It is often not reported explicitly in papers.

Example: Saying BPC-157 does not work when it actually does.

Why Both Errors Matter

A p < 0.05 threshold controls Type I error, but ignores Type II error.

A small study might have:

Low probability of Type I error (still held at 5% by the threshold, regardless of sample size)
High probability of Type II error (likely to miss real effects due to small sample)

This is why small negative studies (p > 0.05 concluding “no effect”) should not be taken as proof that a treatment does not work. The study might simply be too small to detect the real effect.
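Power (1 minus the Type II error rate) can be estimated before a study is run. A sketch using the standard normal-approximation formula for comparing two group means, assuming the effect is expressed in standard-deviation units (a difference of half a standard deviation is written 0.5) and ignoring the negligible chance of a significant result in the wrong direction. NormalDist is part of Python's standard statistics module (Python 3.8+); approx_power is a hypothetical name.

```python
import math
from statistics import NormalDist


def approx_power(effect_in_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    power = Phi(effect * sqrt(n / 2) - z_crit), where z_crit is the
    two-sided critical value for alpha.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_in_sd * math.sqrt(n_per_group / 2) - z_crit)


for n in (10, 64):
    power = approx_power(0.5, n)
    print(f"n = {n} per group: power = {power:.2f}, Type II error = {1 - power:.2f}")
```

Under these assumptions, a study with 10 participants per group has only about a 20% chance of detecting a half-standard-deviation effect (an 80% Type II error rate), while roughly 64 per group are needed for the conventional 80% power.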
