For decades, a p value below 0.05 was treated as the finish line of an analysis. Yet a p value never tells you how large a difference is, only how unusual the observed data would be if the null hypothesis were true. An effect size answers the question your research actually asks: how strong is the difference or relationship in practical terms? APA 7 reporting standards and many journal author guidelines now call for an effect size, ideally with a confidence interval, alongside every test. This guide brings Cohen's d, Hedges' g, η², ω² and r together into one coherent picture.
Why p values alone mislead
The core problem is that p depends directly on sample size: for two equal groups of n participants each, t = d × √(n/2), so any fixed non-zero d produces a larger t as n grows. A worked example: on a 100-point scale, two groups score means of 62.1 and 61.5 (standard deviation 20). That 0.6-point gap corresponds to Cohen's d = 0.03, which is practically nothing. With 10,000 participants per group, t ≈ 2.12 and p ≈ 0.034: the difference is "statistically significant". The very same d = 0.03 with 100 per group yields t ≈ 0.21 and p ≈ 0.83. The effect is equally trivial in both cases; only the sample changed. Given a large enough N, even the most negligible non-zero difference becomes significant.
The reverse holds too: in a pilot study with 20 participants per group, a near-moderate effect of d = 0.45 produces t ≈ 1.42 and p ≈ 0.16. Failing to reach significance does not demonstrate the absence of an effect; the study is simply underpowered. This is why significance and effect size must always be reported together, and why sample size planning in G*Power should start from the effect size you expect to detect.
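The figures in the two worked examples above can be checked with a short Python sketch. It assumes two equal-sized independent groups, so that t = d × √(n/2) with df = 2n − 2, and it obtains the two-sided p value by integrating the Student t density numerically with Simpson's rule, using only the standard library. The function names `t_from_d` and `two_sided_p` are illustrative and do not come from any package.

```python
import math


def t_from_d(d, n_per_group):
    """t statistic implied by Cohen's d for two equal-sized independent groups."""
    return d * math.sqrt(n_per_group / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p value from the Student t density, integrated with Simpson's rule."""
    # Log of the normalising constant of the t density with df degrees of freedom.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_half = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central_half


# The three scenarios from the text: (Cohen's d, participants per group).
for d, n in [(0.03, 10_000), (0.03, 100), (0.45, 20)]:
    t = t_from_d(d, n)
    p = two_sided_p(t, df=2 * n - 2)
    print(f"d = {d:.2f}, n = {n:>6} per group: t = {t:.2f}, p = {p:.3f}")
```

Holding d fixed and varying only n makes the point of this section visible directly: the effect size column never changes, while p moves from clearly non-significant to significant.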
The three effect size families: d, variance and correlation
The d family: Cohen's d and Hedges' g
Cohen's d divides the difference between two means by the pooled standard deviation: d = (M₁ − M₂) / pooled SD. A d of 0.50 means the groups sit half a standard deviation apart. In small samples, d systematically overstates the effect; Hedges' g removes that bias by multiplying d by the correction factor J ≈ 1 − 3/(4df − 1), where df = n₁ + n₂ − 2. With 10 participants per group (df = 18) the correction is roughly 4%; reporting g is standard practice whenever group sizes fall below about 20, and g is also the default metric in meta-analysis.
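As a minimal sketch of the two formulas above, the following Python computes d and g from summary statistics. The means, standard deviations and group sizes in the example call are hypothetical values chosen for illustration, and the function names are not taken from any library.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def hedges_correction(n1, n2):
    """Approximate small-sample correction J = 1 - 3 / (4 * df - 1)."""
    df = n1 + n2 - 2
    return 1 - 3 / (4 * df - 1)


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g: Cohen's d shrunk by the correction factor J."""
    return hedges_correction(n1, n2) * cohens_d(m1, sd1, n1, m2, sd2, n2)


# Hypothetical summary statistics: 10 participants per group.
d = cohens_d(55.0, 10.0, 10, 50.0, 10.0, 10)
g = hedges_g(55.0, 10.0, 10, 50.0, 10.0, 10)
print(f"J = {hedges_correction(10, 10):.4f}, d = {d:.3f}, g = {g:.3f}")
```

With 10 per group, J comes out just under 0.96, which is the roughly 4% shrinkage mentioned in the text.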
The variance-explained family: η², partial η² and ω²
In the ANOVA family, η² (eta squared) expresses the proportion of total variance explained by an effect: the sum of squares for the effect divided by the total sum of squares. In multi-factor designs, the value SPSS reports is partial η², which removes the other factors' variance from the denominator. It therefore comes out larger than classical η², and the two must never be interpreted interchangeably. Both are positively biased: the smaller the sample, the more they inflate the effect. ω² (omega squared) adjusts for error variance and is a noticeably less biased estimator, particularly in small samples; reviewers increasingly ask for it by name.
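The three measures differ only in how they combine entries from the ANOVA table, as the sketch below shows. It assumes a between-subjects, fixed-effects design and uses the common formula ω² = (SS_effect − df_effect × MS_error) / (SS_total + MS_error). The sums of squares are made-up numbers for a hypothetical two-factor design, not results from any real study.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of total variance attributed to the effect."""
    return ss_effect / ss_total


def partial_eta_squared(ss_effect, ss_error):
    """Effect variance relative to effect plus error variance only."""
    return ss_effect / (ss_effect + ss_error)


def omega_squared(ss_effect, df_effect, ss_error, df_error, ss_total):
    """Less biased estimate for between-subjects, fixed-effects designs."""
    ms_error = ss_error / df_error
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Hypothetical two-factor ANOVA table (illustrative values only):
# factor A, factor B, A x B interaction, error.
ss_a, df_a = 30.0, 2
ss_b, ss_ab = 50.0, 20.0
ss_error, df_error = 100.0, 54
ss_total = ss_a + ss_b + ss_ab + ss_error

print(f"eta^2 (A)         = {eta_squared(ss_a, ss_total):.3f}")
print(f"partial eta^2 (A) = {partial_eta_squared(ss_a, ss_error):.3f}")
print(f"omega^2 (A)       = {omega_squared(ss_a, df_a, ss_error, df_error, ss_total):.3f}")
```

For the same effect, partial η² is the largest of the three and ω² the smallest, which is why a write-up should always name the measure it reports.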
The correlation family: r and R²
Pearson's r summarises the direction and strength of a linear relationship between two continuous variables on a scale from −1 to +1; its square, r² (reported as R² in regression output), gives the proportion of variance explained. An r of 0.30 looks "medium", yet R² = 0.09: only 9% of the variance. For model contributions in regression, Cohen's f² = R²/(1 − R²) is used. Non-parametric tests such as the Mann–Whitney U convert onto the same scale via r = Z/√N, where Z is the standardised test statistic and N the total number of observations.
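The arithmetic in this paragraph is small enough to sketch in a few lines of Python. The Z value and sample size in the second example are hypothetical, chosen only to illustrate the conversion; the function names are likewise illustrative.

```python
import math


def cohens_f2(r_squared):
    """Cohen's f^2 from the proportion of variance explained."""
    return r_squared / (1 - r_squared)


def r_from_z(z, n_total):
    """Effect size r from a standardised test statistic Z and total sample size N."""
    return z / math.sqrt(n_total)


r = 0.30
r_squared = r**2
print(f"r = {r:.2f} -> R^2 = {r_squared:.2f}, f^2 = {cohens_f2(r_squared):.3f}")

# Hypothetical Mann-Whitney result: Z = 2.50 from 60 observations in total.
print(f"r from Z = {r_from_z(2.50, 60):.2f}")
```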
Converting between families
- r = d / √(d² + 4) for two equal-sized groups; in the other direction, d = 2r / √(1 − r²).
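- In a two-group design, η² equals r², where r is the correlation between group membership and the outcome; with equal group sizes, d = 0.50 therefore corresponds to r ≈ 0.24 and η² ≈ 0.06. The three families are different dialects describing the same effect.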
- Cohen's f = √(η² / (1 − η²)). G*Power's ANOVA module asks for f while SPSS outputs η²; skipping this conversion invalidates the power analysis.
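The conversions in the list above can be chained in code, which is a convenient way to check that the "same effect, different dialect" figures agree. This is a minimal sketch that assumes two equal-sized groups, as the formulas themselves do; the function names are illustrative.

```python
import math


def d_to_r(d):
    """Convert Cohen's d to r, assuming two equal-sized groups."""
    return d / math.sqrt(d**2 + 4)


def r_to_d(r):
    """Convert r back to Cohen's d, assuming two equal-sized groups."""
    return 2 * r / math.sqrt(1 - r**2)


def eta2_to_f(eta2):
    """Convert eta squared to Cohen's f, the input G*Power's ANOVA module asks for."""
    return math.sqrt(eta2 / (1 - eta2))


d = 0.50
r = d_to_r(d)
eta2 = r**2  # two-group design: eta squared equals r squared
f = eta2_to_f(eta2)
print(f"d = {d:.2f} -> r = {r:.2f}, eta^2 = {eta2:.2f}, f = {f:.2f}")
print(f"round trip: r = {r:.4f} -> d = {r_to_d(r):.2f}")
```

In the two-group, equal-n case these formulas imply f = d/2, so a "medium" d of 0.50 maps onto Cohen's "medium" f of 0.25.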
Which effect size for which test?
The table below pairs the most common tests with the recommended effect size measure and Cohen's classic benchmarks. If you are still choosing the test itself, see our statistical test decision guide.
| Test | Effect size | Small | Medium | Large |
|---|---|---|---|---|
| Independent / paired t-test | Cohen's d, Hedges' g | 0.20 | 0.50 | 0.80 |
| One-way / factorial ANOVA | η² (preferably ω²) | 0.01 | 0.06 | 0.14 |
| Pearson correlation | r | 0.10 | 0.30 | 0.50 |
| Multiple regression | f² | 0.02 | 0.15 | 0.35 |
| Chi-square test of independence | Cramér's V (φ for 2×2) | 0.10 | 0.30 | 0.50 |
| Mann–Whitney U / Wilcoxon | r = Z/√N | 0.10 | 0.30 | 0.50 |
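Cramér's V is the one measure in the table not defined elsewhere in this guide. A minimal sketch, assuming the usual definition V = √(χ² / (N × (k − 1))), with χ² the Pearson chi-square statistic without continuity correction and k the smaller of the number of rows and columns. The counts below are hypothetical.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic (no continuity correction) and total N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n


def cramers_v(table):
    """Cramer's V; for a 2x2 table this equals the absolute value of phi."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(stat / (n * (k - 1)))


# Hypothetical 2x2 contingency table (rows: group, columns: outcome).
counts = [[30, 20], [20, 30]]
stat, n = chi_square(counts)
print(f"chi-square = {stat:.2f}, N = {n}, V = {cramers_v(counts):.2f}")
```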
Cohen's benchmarks and the field-dependence critique
Cohen's 0.20 / 0.50 / 0.80 thresholds are a practical starting point, but Cohen himself offered them as a last resort, to be used only when no better yardstick exists. Carrying them across disciplines is misleading: in education research, a typical year of schooling yields around d ≈ 0.40, so an intervention effect of that magnitude amounts to roughly a full extra year of learning and is remarkable rather than merely "medium". In personality psychology, r = 0.30 sits near the upper end of what is realistic; in epidemiology, relationships with massive population consequences may hover around r = 0.05. The proper benchmark is the distribution of effects reported in meta-analyses and comparable studies within your own field. State explicitly in your discussion section where your finding falls within that distribution.
A p value tells you whether an effect can be distinguished from sampling noise; an effect size tells you whether anyone should care.
Confidence intervals and reporting requirements
An effect size is itself a sample estimate with uncertainty, so report the point value together with a 95% confidence interval. Intervals for d and η² are computed from non-central distributions; the R packages effectsize and MBESS, as well as JASP and jamovi, produce them automatically, whereas SPSS usually needs additional syntax. A result such as d = 0.45, 95% CI [0.02, 0.88] indicates the likely direction of the effect but says almost nothing about its size — and that should be written up honestly.
A model reporting format: t(58) = 2.31, p = .024, d = 0.60, 95% CI [0.08, 1.12]. Missing effect sizes have become one of the most frequent reasons thesis committees and journals send work back; building them in at the analysis stage saves you an entire revision round. If the request has already arrived from a reviewer, the strategies in our reviewer response guide will help.
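For a quick plausibility check of an interval like the one in the model format, a large-sample normal approximation is often close enough. The sketch below is that approximation, not the non-central method described above, and exact intervals from dedicated software may differ slightly. It assumes that t(58) comes from two independent groups of 30 each, which the reporting line itself does not state.

```python
import math


def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate CI for Cohen's d using a large-sample standard error.

    This is a normal approximation, not the non-central t method.
    """
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Assumed design: two groups of 30, which gives df = 58.
low, high = d_confidence_interval(0.60, 30, 30)
print(f"d = 0.60, approximate 95% CI [{low:.2f}, {high:.2f}]")
```

Under that assumption the approximation lands on the same bounds as the reported interval to two decimals. With smaller groups the gap between the approximation and the non-central interval widens, so the exact method remains the one to report.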
Frequently Asked Questions
What counts as a good effect size?
There is no universal threshold. Cohen's benchmarks (0.20/0.50/0.80 for d) are only a starting point; the correct interpretation compares your value against typical effects in meta-analyses and similar studies in your own field. For a low-cost intervention, even a small effect can be practically valuable.
Is the partial eta squared SPSS reports the same as eta squared?
No. They coincide in a one-way ANOVA, but in multi-factor designs partial eta squared removes the other factors' variance from the denominator and therefore comes out larger. State clearly which one you report, and check that cross-study comparisons use the same measure.
What does a negative Cohen's d mean?
Only the direction of the difference: the second group's mean exceeds the first's. Magnitude is interpreted using the absolute value; in the write-up, simply make explicit which group the difference favours.
What effect size support does Celsus provide?
We provide end-to-end support: computing the appropriate effect sizes for your tests, reporting them with confidence intervals in line with APA 7, justifying sample sizes with G*Power, and supplying missing effect sizes during peer review revisions.