Scale Development: EFA and CFA Steps

Scale development is the multi-stage psychometric process of turning an abstract construct (burnout, digital literacy, organisational commitment) into a set of measurable items. The shortcut that most often leads reviewers to reject a manuscript is compressing the whole process into a single factor analysis. A defensible scale runs as a chain: item pool, content validity, exploratory factor analysis (EFA), and then confirmatory factor analysis (CFA) on a separate sample. This guide sets out the decision rules for each link at thesis and journal standard.

Item pool and content validity

Begin with a generous item pool grounded in theory and qualitative groundwork (interviews, focus groups); writing at least twice as many items as you intend to keep is sound practice. The pool then goes to a panel of subject and measurement experts, whose judgements are quantified with the content validity ratio (CVR): for each item, CVR = (number of experts rating it essential − half the panel size) / (half the panel size). Items falling below the critical value for the panel size are dropped or revised, after which a small cognitive pre-test checks that respondents read the items as intended. When drafting items, avoid double-barrelled statements, keep reverse-coded items to a minimum, and fix the response format (typically a 5- or 7-point Likert scale) from the outset.
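The CVR formula above can be sketched in a few lines of Python. This is a minimal illustration, not a validated tool: the item names and expert counts are hypothetical, and the critical value of 0.62 for a ten-person panel is an assumption taken from the commonly cited Lawshe table, so check it against the table you actually cite.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """CVR = (experts rating the item essential - N/2) / (N/2)."""
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("need 0 <= n_essential <= n_experts and n_experts > 0")
    half_panel = n_experts / 2
    return (n_essential - half_panel) / half_panel


# Hypothetical panel of 10 experts: item -> number who rated it "essential".
PANEL_SIZE = 10
CRITICAL_VALUE = 0.62  # assumption: critical CVR for a 10-expert panel
essential_counts = {"item_1": 10, "item_2": 8, "item_3": 5}

for item, count in essential_counts.items():
    cvr = content_validity_ratio(count, PANEL_SIZE)
    decision = "keep" if cvr >= CRITICAL_VALUE else "revise or drop"
    print(f"{item}: CVR = {cvr:.2f} -> {decision}")
```

CVR ranges from -1 (no expert rates the item essential) to +1 (all do), and equals 0 when exactly half the panel does.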

Before EFA: Sampling adequacy checks

Two conditions must be reported before factoring. The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy should be at least 0.60; values of 0.80 and above are good, and 0.90 and above excellent. Bartlett's test of sphericity must be significant (p < 0.05); otherwise the correlation matrix cannot be distinguished from an identity matrix and is not factorable. For sample size, the common working rule is 5–10 participants per item, and in absolute terms an EFA sample of around 300 reassures most reviewers. Missing-data and outlier screening also belongs to this stage.
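To make the two checks concrete, here is a hedged, standard-library sketch that computes the overall KMO and Bartlett's chi-square statistic from a correlation matrix. It assumes the usual textbook definitions (KMO from the squared correlations and squared partial correlations; Bartlett's statistic from the log-determinant), and the function names are illustrative. In practice a statistics package would do this for you.

```python
import math


def invert_with_determinant(matrix):
    """Return (inverse, determinant) of a small square matrix via Gauss-Jordan."""
    n = len(matrix)
    # Augment the matrix with an identity block: [A | I].
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    det = 1.0
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; check for redundant items")
        if pivot != col:
            aug[col], aug[pivot] = aug[pivot], aug[col]
            det = -det  # a row swap flips the sign of the determinant
        pivot_value = aug[col][col]
        det *= pivot_value
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug], det


def kmo(corr):
    """Overall KMO measure of sampling adequacy from a correlation matrix."""
    inverse, _ = invert_with_determinant(corr)
    p = len(corr)
    sum_r2 = sum_partial2 = 0.0
    for i in range(p):
        for j in range(p):
            if i != j:
                sum_r2 += corr[i][j] ** 2
                # Squared partial correlation between items i and j.
                sum_partial2 += inverse[i][j] ** 2 / (inverse[i][i] * inverse[j][j])
    return sum_r2 / (sum_r2 + sum_partial2)


def bartlett_sphericity(corr, n_obs):
    """Return (chi_square, degrees_of_freedom) for Bartlett's test."""
    p = len(corr)
    _, det = invert_with_determinant(corr)
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(det)
    return chi_square, p * (p - 1) // 2


# Hypothetical 3-item correlation matrix from 100 respondents.
toy_corr = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(f"KMO = {kmo(toy_corr):.3f}")
print("Bartlett chi-square, df =", bartlett_sphericity(toy_corr, n_obs=100))
```

The p-value is the upper tail of the chi-square distribution at the returned statistic and degrees of freedom. The standard library has no chi-square distribution, so the sketch stops at the statistic; `scipy.stats.chi2.sf` is one way to obtain the tail probability.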

EFA in scale development: Extraction, retention, rotation

For extraction, principal axis factoring is robust to departures from normality, while maximum likelihood (ML) offers fit statistics and confidence intervals when multivariate normality holds. Principal components analysis (PCA) is, strictly speaking, a data-reduction technique rather than a latent-variable model, so it should not be the default in scale work.

On the number of factors, leaning on the eigenvalue > 1 (Kaiser) rule alone is the best-known error: it systematically over-extracts. Current practice triangulates three sources of evidence: the break in the scree plot, parallel analysis (retain factors whose observed eigenvalues exceed those generated from random data), and theoretical interpretability. For rotation, choose varimax (orthogonal) only if factors can plausibly be assumed uncorrelated; in the social sciences they almost never are, so promax or oblimin (oblique) is the sensible default. If the post-rotation factor correlations fall below 0.30, an orthogonal solution may be reported as well.
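Parallel analysis is easier to trust once you have seen it spelled out. The sketch below is a simplified, standard-library version under stated assumptions: it compares eigenvalues of the full correlation matrix (the PCA-style variant), draws the random data from a standard normal distribution, and uses the 95th percentile of the simulated eigenvalues as the benchmark. The source text does not specify these choices, and the small Jacobi eigenvalue routine stands in for what a numerical library would normally provide.

```python
import math
import random


def correlation_matrix(data):
    """Pearson correlations between the columns of a row-major data set."""
    n, p = len(data), len(data[0])
    columns = []
    for j in range(p):
        col = [row[j] for row in data]
        mean = sum(col) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n - 1))
        columns.append([(x - mean) / sd for x in col])
    return [[sum(a * b for a, b in zip(columns[i], columns[j])) / (n - 1)
             for j in range(p)] for i in range(p)]


def symmetric_eigenvalues(matrix, max_sweeps=100):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off_diagonal < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the off-diagonal element a[p][q].
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def parallel_analysis(data, n_sims=100, percentile=95, seed=0):
    """Count leading factors whose observed eigenvalue beats the random benchmark."""
    n, p = len(data), len(data[0])
    observed = symmetric_eigenvalues(correlation_matrix(data))
    rng = random.Random(seed)
    simulated = []
    for _ in range(n_sims):
        noise = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        simulated.append(symmetric_eigenvalues(correlation_matrix(noise)))
    retained = 0
    for k in range(p):
        kth_values = sorted(sim[k] for sim in simulated)
        benchmark = kth_values[min(n_sims - 1, int(n_sims * percentile / 100))]
        if observed[k] <= benchmark:
            break  # stop at the first factor that fails the comparison
        retained += 1
    return retained
```

Calling `parallel_analysis(rows)` on a list of respondent rows (one numeric value per item) returns the suggested number of factors, which should still be weighed against the scree plot and theoretical interpretability.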

Loading threshold: Drop items whose primary loading is below 0.40 (some fields accept 0.32 as the floor).
Cross-loadings: Where an item loads on two factors, the gap between loadings should be at least 0.20; otherwise the item is ambiguous.
Communality: Items with communalities below 0.30 deserve scrutiny.
One item at a time: Re-run the EFA after each removal; deleting items in bulk distorts the solution.
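The four rules above can be encoded as a simple screening helper. This is a sketch under assumptions: the loadings and communalities are hypothetical, a cross-loading is only flagged when the secondary loading is itself salient (0.32 or more, an assumed reading of "loads on two factors"), and the tie-break of dropping the flagged item with the weakest primary loading first is one reasonable convention, not a fixed rule. The EFA itself must be re-run after each removal; that step is not shown.

```python
def screen_item(loadings, communality, min_primary=0.40, min_gap=0.20,
                min_communality=0.30, salient=0.32):
    """Return the list of reasons an item is flagged (empty list = no flags)."""
    ranked = sorted((abs(x) for x in loadings), reverse=True)
    reasons = []
    if ranked[0] < min_primary:
        reasons.append("low primary loading")
    # Flag a cross-loading only when the second loading is itself salient.
    if len(ranked) > 1 and ranked[1] >= salient and ranked[0] - ranked[1] < min_gap:
        reasons.append("cross-loading")
    if communality < min_communality:
        reasons.append("low communality")
    return reasons


def next_item_to_drop(solution):
    """Pick ONE flagged item (weakest primary loading) or None if all pass."""
    flagged = [name for name, (loadings, h2) in solution.items()
               if screen_item(loadings, h2)]
    if not flagged:
        return None
    return min(flagged, key=lambda name: max(abs(x) for x in solution[name][0]))


# Hypothetical rotated solution: item -> (loadings on two factors, communality).
solution = {
    "item_1": ([0.72, 0.10], 0.53),
    "item_2": ([0.55, 0.45], 0.50),
    "item_3": ([0.35, 0.05], 0.13),
}
for name, (loadings, h2) in solution.items():
    print(name, screen_item(loadings, h2) or "ok")
print("drop first:", next_item_to_drop(solution))
```

Returning a single item, rather than every flagged item, mirrors the one-at-a-time rule: refit the model, then screen again.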

[Figure: EFA loadings for a six-item subscale (illustrative). Item 5 is borderline; Item 6 is dropped.]

CFA: Separate sample and fit indices

The structure recovered by EFA cannot be confirmed by running CFA on the same data — that is circular evidence. The ideal design either collects a fresh sample or randomly splits one sufficiently large sample in two. The CFA is specified in AMOS, lavaan (R) or Mplus, with each item loading only on its own factor, and standardised loadings are reported alongside the fit indices below. If the model is rejected, modification indices should be acted on only with a theoretical rationale — for example, correlating the errors of two items that share a wording stem. For choosing between estimation traditions and software, see our SEM guide.
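When a single large sample is split, the split should be random and reproducible. A minimal sketch, assuming participants are identified by hypothetical integer IDs and that an even split is wanted (the share and the seed are illustrative parameters):

```python
import random


def split_sample(participant_ids, efa_share=0.5, seed=2024):
    """Randomly split participants into non-overlapping EFA and CFA subsamples."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split can be reported
    cut = round(len(ids) * efa_share)
    return ids[:cut], ids[cut:]


# Hypothetical sample of 600 respondents.
efa_ids, cfa_ids = split_sample(range(600))
print(len(efa_ids), len(cfa_ids))
```

Recording the seed in the methods section lets a reviewer reproduce exactly which respondents fell into each half.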

Match the estimator to the data: maximum likelihood is adequate for roughly normal Likert items with five or more categories, while items with four or fewer categories, or marked skew, call for robust estimators of the weighted least squares (WLSMV) family. Remember that the χ² test flags even trivial misfit in large samples, so the verdict should rest on the full set of indices below rather than any single value.

CFA fit indices and threshold values
Index	Acceptable	Good fit
χ²/df	≤ 3	≤ 2
RMSEA	≤ 0.08	≤ 0.05
CFI	≥ 0.90	≥ 0.95
TLI	≥ 0.90	≥ 0.95
SRMR	≤ 0.08	≤ 0.05
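The thresholds in the table can be applied mechanically with a small lookup, which is useful when summarising several candidate models. The sketch below simply encodes the table; the label "poor" for values outside the acceptable band is an added convention, and the example fit values are hypothetical.

```python
# index: (direction, acceptable cut-off, good-fit cut-off), as in the table above
THRESHOLDS = {
    "chi2/df": ("max", 3.0, 2.0),
    "RMSEA": ("max", 0.08, 0.05),
    "CFI": ("min", 0.90, 0.95),
    "TLI": ("min", 0.90, 0.95),
    "SRMR": ("max", 0.08, 0.05),
}


def classify_fit(index, value):
    """Return 'good', 'acceptable' or 'poor' for one fit index."""
    direction, acceptable, good = THRESHOLDS[index]
    if direction == "max":  # smaller is better
        if value <= good:
            return "good"
        if value <= acceptable:
            return "acceptable"
    else:  # larger is better
        if value >= good:
            return "good"
        if value >= acceptable:
            return "acceptable"
    return "poor"


# Hypothetical CFA output.
fit = {"chi2/df": 2.4, "RMSEA": 0.061, "CFI": 0.96, "TLI": 0.93, "SRMR": 0.045}
for index, value in fit.items():
    print(f"{index}: {value} -> {classify_fit(index, value)}")
```

A lookup like this reports each index separately on purpose, since the overall verdict should rest on the pattern across indices rather than on any single value.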

Construct validity and reliability: AVE, CR, alpha

After the CFA, convergent validity requires an average variance extracted (AVE) of at least 0.50 and a composite reliability (CR) of at least 0.70 for each factor; where AVE dips slightly below 0.50, convergent validity can still be defended if CR remains at or above 0.70. For discriminant validity, the classical criterion is that the square root of each factor's AVE exceeds that factor's correlations with the others (the Fornell-Larcker criterion); recent work adds the HTMT ratio as a complementary check. For reliability, Cronbach's alpha ≥ 0.70 is the familiar benchmark, but since alpha assumes equal loadings, reporting McDonald's omega alongside it is now expected. When writing up, follow the conventions in our guide on reporting statistics in APA 7.
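AVE, CR and the Fornell-Larcker check can all be computed directly from the standardised CFA loadings. The following sketch assumes standardised loadings with uncorrelated error terms, so that each item's error variance is one minus its squared loading; the loadings and the inter-factor correlation are hypothetical.

```python
import math


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardised loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    true_part = sum(loadings) ** 2
    error_part = sum(1 - l ** 2 for l in loadings)
    return true_part / (true_part + error_part)


def fornell_larcker_ok(ave, correlations_with_other_factors):
    """True if sqrt(AVE) exceeds every correlation with the other factors."""
    return all(math.sqrt(ave) > abs(r) for r in correlations_with_other_factors)


# Hypothetical factor with three items, correlating 0.55 with one other factor.
loadings = [0.70, 0.80, 0.90]
ave = average_variance_extracted(loadings)
cr = composite_reliability(loadings)
print(f"AVE = {ave:.3f}, CR = {cr:.3f}")
print("Fornell-Larcker satisfied:", fornell_larcker_ok(ave, [0.55]))
```

Cronbach's alpha and McDonald's omega need the raw item data or the fitted model, so they are best taken from the software output rather than recomputed by hand.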

If the scale will be used to compare groups — gender, culture, mode of administration — the final step is to test measurement invariance through multi-group CFA at the configural, metric and scalar levels; group means cannot be compared until scalar invariance holds. Presenting every item-removal decision from pool to final form in a single flow table, with reasons, also speeds up peer review considerably.

A scale is not born where items survive; it is born where the structure is proven twice, in two independent samples.

Frequently Asked Questions

How large a sample do I need for scale development?

The working rule is 5–10 participants per item, with around 300 for the EFA and a separate 200–300 for the CFA as a safe target. EFA and CFA must not share participants; a single large sample can be randomly split in two.

Can EFA and CFA be reported in the same paper?

Yes — in fact it is expected, provided the two analyses use different samples. Papers typically present a two-study design: Study 1 for the EFA and Study 2 for the CFA.

Must an item loading below 0.40 always be deleted?

No. The threshold is a starting point for a decision, not a verdict. A theoretically indispensable item loading between 0.32 and 0.40 can be retained, provided the decision is explicitly justified and the item's behaviour is monitored closely in the CFA and reliability analyses.

What scale development support does Celsus provide?

We support item pool and expert panel design, CVR calculation, EFA with parallel analysis, CFA on an independent sample, AVE-CR and reliability analyses, and write-ups formatted for theses and journals. Deliverables include the full SPSS, R or AMOS output.