‘Identification’ in CFA and SEM

Identification refers to whether a model is ‘estimable’: more specifically, whether there is a single best solution for the parameters specified in the model. An analogy would be the ‘line of best fit’ in regression. If we could draw two lines that fit the data equally well, then our method would give us no way to choose between them, and the result would be essentially meaningless (or uninterpretable, anyway).
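To make the analogy concrete, here is a minimal sketch using a made-up toy model (not one from the text) in which only the sum of two parameters affects the fit. The function name `sum_squared_error` and the data are hypothetical; the point is simply that two different parameter sets can fit equally well, so the data cannot choose between them.

```python
# Hypothetical toy model: y = (a + b) * x.
# Only the sum a + b influences the predictions, so a and b
# cannot be separated by the data: the model is not identified.

def sum_squared_error(a, b, xs, ys):
    """Sum of squared residuals for the toy model y = (a + b) * x."""
    return sum((y - (a + b) * x) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # illustrative data generated with a + b = 2

# Two very different parameter sets with the same sum
fit_1 = sum_squared_error(0.5, 1.5, xs, ys)
fit_2 = sum_squared_error(-3.0, 5.0, xs, ys)

print(fit_1 == fit_2)  # the two 'solutions' fit equally well
```

Fixing one of the two parameters to a constant would make the other estimable, which is the same logic as giving a latent variable a scale.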

This is a complex topic, but David Kenny has an excellent page that covers identification in detail: http://davidakenny.net/cm/identify.htm. Some of the key ideas to take away are:

  • Feedback loops and other non-recursive models are likely to cause problems without special attention.

  • Latent variables need a scale. To give them one, either fix the latent's variance (commonly to 1) or fix one of its factor loadings to 1.

  • You need ‘enough data’. Normally this will be at least 3 measured variables per latent. Sometimes 2 is enough, provided the errors of these variables are uncorrelated, but you may struggle to fit models because of ‘empirical under-identification’ [1].

  • If a model is non-identified, it may either i) fail to run or, worse, ii) produce spurious results.
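One way to see why three indicators per latent is the usual minimum is to count knowns against unknowns. The sketch below is an illustration rather than anything from Kenny's page verbatim: it assumes a single-factor model with uncorrelated errors, scaled by fixing one loading to 1, and the function name `single_factor_df` is hypothetical. A non-negative count is only a necessary condition, so it does not guarantee identification.

```python
def single_factor_df(n_indicators):
    """Degrees of freedom for a one-factor model (assumed setup:
    uncorrelated errors, one loading fixed to 1 to set the scale)."""
    k = n_indicators
    # Knowns: unique variances and covariances among the indicators
    knowns = k * (k + 1) // 2
    # Unknowns: (k - 1) free loadings + k error variances + 1 factor variance
    free_params = (k - 1) + k + 1
    return knowns - free_params

for k in (2, 3, 4):
    print(k, single_factor_df(k))
```

Under these assumptions, two indicators give negative degrees of freedom (more unknowns than knowns), three give exactly zero (just-identified), and four or more leave degrees of freedom to spare. A two-indicator latent can sometimes borrow information from the rest of a larger model, which is why 2 is occasionally enough.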

Rule B

For structural models, ‘Rule B’ also applies when deciding whether a model is identified. For any pair of variables or latents, X and Y, in your model, no more than one of the following statements should be true:

  • X directly causes Y
  • Y directly causes X
  • X and Y have a correlated disturbance
  • X and Y are correlated exogenous variables

But see http://davidakenny.net/cm/identify_formal.htm#RuleB for a proper explanation.
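The rule can be checked mechanically for every pair of variables. The following is a minimal sketch, not an established implementation: the function `rule_b_violations` and its argument names are hypothetical, and it assumes you can list the model's direct effects as (cause, effect) pairs and its correlated disturbances and correlated exogenous variables as unordered pairs.

```python
from itertools import combinations

def rule_b_violations(direct_effects, correlated_disturbances, correlated_exogenous):
    """Return the pairs (X, Y) for which more than one Rule B statement is true.

    direct_effects: iterable of (cause, effect) tuples
    correlated_disturbances, correlated_exogenous: iterables of 2-tuples
    (order within these pairs does not matter)
    """
    effects = set(direct_effects)
    disturbances = {frozenset(pair) for pair in correlated_disturbances}
    exogenous = {frozenset(pair) for pair in correlated_exogenous}

    # Collect every variable mentioned anywhere in the model description
    variables = {v for pair in effects for v in pair}
    variables |= {v for pair in disturbances | exogenous for v in pair}

    violations = []
    for x, y in combinations(sorted(variables), 2):
        n_true = sum([
            (x, y) in effects,                  # X directly causes Y
            (y, x) in effects,                  # Y directly causes X
            frozenset((x, y)) in disturbances,  # correlated disturbance
            frozenset((x, y)) in exogenous,     # correlated exogenous variables
        ])
        if n_true > 1:
            violations.append((x, y))
    return violations

# A simple mediation model: no pair breaks the rule
mediation = rule_b_violations([("X", "M"), ("M", "Y"), ("X", "Y")], [], [])

# A feedback loop between A and B: two statements are true for the same pair
feedback = rule_b_violations([("A", "B"), ("B", "A")], [], [])

print(mediation, feedback)
```

An empty result is best treated as a screening check rather than proof that the model is identified; Kenny's page sets out the conditions in full.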

  1. Note that the indicators themselves should be correlated with one another in a bivariate correlation matrix. It is only their errors that should be uncorrelated.