When is a model identified?

A model is identified when it is theoretically possible to derive a unique estimate of each of its parameters. In other words, for a statistical model to be identified, we need information indicating that there is one best value for each parameter in the model whose value is not known. Identification is made possible by information about the distribution of the observed x and y variables.

This FAQ explains some conceptual and statistical methods related to identification and assumes that the reader is familiar with structural equation modeling and the matrix algebra used in structural equation models. To understand the concepts behind identification, it is necessary to understand some basic definitions and the equations used to solve for parameter estimates.

The rules that follow are for models with observed variables. Known parameters are parameters that are known to be identified: the variances and covariances of the observed variables have consistent sample estimators and are uniquely estimable. By contrast, a single equation such as x + y = 5 has an infinite number of solutions for x and y.

These values are therefore "underidentified" because there are fewer "knowns" than "unknowns". For researchers, a just-identified model yields a perfect fit, which is not really meaningful and thus makes the test of the model's fit uninteresting. An overidentified model occurs when every parameter is identified and at least one parameter can be solved for in more than one way from the known information.

Typically, most people who use structural equation modeling prefer to work with models that are overidentified. An overidentified model has positive degrees of freedom and may not fit as well as a model which is just identified. Imposing restrictions on the model when we have an overidentified model provides us with a test of our hypotheses, which can then be evaluated using the Chi-square statistic and fit indices.

The positive degrees of freedom associated with an overidentified model allows the model to be falsified with a statistical test. When an overidentified model does fit well, then the researcher typically considers the model to be an adequate fit for the data.

The following example of a structural equation model should help make the concept of identification more clear.

This example appears in Rigdon. It can be visually represented as a confirmatory factor analysis model with a single latent variable, ξ1, and separate error variance estimates, δ1 and δ2, for each of the two observed variables, X1 and X2. Algebraically, the model can be written as:

X1 = λ1 ξ1 + δ1
X2 = λ2 ξ1 + δ2

This model is not identified. As stated earlier, the knowns in structural equation modeling are based on information from the distribution of the x and y variables, that is, the variances and covariances of the measured variables, while the unknowns consist of model parameters.

If we count the number of known, identified values, we have two variances (the variances of X1 and X2) and one covariance (Cov[X1, X2]). So, we have three known pieces of information. How many unknown parameters are we trying to estimate using these three known pieces of information? The model has two error variances (δ1 and δ2), two factor loading paths (X1 to ξ1 and X2 to ξ1), and one factor variance (the variance of ξ1). This means that the model has five unknown parameters to estimate based on three known pieces of information.
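As a sanity check, the bookkeeping above can be scripted. This is a minimal sketch in plain Python; the parameter names are labels for the quantities described in the text, not identifiers from any SEM package.

```python
# Count the known pieces of information (unique variances and covariances)
# versus the unknown parameters for the two-indicator factor model above.

def n_knowns(p):
    """Unique elements of a p x p covariance matrix: p variances plus
    p*(p-1)/2 covariances."""
    return p * (p + 1) // 2

observed = ["X1", "X2"]
unknowns = ["loading_X1", "loading_X2",   # two factor loading paths
            "err_var_d1", "err_var_d2",   # two error variances
            "factor_var"]                 # variance of the latent variable

knowns = n_knowns(len(observed))
print(f"knowns = {knowns}, unknowns = {len(unknowns)}")
# knowns = 3, unknowns = 5 -> fewer knowns than unknowns: underidentified
```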

Therefore, the model is not identified. To move the model from an underidentified state to an identified one, it is necessary to impose additional constraints. If, for example, we fix the variance of the latent variable ξ1 to 1 to set its scale and constrain the two factor loadings to be equal, the number of unknowns drops from five to three. Since the number of known pieces of information now equals the number of unknown parameters we wish to solve for, the new model is just identified.
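Once the model is just identified, the three unknowns can be solved by hand from the three knowns. The sketch below assumes one common identification choice (factor variance fixed to 1 and the two loadings constrained equal, so that Cov[X1, X2] = λ² and Var[Xi] = λ² + θi) and uses made-up covariance values purely for illustration.

```python
import math

# Toy "known" values: two variances and one covariance (illustrative only).
var_x1, var_x2, cov_x1x2 = 1.50, 1.30, 0.80

# Under the constraints Var(xi1) = 1 and equal loadings lambda:
#   Cov(X1, X2) = lambda^2,  Var(Xi) = lambda^2 + theta_i
lam = math.sqrt(cov_x1x2)        # the common factor loading
theta1 = var_x1 - cov_x1x2       # error variance of X1
theta2 = var_x2 - cov_x1x2       # error variance of X2

print(lam, theta1, theta2)
# Three knowns, three unknowns, one exact solution: just identified.
```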

When you impose identifying constraints on your model, the constraints should be consistent with your theoretical predictions. A general representation of structural equations with observed variables, taken from Bollen, is:

y = By + Γx + ζ

The ζ term represents random errors in the relationships between the x's and y's and is sometimes referred to as the errors in the equations. The standard assumption is that the errors ζ are uncorrelated with x.

Thus, for models with observed variables, x and y are assumed to exactly represent the latent η and ξ variables, and therefore only one indicator is used for each variable. The parameters whose identification status is unknown are in θ, where θ contains the t free and nonredundant constrained parameters of B (beta), Γ (gamma), Φ (phi), and Ψ (psi). If an unknown parameter in θ can be written as a function of one or more elements of Σ, the covariance matrix of the observed variables, then that parameter is identified.

If all of the unknown parameters in θ are identified, then the model is identified. Identification is not related to sample size; a model is not underidentified because one doesn't have enough cases. The population covariance matrix is the source of identifying information, and the parameters refer to the population, not to sample values. So, no matter how big your sample size is, an unidentified parameter remains unidentified.

Model identification is achieved by placing restrictions on your model parameters. For example, if a researcher were to free all of the elements in B, Γ, Φ, and Ψ to see which relations were significant, the model would not run because it would be underidentified. For our data to be meaningful and tell us about associations, we must restrict certain parameters and free others.

Most commonly, researchers set elements of the B, Γ, Φ, or Ψ matrices to zero. Several familiar analyses are special cases of SEM. These include between-group and within-group variance comparisons, which are typically associated with ANOVA. They also include path analysis and regression analysis, whereby equations representing the effect of one or more variables on others can be solved to estimate their relationships.

Factor analysis is another special case of SEM, whereby unobserved variables (factors, or latent variables) are estimated from measured variables. These analyses can usually be performed using data in the form of means, correlations, or covariances rather than raw observations. These data, moreover, may be obtained from experimental, nonexperimental, and observational studies.

All of these techniques can be incorporated into the following example. Several symptoms of a disease are measured and used in a factor model that represents these symptoms. The impact of different types of medication on the factor(s) is then compared across the measured behavioral and environmental conditions.

To conduct the above analyses, both a structural (i.e., path) model and a measurement model must be specified. The structural model refers to the relationships among latent variables and allows the researcher to determine their degree of association, calculated as path coefficients, a concept originally defined by Wright. Each structural equation coefficient is computed while all other variances are taken into account. Thus, coefficients are calculated simultaneously for all endogenous variables rather than sequentially, as in regular multiple regression models.

To determine the magnitude of these coefficients, the researcher specifies the structure of the model. This is depicted in Figure 1. As shown, the researcher may expect a correlation between variables A and B, indicated by a double-headed arrow. There may be no expected relationship between variables A and C, so no arrow is drawn. Finally, the researcher may hypothesize a unidirectional relationship of variable C to B, indicated by an arrow pointing from C to B.

The relationships among variables A, B, and C represent the structural model. Researchers detail these relationships by writing a series of equations, hence the term 'structural equation' referring to the relationships between the variables.

The combination of these equations specifies the pattern of relationships [ 12 ]. The second component to be specified is the measurement model. As represented in Figure 1, it consists of the measured variables. Latent variables are factors like those derived from factor analysis, which consist of at least two interrelated measured variables. They are called latent because they are not directly measured, but rather are represented by the overlapping variance of the measured variables.

They are said to represent the research constructs better than measured variables do because they contain less measurement error. As indicated in Figure 1, for example, measurement model A depicts a latent variable A, which is the construct underlying measured variables 1 and 2. To further explicate the process of developing and analyzing a model, the following steps are outlined next. The researcher develops hypotheses about the relationships among variables that are based on theory, previous empirical findings, or both [ 15 ].

These relationships may be direct or indirect whereby intervening variables may mediate the effect of one variable on another. The researcher must also determine if the relationships are unidirectional or bidirectional, by using previous research and theoretical predictions as a guide.

The researcher outlines the model by determining the number and relationships of measured and latent variables. Care must be taken in using variables that provide a valid and reliable indicator of the constructs under study. The use of latent variables is not a substitute for poorly measured variables. A path diagram depicting the structural and measurement models will guide the researcher when identifying the model, as described next. Identifying the model is a crucial step in model development as decisions at this stage will determine whether the model can be feasibly evaluated.

For each parameter in the model to be estimated, there must be at least as many values (i.e., known pieces of information) as parameters. Estimation problems also occur when variables are highly intercorrelated (multicollinearity), when the scales of the variables are not fixed (the path from a latent variable to one of its measured variables must be set to a constant), or when there is no unique solution to the equations because underidentification leaves more parameters to be estimated than information provided by the measured variables.

In underidentified models there are an infinite number of solutions and therefore no unique one. These problems may be remedied with the addition of independent variables, which requires that the model be conceptualized before data are collected. There are many further issues to consider when managing parameters that cannot be addressed in this primer.

For further details on model identification, readers are encouraged to see Kline [ 16 ]. There are many estimation procedures available to test models, with three primary ones discussed here.

Maximum likelihood (ML) is the default estimator in most SEM software. It is an iterative process that estimates the extent to which the model predicts the values of the sample covariance matrix, with values closer to zero indicating better fit. The name maximum likelihood reflects its calculation: the estimate maximizes the likelihood that the data were drawn from the population. ML estimates require large sample sizes but do not usually depend on the measurement units of the measured variables.
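The ML discrepancy function commonly minimized in SEM is F_ML = ln|Σ(θ)| + tr(S Σ(θ)⁻¹) − ln|S| − p, which equals zero when the model-implied matrix reproduces the sample matrix S exactly. The sketch below hand-rolls the 2x2 linear algebra to stay dependency-free; the matrices are illustrative assumptions, not data from any real study.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def inv2(m):
    d = det2(m)
    return [[ m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d,  m[0][0] / d]]

def trace_prod2(a, b):
    # tr(A @ B) for 2x2 matrices
    return sum(a[i][j] * b[j][i] for i in range(2) for j in range(2))

def f_ml(S, Sigma):
    """The standard ML discrepancy for p = 2 observed variables."""
    p = 2
    return (math.log(det2(Sigma)) + trace_prod2(S, inv2(Sigma))
            - math.log(det2(S)) - p)

S = [[1.5, 0.8], [0.8, 1.3]]              # illustrative sample covariances
print(f_ml(S, S))                         # ~0 at perfect fit
print(f_ml(S, [[1.5, 0.0], [0.0, 1.3]]))  # positive when the model misfits
```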

ML is also robust to non-normal data distributions [ 17 ]. Another widely used estimator is least squares (LS), which minimizes the sum of the squares of the residuals in the model. LS is similar to ML in that it also examines patterns of relationships, but it finds the optimum solution by minimizing the sum of the squared deviations between the hypothesized and observed models. It often performs better with smaller sample sizes and provides more accurate estimates of the model when assumptions of distribution, independence, and asymptotic sample size are violated [ 18 ].
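In its simplest unweighted form, the least squares discrepancy is F_ULS = ½ tr[(S − Σ(θ))²], the sum of squared residual moments. A sketch with illustrative numbers:

```python
def f_uls(S, Sigma):
    """Unweighted least squares discrepancy for 2x2 covariance matrices."""
    resid = [[S[i][j] - Sigma[i][j] for j in range(2)] for i in range(2)]
    # For a symmetric residual matrix R, tr(R^2) is the sum of squared elements
    return 0.5 * sum(resid[i][j] ** 2 for i in range(2) for j in range(2))

S = [[1.5, 0.8], [0.8, 1.3]]               # illustrative sample covariances
print(f_uls(S, S))                          # 0.0 at perfect fit
print(f_uls(S, [[1.5, 0.0], [0.0, 1.3]]))   # 0.5 * (0.8^2 + 0.8^2) ~ 0.64
```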

The third, asymptotically distribution-free (ADF) estimation, also known as weighted least squares, is used less often but may be appropriate if the data are skewed or peaked.

ML, however, tends to be more reliable than ADF. The ADF method also requires large sample sizes to obtain reliable estimates even for simple models and may underestimate model parameters [ 16 , 19 ]. For further details see Hu et al. These estimation procedures determine how well the model fits the data. Fitting the latent variable path model involves minimizing the difference between the sample covariances and the covariances predicted by the model. The population model is formally represented as:

Σ = Σ(θ)

where Σ is the population covariance matrix of the observed variables and Σ(θ) is the covariance matrix implied by the model parameters θ.

This simple equation allows the implementation of a general mathematical and statistical approach to the analysis of linear structural equation systems through the estimation of parameters and the fitting of models.

Estimation procedures can be classified by the type of distribution assumed of the data (multinormal, elliptical, or arbitrary) and the weight matrix used during the computations. The function to be minimized is given by:

F = (s - σ(θ))' W⁻¹ (s - σ(θ))

where s and σ(θ) are the vectors of non-duplicated elements of the sample and model-implied covariance matrices. W is a weight matrix that can be specified in several ways to yield a number of different estimators, depending on the distribution assumed.
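A minimal sketch of this weighted form, assuming a diagonal W for simplicity (so that W⁻¹ acts elementwise) and using illustrative numbers; with W equal to the identity this reduces to an unweighted sum of squared residual moments.

```python
def vech(m):
    """Stack the non-duplicated (lower-triangular) elements of a matrix."""
    p = len(m)
    return [m[i][j] for i in range(p) for j in range(i + 1)]

def f_wls(s_vec, sigma_vec, w_diag):
    # (s - sigma)' W^-1 (s - sigma) with a diagonal weight matrix W
    return sum((si - gi) ** 2 / wi
               for si, gi, wi in zip(s_vec, sigma_vec, w_diag))

s = vech([[1.5, 0.8], [0.8, 1.3]])        # [1.5, 0.8, 1.3]
sigma = vech([[1.5, 0.0], [0.0, 1.3]])    # model-implied moments
print(f_wls(s, sigma, [1.0, 1.0, 1.0]))   # ~0.64 with identity weights
```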

Essentially, the researcher attempts to reproduce the population covariance matrix from the sample variables. An estimation procedure is then selected and run through an iterative process until the best solution is found. Another source of information in the output is the fit indices. There are many indices available, most ranging from 0 to 1, with a high value indicating a greater degree of the variance in the data accounted for by the model [ 21 ].

A good fit is also represented by low residual values and by a nonsignificant chi-square statistic. The chi-square statistic, however, varies as a function of sample size, cannot be directly interpreted (it has no upper bound), and is almost always significant in large samples. It is useful, however, when directly comparing models fit to the same sample. Dahly, Adair, and Bollen [ 22 ], for example, tested various fit indices for different models depicting the relationship of maternal height and arm fat area with fetal growth.

When adding and removing variables, as well as specifying varying relationships between variables, each corresponding fit index was calculated. This allowed the researchers to determine factors in the fetal environment that are most significantly related to systolic blood pressure of young adults. A comparison of indices was conducted by Hu and Bentler [ 18 ] on data that violated assumptions of normal distribution, independence of observations, and symmetry.
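The sample-size dependence of the chi-square statistic noted above can be illustrated with a quick sketch: the statistic is roughly (N − 1) times the minimized discrepancy, so the same residual misfit becomes significant as N grows. The F_min value below is hypothetical.

```python
def chi_square(f_min, n):
    """Model chi-square as (N - 1) times the minimized fit function value."""
    return (n - 1) * f_min

f_min = 0.05                # hypothetical minimized discrepancy, held fixed
for n in (100, 500, 2000):
    print(n, round(chi_square(f_min, n), 2))
# 100 -> 4.95, 500 -> 24.95, 2000 -> 99.95: same misfit, growing statistic
```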

Many of these are provided by standard SEM software packages. Sample size is an important consideration in judging a model's goodness-of-fit. Identification, however, is not a sample-size problem: if a model is unidentified, even infinite data could not recover the parameter values.


The following definitions summarize these counting rules. The knowns in a structural model are the variances and covariances of the observed variables; the covariance of a variable with itself is the variable's variance.

Free Parameters or Unknowns in a Structural Model. Under the standard specification, these are the paths; the covariances between the exogenous variables, between the disturbances, and between the exogenous variables and the disturbances; and the variances of the exogenous variables and of the disturbances of the endogenous variables, less the number of linear constraints.

For the path diagram, the number of unknowns is 10: 5 paths, 1 curved line, 2 exogenous variances, and 2 endogenous disturbance variances.

Path analytic specification: the paths (not including the disturbance paths) and the correlations between the exogenous variables, between the disturbances, and between the exogenous variables and the disturbances, less the number of linear constraints. For the path diagram, the number of unknowns is 6: 5 paths and 1 curved line.

Constraints. The setting of a parameter equal to some function of other parameters; the simplest constraint sets one parameter equal to another. Zero constraints are usually not counted.

Degrees of Freedom of a Model. The number of knowns minus the number of free parameters; used in many measures of fit.

The degrees of freedom can be viewed as the number of independent over-identifying restrictions. Just-Identified or Saturated Model. An identified model in which the number of free parameters exactly equals the number of known values, i.e., a model with zero degrees of freedom. Note that not all models in which the knowns equal the unknowns are identified; equal counts are necessary but not sufficient for identification.

The example model is just-identified: the number of knowns exactly equals the number of unknowns. Under-Identified Model. A model for which it is not possible to estimate all of the model's parameters; in some under-identified models, some individual parameters are nevertheless identified. Over-Identified Model. A model for which all the parameters are identified and there are more knowns than free parameters.
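The counting rules above can be checked mechanically. The sketch below assumes the path diagram has four observed variables, which is consistent with the counts given (10 unknowns and a just-identified model imply 4 * 5 / 2 = 10 knowns).

```python
def model_df(p, free_params):
    """Degrees of freedom: knowns (unique covariance-matrix elements)
    minus the number of free parameters."""
    knowns = p * (p + 1) // 2
    return knowns - free_params

print(model_df(4, 10))   # 0 -> just-identified (saturated)
print(model_df(4, 8))    # 2 -> over-identified, hence testable
print(model_df(4, 12))   # -2 -> under-identified
```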

An over-identified model places constraints on the correlation or covariance matrix. Over-Identifying Restriction. A constraint on the variance-covariance matrix of the observed variables. For instance, two covariances might be constrained to equal each other, although usually the restriction is more complicated.

Very often an over-identifying restriction can be thought of as the constraint that results when two estimates of an over-identified parameter are set equal.


