The effect of missing data handling methods on goodness of fit indices in confirmatory factor analysis

The primary objective of this study was to examine the effect of missing data on goodness of fit statistics in confirmatory factor analysis (CFA). To this end, four missing data handling methods were examined across sample sizes and proportions of missing data: listwise deletion, full information maximum likelihood (FIML), regression imputation and expectation maximization (EM) imputation. The results show that when the proportion of missingness is 1% or less, listwise deletion can be preferred. For larger proportions of missingness, the FIML method performs visibly better and yields the fit indices closest to those of the original data. For this reason, the FIML method can be preferred in CFA.


INTRODUCTION
Educational and psychological scientists have improved their ability to carry out quantitative analyses on large and complex databases with the use of computers. The goal of the researcher is to run the most precise analysis of the data in order to make acceptable and effective deductions about the population (Schafer and Graham, 2002). Researchers have generally ignored or underestimated the research problems caused by missing data, but with improved technology and computers these problems can now be handled more easily. The term missing data means that some information of interest about the phenomenon is missing (Kenny, 2005). Missing data is one of the most common problems in data analysis. The problem arises from various factors, such as equipment errors, reluctant respondents and researcher mistakes. The quantity and pattern of missing data determine its seriousness for the research (Tabachnick and Fidell, 2001).
Research on missing data dates as far back as the 1930s. At first, a maximum likelihood method for imputation of missing data in bivariate normal distributions was suggested by Wilks (1932, cited in Cheema, 2012). Several methods, such as linear regression, were introduced in the 1950s and 1960s. However, the absence of statistical software meant that little progress was made on handling missing data at that time. In the 1980s and 1990s, newly developed computer packages made handling missing data easy for researchers (Cheema, 2012).
Researchers in the behavioral sciences are likely to face missing data. All research studies should report the necessary information about missing data. Researchers should report the extent and nature of the missing data and the procedures used to handle them (Schlomer et al., 2010). Academic journals expect authors to take appropriate steps to handle missing data properly, but most articles do not give this issue the necessary attention (Sterner, 2011). "Small" percentages of missing values are less problematic, but there is no common definition of a "small amount of missing data" in the literature (Saunders et al., 2006).
Researchers should also take into consideration the pattern of missing data as well as its amount and source. The pattern of missing data is sometimes called the missing data mechanism. It is essential to note that the word "mechanism" is used here as a technical term. It refers to the structural association between the missingness and the observed and/or missing values of other variables in the data, without emphasizing the hypothetical underlying reason for these associations (Kenny, 2005). Missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) are the three patterns of missingness (Schlomer et al., 2010; Cheema, 2012).

Missing completely at random
Missing values have no relationship with any variable being examined and are randomly distributed throughout the data. In other words, the probability of missingness on Y depends on neither X nor Y (Schafer and Graham, 2002; Acock, 2005; Schlomer et al., 2010; Sterner, 2011). Mathematically, this can be presented as $P(Y \text{ missing} \mid X, Y) = P(Y \text{ missing})$.
For example, a student does not finish a test because of a sudden health problem.

Missing at random
With MAR, missing data may be related to at least one variable in the study but not to the outcome being measured (Schafer and Graham, 2002). It can be stated as $P(Y \text{ missing} \mid X, Y) = P(Y \text{ missing} \mid X)$.
For example, it could be difficult for an elderly person to finish the questionnaire because of age (a measured variable) but not because of his or her level of depression (the outcome being measured) (Saunders et al., 2006).

Missing not at random
With MNAR, the reason for missingness is related to one or more of the outcome variables, or the missingness has a systematic pattern (Schafer and Graham, 2002). Mathematically, it can be formulated as $P(Y \text{ missing} \mid X, Y) \neq P(Y \text{ missing} \mid X)$.
It means that something which has not been measured is a determining factor in the probability that an observation is missing (Davey and Salva, 2010; Molenberghs and Kenward, 2007). For example, participants in a study on depression who do not think they are making progress from the treatment may drop out and not complete the final questionnaire.
Statistical results can be affected more by the pattern of missing data than by the percentage of missingness. The pattern of missing data is basically determined by the randomness of the missing values. Random missingness is less problematic than nonrandom missingness, because nonrandomness affects the generalizability of results (Tabachnick and Fidell, 2001).
All measures inevitably contain some measurement error, to a greater or lesser extent. The results of statistical analyses and the conclusions drawn from them are affected by measurement error. Missing data lead to either measurement error or sampling error, depending on how the missing data are handled (Mackelprang, 1970). Researchers can handle missing data with various methods, which have different effects on estimation and on the decisions made on the basis of these estimates. These methods can be classified as "old methods" and "new methods". Old methods require fewer mathematical computations, whereas new methods are based on more complex mathematical computations (Saunders et al., 2006).
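The three mechanisms can be illustrated with a small simulation. The following is a minimal sketch in Python, not part of the original study: the function names `make_missing` and `observed_mean`, the 20% rate and the deterministic "drop the largest values" rule are all assumptions chosen for illustration. It shows how the mean of the observed Y values behaves when missingness on Y is unrelated to X and Y (MCAR), depends only on X (MAR), or depends on Y itself (MNAR).

```python
import random

def make_missing(x, y, mechanism, rate=0.2, seed=0):
    """Return a copy of y with about rate*n values replaced by None."""
    rng = random.Random(seed)
    n = len(y)
    k = round(rate * n)
    if mechanism == "MCAR":
        # Missingness on y is unrelated to both x and y.
        drop = set(rng.sample(range(n), k))
    elif mechanism == "MAR":
        # Missingness on y depends only on the observed x (largest x values).
        drop = set(sorted(range(n), key=lambda i: x[i])[n - k:])
    elif mechanism == "MNAR":
        # Missingness on y depends on the unobserved y itself (largest y values).
        drop = set(sorted(range(n), key=lambda i: y[i])[n - k:])
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return [None if i in drop else y[i] for i in range(n)]

def observed_mean(values):
    """Mean of the non-missing values."""
    obs = [v for v in values if v is not None]
    return sum(obs) / len(obs)

# Illustrative data: y is correlated with x (coefficients are arbitrary).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [0.6 * xi + rng.gauss(0, 0.8) for xi in x]

for mech in ("MCAR", "MAR", "MNAR"):
    print(mech, round(observed_mean(make_missing(x, y, mech)), 3))
```

Under these assumptions, the observed mean stays close to the full-sample mean only in the MCAR case. In the MAR case the shift can be explained by the observed X, whereas in the MNAR case it cannot be recovered from the observed data alone.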

Deletion methods
Listwise deletion: Listwise deletion, also called complete case analysis, is the most widespread and easiest method of handling missing data (Schafer and Graham, 2002; Acock, 2005; Enders, 2001; Enders, 2013). When listwise deletion is used, the computer program automatically discards cases with missing values from the data. This procedure reduces the sample size, which reduces statistical power, and researchers should question the representativeness of the remaining sample.

Pairwise deletion: Pairwise deletion is very similar to listwise deletion, but it discards cases with missing values only from the analyses that involve the affected variables. For instance, suppose there are three variables, X, Y and Z, and one case has a missing value on Z. The correlation procedure will use all n observations to calculate $r_{XY}$ but only n-1 observations to calculate $r_{XZ}$ and $r_{YZ}$ (Cheema, 2012).
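The difference between the two deletion methods can be made concrete with the three-variable example above. The sketch below is illustrative only: the helper names `listwise_delete` and `pairwise_n` and the data values are hypothetical, and missing values are represented by `None`.

```python
def listwise_delete(rows):
    """Keep only the cases (rows) with no missing value on any variable."""
    return [r for r in rows if None not in r]

def pairwise_n(rows, i, j):
    """Number of cases usable for the correlation between variables i and j."""
    return sum(1 for r in rows if r[i] is not None and r[j] is not None)

# Columns are X, Y, Z; the third case is missing on Z only.
rows = [
    (1.0, 2.1, 0.5),
    (2.0, 3.9, 1.1),
    (3.0, 6.2, None),
    (4.0, 8.1, 2.2),
    (5.0, 9.8, 2.4),
]

complete = listwise_delete(rows)
print(len(complete))           # cases left after listwise deletion
print(pairwise_n(rows, 0, 1))  # cases used for r_XY
print(pairwise_n(rows, 0, 2))  # cases used for r_XZ
print(pairwise_n(rows, 1, 2))  # cases used for r_YZ
```

With n = 5, listwise deletion leaves n-1 cases for every analysis, while pairwise deletion keeps all n cases for the X and Y correlation and loses one case only where Z is involved.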

Imputation methods
Mean substitution: This is an easy and fast method in which every missing value is replaced by the mean of the total sample (Saunders et al., 2006). In practice, this method has some disadvantages. Applying it to respondents at the extremes can produce misleading results. For example, very rich and very poor people may not want to report their incomes in a telephone survey. If the population mean is substituted for these missing values, it will be a spurious guess (Acock, 2005). Mean substitution does not change the variable mean, but it should be used only if the missing data pattern is MCAR. It also reduces the variance and leads to biased and deflated standard errors (Pigott, 2001; Tabachnick and Fidell, 2001).
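The two properties noted above (unchanged mean, reduced variance) can be checked on a toy example. This is a minimal sketch with made-up values; the function name `mean_substitute` is hypothetical.

```python
import statistics

def mean_substitute(values):
    """Replace every missing value (None) with the mean of the observed values."""
    obs = [v for v in values if v is not None]
    m = statistics.mean(obs)
    return [m if v is None else v for v in values]

# Hypothetical incomes with two non-respondents.
incomes = [2.0, 4.0, None, 6.0, None, 8.0]
observed = [v for v in incomes if v is not None]
filled = mean_substitute(incomes)

# The mean is preserved, but the sample variance shrinks because
# the imputed values sit exactly at the mean.
print(statistics.mean(observed), statistics.mean(filled))
print(statistics.variance(observed), statistics.variance(filled))
```

Every imputed value adds a case with zero deviation from the mean, so the sum of squared deviations stays the same while the number of cases grows. This is why the variance, and with it the standard errors, are deflated.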

Regression imputation or conditional mean imputation: Regression imputation is a more complicated method based on regression equations. Selected predictors (those with the highest correlations) are used as independent variables and the variable with missing data is used as the dependent variable. Missing values are then predicted with the help of repeated regression equations (Saunders et al., 2006; Peugh and Enders, 2004). The advantage of this method is that it is more objective than a researcher's guess. It has two disadvantages. First, because the imputed values are predicted from the other variables, they fit better than the real scores would. Second, it reduces the variance (Tabachnick and Fidell, 2001). The methods described so far are known as conventional methods in the literature, and they produce biased estimates of parameters or of their standard errors. EM and multiple imputation are newer methods that have much better statistical properties (Allison, 2003).
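As an illustration of regression imputation, the sketch below fits a simple linear regression of Y on a single predictor X using the complete cases and replaces each missing Y with its predicted value. It is a simplified, hypothetical example (one predictor, a function named `regression_impute`, made-up data), not the procedure of any particular statistical package.

```python
def regression_impute(x, y):
    """Impute missing y values (None) from a least-squares regression of y on x."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(pairs)
    mean_x = sum(xi for xi, _ in pairs) / n
    mean_y = sum(yi for _, yi in pairs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Observed values are kept; missing values are set to the prediction.
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, None, 8.0, 10.0]
print(regression_impute(x, y))
```

Because every imputed value lies exactly on the regression line, it carries no residual variation. This is the source of both disadvantages mentioned above: the imputed data fit better than real scores would, and the variance is reduced.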

Maximum likelihood (ML): ML is known as a modern method that uses information from other variables during parameter estimation by incorporating information from the conditional distribution of the observed variables. There are three maximum likelihood estimation algorithms: the multiple-group approach, full information maximum likelihood (FIML) estimation and the expectation maximization (EM) algorithm. All three algorithms assume multivariate normality (Enders, 2001).
The multiple-group approach is difficult to implement and requires an exceptional level of expertise. For this reason it is not widespread among researchers, although it can be conducted in all structural equation modeling (SEM) software. FIML is similar to the multiple-group method, but the likelihood function is calculated at the individual level rather than the group level (Enders, 2001). Amos and Mx offer FIML. FIML has been recommended as a superior method for dealing with missing data in structural equation modeling. Specifically, Enders and Bandalos (2001) pointed out that FIML estimates are unbiased and efficient under the MCAR and MAR mechanisms.
The third ML algorithm is the expectation maximization (EM) algorithm. EM assumes a distribution for the partially missing data and bases inferences about the missing values on the likelihood under that distribution. EM is an iterative procedure that includes two steps in each iteration: expectation (E) and maximization (M) (Tabachnick and Fidell, 2001). In the E step, the conditional expectation of the missing data is calculated given the observed values and the current parameter estimates. This is done through a series of regression equations (Enders, 2001). In the M step, the parameters are estimated by maximizing the complete-data likelihood (Jing, 2012). The Statistical Package for the Social Sciences (SPSS), Estimation of Means and Covariances (EMCOV) and NORM offer the EM algorithm. Because SPSS is readily accessible, the EM algorithm was selected.
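The E and M steps can be sketched for the simplest case: two normally distributed variables, where X is fully observed and Y has missing values. The code below is a hedged illustration written for this text, not the SPSS, EMCOV or NORM implementation. The function name `em_bivariate`, the fixed number of iterations and the simulated data are assumptions.

```python
import random

def em_bivariate(x, y, n_iter=200):
    """EM estimates of the means and (co)variances of (x, y); missing y are None."""
    n = len(x)
    mu_x = sum(x) / n
    s_xx = sum((xi - mu_x) ** 2 for xi in x) / n  # x is complete, so this is final
    obs = [yi for yi in y if yi is not None]
    mu_y = sum(obs) / len(obs)                     # starting values from observed y
    s_yy = sum((yi - mu_y) ** 2 for yi in obs) / len(obs)
    s_xy = 0.0
    for _ in range(n_iter):
        # E step: expected y and y^2 for missing cases, given x and current estimates.
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for xi, yi in zip(x, y):
            if yi is None:
                ey = mu_y + beta * (xi - mu_x)     # regression prediction
                eyy = ey ** 2 + resid_var          # adds back residual variance
            else:
                ey, eyy = yi, yi ** 2
            sum_y += ey
            sum_yy += eyy
            sum_xy += xi * ey
        # M step: update parameters from the expected sufficient statistics.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy

# Illustrative data with about 30% of y missing completely at random.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y_miss = [None if rng.random() < 0.3 else yi for yi in y]
print(em_bivariate(x, y_miss))
```

The E step here is a regression prediction, as described above, but unlike plain regression imputation it also adds the residual variance back into the expected squared values, so the estimated variance of Y is not deflated.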
Multiple imputation: Multiple imputation (MI) is another complicated missing data handling method, in which a number of imputed datasets (frequently between 5 and 10) are created, each containing different estimates of the missing values. The parameter estimates from these datasets are averaged to produce a single set of results (Peugh and Enders, 2004; Rose and Fraser, 2008). Maximum likelihood and multiple imputation have been recommended to researchers because of their less restrictive assumptions, strong theoretical foundations, less biased results and greater statistical power (Enders, 2013). Enders and Bandalos (2001) stated that maximum likelihood and multiple imputation give accurate estimates under the MCAR and MAR mechanisms. As missing data are frequently encountered in behavioral and psychological research, much attention has been given to analyzing structural equation models in the presence of missing data.
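The averaging step in multiple imputation is commonly carried out with Rubin's rules, which pool the point estimates and combine the within-imputation and between-imputation variances. The sketch below assumes that the estimates and their squared standard errors from m imputed datasets are already available. The function name `pool` and the numbers are hypothetical.

```python
import statistics

def pool(estimates, variances):
    """Pool m estimates and their sampling variances with Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance
    total = within + (1 + 1 / m) * between    # total variance of the pooled estimate
    return q_bar, total

# Hypothetical factor loading estimated in m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.01, 0.01, 0.01, 0.01, 0.01]
q_bar, total = pool(estimates, variances)
print(q_bar, total ** 0.5)  # pooled estimate and its standard error
```

The between-imputation term is what distinguishes MI from single imputation: it carries the uncertainty about the missing values into the pooled standard error.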

Confirmatory factor analysis
A central concern in the development and use of measurement instruments is the degree to which they measure what they are meant to measure, that is, whether their structures are valid. Confirmatory factor analysis (CFA) is among the most important methodological approaches for analyzing the validity of factorial structures within the framework of SEM (Byrne, 2001).
CFA is used for evaluating and testing the hypothesized factor structure of scores obtained from various measurement instruments and the relations among latent constructs (for example, attitudes, traits, intelligence, clinical disorders) in counseling and education (Sun, 2005; Jackson et al., 2009). CFA generates various statistics that indicate how well competing models explain the covariation among the variables, that is, how well they fit the data. These statistics are called "fit statistics" (Gillapsy, 1996). The correspondence between hypothesized latent variable models and the observed data is quantified by fit indices (Hu and Bentler, 1995). Multiple imputation, a method for handling missing data, is rarely used for SEMs, whereas the EM method, which is based on ML estimation, is commonly used for missing data imputation. EM performs well under different conditions in simulation studies (Zhang, 2010). Regression imputation gives unbiased estimates of missing values when the data are MCAR (Tannenbaum, 2009).
Determining the relations among variables or constructs is of crucial importance in measurement. These constructs can be affected by many threats. Threats to reliability automatically affect construct validity. Missing data is one of the threats to internal consistency reliability (Kenny, 2005). A measure's internal consistency reliability is defined as the proportion of true-score variance relative to observed-score variance:
$$\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}$$
where $\sigma_T^2$ is the true-score variance, $\sigma_E^2$ is the error variance and $\sigma_X^2$ is the observed-score variance. As can be seen from the equation, as the error variance increases, reliability decreases. With respect to missing data, lost information can give rise to larger amounts of error variance. Missing data has negative effects on research results: it contributes to biased results, makes it difficult to draw valid and efficient inferences about a population, decreases statistical power and can lead to violations of statistical assumptions (Schafer and Graham, 2002; Kang et al., 2005; Kenny, 2005; Tannenbaum, 2009; Rose and Fraser, 2008).
There has been considerable interest in the effect of missing data handling methods on structural equation modeling (Enders and Bandalos, 2001; Chen et al., 2012; Enders, 2001; Allison, 2003). The aforementioned research indicates that missing data handling methods affect goodness of fit statistics. Specifically, this study asks which missing data handling method works best for goodness of fit statistics in CFA when the proportion of missing data and the sample size are known.

METHODOLOGY
Data simulation
The primary source of data used for the statistical analyses in this study was a simulated dataset. RStudio was used to generate the datasets. One reason for using simulated data was that it is difficult to satisfy all of the assumptions under experimental conditions, such as sample sizes ranging from very small to very large with data missing at different rates in these samples. A second reason is that, since we start with a complete dataset, it is relatively straightforward to observe the effect of missing data on goodness of fit statistics by comparing results directly between the complete and incomplete datasets. This allows one to evaluate objectively how much of the error can be corrected by using a particular missing data method.
Four datasets (1 to 0) with 25 replications were simulated, each including 10 continuous variables (Table 1). These 10 continuous variables have a multivariate normal distribution. For ease of interpretation, all variables were specified to have a mean of 0 and a standard deviation of 1.
Each of these samples was then reduced in size by 1, 5, 10 and 20% in order to simulate datasets containing missing data. The cases were discarded randomly from each complete sample separately in order to make sure that there were no dependencies between samples. For example, 10 cases were randomly removed from a sample of size n=100 in order to obtain a partial sample containing 10% missing data, n=90. To obtain a sample with 20% missing data, 20 cases were randomly removed from the original sample of n=100, rather than removing 10 additional cases from the n=90 sample.
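The case-removal design described above can be sketched as follows. The study generated its data in RStudio; this Python version is only an illustration of the procedure. The function names `simulate` and `drop_cases` are hypothetical, and because the text does not report the correlation structure of the 10 variables, the sketch assumes independent standard normal variables.

```python
import random

def simulate(n, p=10, seed=0):
    """Generate n cases on p variables, each with mean 0 and standard deviation 1.

    Assumption: variables are independent; the source does not give a
    covariance structure.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

def drop_cases(data, pct, seed=0):
    """Randomly discard pct percent of the cases from the complete sample."""
    rng = random.Random(seed)
    k = round(len(data) * pct / 100)
    drop = set(rng.sample(range(len(data)), k))
    return [row for i, row in enumerate(data) if i not in drop]

full = simulate(100)
for pct in (1, 5, 10, 20):
    # Each reduced sample is drawn from the complete sample, not from
    # a previously reduced one, so the samples do not depend on each other.
    reduced = drop_cases(full, pct, seed=pct)
    print(pct, len(reduced))
```

Drawing every reduced sample from the same complete sample mirrors the n=100 example in the text, where the 20% condition removes 20 cases from the original data rather than 10 more from the n=90 sample.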

Method of analysis
The missing data handling methods (listwise deletion, regression imputation, EM imputation and FIML) were applied under CFA to all samples containing missing data. The main consideration behind the choice of CFA was its widespread use among educational and psychological researchers. Because the data used in this study are simulated, an MCAR pattern can be produced by randomly discarding cases.

RESULTS
The results of the analytical procedures described in the methods section are presented in this section for the simulated data. In order to see the effects of the missing data handling methods, goodness of fit indices were first calculated separately for each original data sample (Table 2).
For a sample size of 100, the goodness of fit indices (GFIs) for all percentages of missing data are presented in Table 3. The figures in Table 3 show some important results. When the proportion of missing data is 1%, the listwise deletion method performs visibly well. That is, if the proportion of missing data is 1% or smaller, cases with missing data can be discarded from the dataset, because this proportion of missingness has no effect on the GFIs in confirmatory factor analysis.
When the proportion of missing data is 5% or more, the FIML method yields better fit indices than the other methods. In addition, the listwise deletion method performs worse than the other methods, because CFA works well in large samples and listwise deletion reduces the sample size.
For a sample size of 200, the GFIs for all percentages of missing data are presented in Table 4. The EM and FIML methods perform visibly well at all proportions of missing data. When the proportion of missing data is 1%, the listwise deletion method also works well. It is also worth pointing out that the regression imputation method produces the worst fit indices compared with the other missing data handling methods.
When the sample size is 500, the EM and FIML methods are the best missing data handling methods, because they produce more acceptable goodness of fit indices than the other methods. The listwise deletion method works well when the percentage of missing data is 1% (Table 5).
For a sample size of 1000, the GFIs for all percentages of missing data are presented in Table 6. The FIML method performs prominently well at all proportions of missing data and is the best missing data handling method, because it produces more acceptable goodness of fit indices. When the proportion of missing data is 1%, the listwise deletion method works well alongside FIML. It is also important to point out that the regression imputation method generates the worst fit indices compared with the other missing data handling methods.

DISCUSSION AND CONCLUSIONS
The primary objective of this study was to examine the effect of missing data on goodness of fit statistics in SEM. To this end, four missing data handling methods (listwise deletion, FIML, regression imputation and EM imputation) were examined across sample sizes and proportions of missing data. Under small-sample and low-missingness conditions, the results imply that listwise deletion, one of the simplest and least computation-intensive methods, can be used.
Furthermore, the listwise deletion method is definitely not recommended for CFA if the sample size is large and the proportion of missingness is high. Decreasing the sample size through listwise deletion has a negative effect on fit indices.

Table 1. Summary of sample sizes used in missing data analysis.

Table 2. Goodness of fit indices for each sample size.