DIF analysis across genders for the reading comprehension section of an English as a foreign language achievement exam

The purpose of this study is to carry out differential item functioning (DIF) analysis for the content areas of a reading comprehension subtest using four area indices within the Item Response Theory (IRT) framework. The differences in the magnitudes of the area indices were compared across the subject areas, and the DIF analysis was carried out across gender groups only. The item-level data of the English reading comprehension subtest were gathered from the English Language Achievement Exam administered at the School of Foreign Languages, Ege University, Turkey, in 2013. A sample of 2,117 examinees (1,011 males and 1,116 females) was randomly selected. For the DIF analysis, (a) an IRT model for the item characteristic curves was specified; (b) model-data fit was investigated for the selected IRT model; (c) item characteristic curves were computed separately for each group on a common scale; and finally, (d) indices indicating the degree of DIF on each item were computed. The results indicated that, in most cases, both the un-weighted and weighted area indices revealed non-uniform DIF in the item characteristic curves of the reading comprehension subtest. A significant correlation was also observed between the un-weighted and weighted area indices.


INTRODUCTION
A number of researchers have indicated that there are gender differences in verbal ability achievement on items requiring inference or application, particularly for science-related content (Lawrence and Curley, 1989; Lawrence et al., 1988). For instance, males are likely to perform better than their female counterparts on verbal items related to the natural sciences or technical content, whereas females are more successful than males on items in the social sciences, arts, or humanities. Payne and Lynn (2011) showed that females performed significantly better than males in second language reading comprehension when the groups were matched on relevant variables, which suggests that females have a stronger module for second language processing than males do. Moreover, Pae (2004b) showed that gender differences on reading comprehension items classified as logical inference were highly likely to favor males. As stated by Pae (2004a, 2012), conducting gender DIF studies of language tests will encourage language researchers to analyze the various components that seem to contribute to the apparent gender difference in language tests. This will thereby provide valuable information for classroom teachers and their students, for educators' curriculum design, for foreign language instructors' assessment of students, and ultimately for administrators' decisions on language policy. For the reasons explained above, identifying test items that show differential item functioning in favor of either gender group is an important issue, especially in selection and placement decisions intended to allow equal opportunity for persons of equal ability. Such persons may belong to a disadvantaged group, which in most cases is not taken into consideration.
Since tests are increasingly used as devices for evaluation and placement, test constructors must make every effort to remove any possible bias from the items. Statistical differential item functioning analysis techniques are frequently used to detect items that exhibit DIF in a test or in a pool of questions in order to increase test validity. Traditionally, item analysis procedures have been used to determine empirically whether items function as intended. Differential item functioning detection techniques serve the same purpose but specifically focus on the validity of items for the different subgroups that take the test (Ironson and Craig, 1982; Shepard, Camilli and Williams, 1985; Maller, 2001). Early DIF detection approaches were based on classical test theory (CTT) (Ironson and Craig, 1982; Hambleton and Rogers, 1988; Seong and Subkoviak, 1987; Lim and Drasgow, 1990). Comparison of item difficulty levels (p values) and item discrimination parameters (r) (that is, the extent to which an item is able to differentiate among individuals possessing different levels of the ability or trait underlying the test or scale), which are based on CTT, confounds DIF with true subpopulation differences in the trait. This may lead to a high rate of incorrectly concluding that items exhibit DIF when they truly do not (Shepard et al., 1985; Lim and Drasgow, 1990). Furthermore, significant differences in p values may result from psychometrically desirable item discrimination power rather than real DIF. Finally, the use of p values may lead to a large number of Type II errors by failing to detect items that truly function differently across groups (Lim and Drasgow, 1990). Another method within the CTT tradition is the chi-square approach (Scheuneman, 1987), which assumes that examinees from different groups but with roughly equivalent total test scores should have equal probabilities of success on an item.
A theoretically more appropriate framework for the study of DIF, which avoids the serious deficiencies inherent in CTT approaches, can be found in item response theory (IRT) (Hambleton and Rogers, 1988; Hambleton et al., 1991; Shepard et al., 1984; Devine and Raju, 1982). Their close connection to the most widely accepted definition of item bias has made IRT-based methods popular, and some researchers consider them "theoretically preferred" (Shepard et al., 1985; Lim and Drasgow, 1990). This definition states that an item is biased, or shows DIF, if examinees of the same ability but from different sub-groups do not have the same probability of responding correctly to the item (Adams and Rowe, 1988; Lim and Drasgow, 1990; Mellenbergh, 1989; Hambleton et al., 1991; Raju et al., 1993). Central to IRT is a functional relationship between the probability Pi(θ) that individuals with a specific ability θ will respond correctly to an item and certain item characteristics, or item parameters (Lim and Drasgow, 1990). IRT models often use an item discrimination parameter (a), an item difficulty parameter (b), and a pseudo-chance level (c). The graph of Pi(θ) as a function of θ is known as an item characteristic curve (ICC). For DIF research, the most important advantage of IRT over CTT is the sub-population-invariance property of the ICC and item parameters. Analogous to the coefficients of a simple linear regression equation, these parameters are invariant when sub-populations with different θ distributions are formed. Consequently, IRT-based item parameters do not confound DIF with subpopulation differences in ability (Lim and Drasgow, 1990). From an IRT perspective, then, DIF exists when individuals from different subgroups, but possessing identical levels of the latent trait (θ), have unequal probabilities of correctly answering an item.
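The functional relationship just described can be illustrated with a short sketch. The parameter values below are illustrative only (they are not estimated from the study's data), and the three-parameter form with the usual D = 1.7 scaling constant is shown simply because it includes all three parameters a, b and c mentioned above:

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC: the probability P_i(theta) of a
    correct response, with discrimination a, difficulty b, and
    pseudo-chance level c (D = 1.7 is the conventional scaling constant)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

# Illustrative parameters: at theta == b the probability is exactly
# halfway between the guessing floor c and 1.
p_at_b = icc_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)  # -> 0.6
```

Plotting this function over a range of θ values produces the ICC; comparing the curves obtained from two sub-groups is the basis of the DIF analyses that follow.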
For an item to be labelled as measuring identically across groups, its ICCs and item parameters must be identical (within sampling error) across the different sub-populations (Lim and Drasgow, 1990). Thus, the study of DIF within an IRT framework is a matter of comparing the item characteristic curves (ICCs) of the two sub-groups (that is, males vs. females) (Hambleton and Swaminathan, 1985; Mellenbergh, 1989).
In this study, ICCs were compared across gender groups to provide evidence on whether either group had an advantage on the syntax, connecting and synthesizing, vocabulary, and extracting explicit information questions in the content areas of the reading comprehension subtest of the English Language Achievement Exam of the School of Foreign Languages at Ege University, Turkey. More specifically, the present study was designed to accomplish the following purposes: (i) to examine the direction of DIF for each test item in the reading comprehension subtest across males versus females; (ii) to compare the magnitudes of the DIF indices between the syntax, connecting and synthesizing, vocabulary, and extracting explicit information items in the reading comprehension subtest; and (iii) to compare the different area indices used in the DIF analysis.

METHOD

Subjects
The item-level data of the English Reading Comprehension subtest were gathered from the English Language Achievement Exam of the School of Foreign Languages at Ege University, Turkey, in 2013. A sample of 2,117 examinees (1,011 males and 1,116 females) was randomly selected from the entire examinee population. Approximately 52.7% were females, and their ages ranged from 17 to 19.

Instrument
The English Reading Comprehension Achievement subtest consists of a total of 40 items. It focuses on measuring students' achievement in understanding and evaluating what is read in English. The subtest items can be classified into four types: (a) syntax, (b) connecting and synthesizing, (c) vocabulary, and (d) extracting explicit information. All 40 English Reading Comprehension items use a multiple-choice format with four options. Reliability for the sample was estimated using the KR-20 coefficient, which was 0.82 for the 40 items. Moreover, the factor structure of the English Reading Comprehension subtest was examined in order to determine whether the data met the unidimensionality assumption of the IRT models. A scree test indicated a sharp decrease from the first eigenvalue to the second in the total sample (the first four eigenvalues were 5.93, 1.32, 1.11, and 1.03, respectively). The structure observed in the factor analyses indicates that the English Reading Comprehension subtest contains a strong primary factor. This finding constitutes evidence that the data on the English Reading Comprehension subtest met the unidimensionality assumption of IRT scaling.

DIF analysis procedure
The following steps were followed in the DIF analysis: (1) a model was specified for the item characteristic curve; (2) model-data fit was investigated for the selected IRT model; (3) ICCs were computed separately on a common scale for each group; and (4) indices indicating the degree of DIF on each item were computed (Mellenbergh, 1989).
In the present study, IRT DIF area indices were used. A major advantage of this method is that the various ICC models describe the item parameters independently of the samples used for estimation (Devine and Raju, 1982). This means that the item parameters are independent of the distributional characteristics of the sample; therefore, under any IRT model, parameters from different samples should be equal. Items whose parameters are notably unequal violate the assumptions of the model and consequently are said to be biased, because they may be measuring something different for a particular group (Adams and Rowe, 1988). DIF area indices can be classified as weighted or un-weighted: an index is un-weighted if the number of examinees in the two groups is not taken into account. Either kind can be signed or unsigned; the signed area index is used to see the direction of DIF, whereas an unsigned index detects DIF but cannot show its direction.
The un-weighted unsigned area index is

\[ \mathrm{USA}_i = \int_{p}^{q} \left| P_{1i}(\theta) - P_{2i}(\theta) \right| \, d\theta \]

where P1i(θ) and P2i(θ) are the item characteristic curves in the first and second groups, and p and q delimit the range of abilities of interest. The un-weighted signed area index is

\[ \mathrm{SA}_i = \int_{p}^{q} \left( P_{1i}(\theta) - P_{2i}(\theta) \right) \, d\theta \]

The signed index shows the direction of DIF. If P1i(θ) is usually larger than P2i(θ), the measure is positive, indicating that the item shows DIF in favor of the first group; a negative index implies that the DIF is in favor of the second group (Mellenbergh, 1989). If the two groups being compared have different numbers of examinees, then the area between the two ICCs is weighted, as Linn et al. proposed (cited in Shepard et al., 1985). The weighted index gives more weight to parts of the ability scale where most of the data are concentrated and where the difference between the two probabilities at θj has a small variance. Like the un-weighted index, the weighted index can be signed or unsigned.
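Under these definitions, the un-weighted indices can be approximated numerically. The sketch below uses two-parameter logistic ICCs with illustrative parameters (not the study's estimates) and a simple Riemann sum over the ability range:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC (D = 1.7)."""
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def unweighted_areas(a1, b1, a2, b2, lo=-3.0, hi=3.0, n=6001):
    """Un-weighted signed (SA) and unsigned (USA) area indices between
    the ICCs of two groups, approximated by a Riemann sum over [lo, hi]."""
    theta = np.linspace(lo, hi, n)
    d_theta = theta[1] - theta[0]
    diff = icc_2pl(theta, a1, b1) - icc_2pl(theta, a2, b2)
    sa = (diff * d_theta).sum()
    usa = (np.abs(diff) * d_theta).sum()
    return sa, usa

# Uniform DIF: equal discrimination, harder item for group 2 -> SA = USA > 0.
sa_u, usa_u = unweighted_areas(1.0, -0.5, 1.0, 0.5)
# Non-uniform DIF: ICCs cross at theta = 0 -> signed areas cancel, |SA| << USA.
sa_n, usa_n = unweighted_areas(1.5, 0.0, 0.6, 0.0)
```

The two example calls illustrate the cancellation property discussed below: when the ICCs cross, the signed index shrinks toward zero while the unsigned index remains large.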
The weighted unsigned index used in the present study is

\[ \mathrm{WUSA}_i = \frac{1}{N} \sum_{j=1}^{N} \frac{\left| P_{1i}(\theta_j) - P_{2i}(\theta_j) \right|}{s_i(\theta_j)} \]

where N is the total number of examinees in the two groups and si(θj) is the variance error of the difference in the two probabilities at θj.
The corresponding signed index is

\[ \mathrm{WSA}_i = \frac{1}{N} \sum_{j=1}^{N} \frac{P_{1i}(\theta_j) - P_{2i}(\theta_j)}{s_i(\theta_j)} \]

The interpretation of the area indices should distinguish between uniform and non-uniform DIF. The ICCs can have the same shape in the two groups while failing to coincide, as when the item is more difficult in the male group than in the female group; this is called uniform DIF. It is also possible for the ICCs to cross each other on the ability continuum, which indicates non-uniform DIF on the item. Suppose, for example, that the ICCs cross at the ability point θ = 0: for abilities below 0 the item is more difficult in the female group, and for abilities above 0 it is more difficult in the male group. With the signed area indices, if there is non-uniform DIF on the item, the areas below and above the crossing point can cancel each other, so DIF may not be identified even though the ICCs do not coincide. Therefore, both the signed and unsigned un-weighted area indices were considered in order to obtain evidence about the uniformity or non-uniformity of DIF in the comparisons. The Kruskal-Wallis test was used to evaluate the magnitudes of the area indices across the reading comprehension subject areas of syntax, connecting and synthesizing, vocabulary, and extracting explicit information. To compare the different area indices with one another, correlation coefficients were used.
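One way to realize the weighted indices is to evaluate the ICC difference at the pooled examinees' θ values, so that densely populated regions of the ability scale automatically receive more weight. The sketch below assumes this form; the variance term si(θj) is represented by an analyst-supplied placeholder function, and all parameter values are illustrative:

```python
import numpy as np

def icc_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))

def weighted_areas(theta_pool, a1, b1, a2, b2, s):
    """Weighted signed (WSA) and unsigned (WUSA) indices, evaluated at
    the pooled examinees' theta values; s(theta) returns the variance
    term for the difference in the two probabilities (supplied by the
    analyst -- a placeholder here)."""
    diff = icc_2pl(theta_pool, a1, b1) - icc_2pl(theta_pool, a2, b2)
    w = 1.0 / s(theta_pool)
    wsa = (diff * w).sum() / theta_pool.size
    wusa = (np.abs(diff) * w).sum() / theta_pool.size
    return wsa, wusa

rng = np.random.default_rng(1)
theta_pool = rng.normal(size=2117)          # pooled N matches the study
s_flat = lambda t: np.full_like(t, 0.05)    # placeholder variance term
wsa, wusa = weighted_areas(theta_pool, 1.0, -0.5, 1.0, 0.5, s_flat)
```

Because the θ values are drawn from the examinee pool, regions where few examinees fall contribute little to the sums, which is exactly the weighting behavior described above.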
In the first step of the analysis, the fit of the one-, two- and three-parameter IRT models was examined in each of the gender groups. As a result of these analyses, it was decided to use the two-parameter Birnbaum (1968) model in the ICC comparisons, since it provided better overall fit statistics than the other models.
Two software packages were used for the analyses. The BILOG program (Mislevy and Bock, 1986), with Bayesian modal estimation, was used for test calibration; the Calcbias program developed by Oort (1992) was used for the DIF analysis.

RESULTS
Before the DIF analysis, the reading comprehension subtest was rescaled to equate the mean θ to 0 and the standard deviation to 1 across the gender groups, in order to establish a comparable unit for the DIF analysis. The comparisons of the ICCs were made in the form of female versus male. Four area indices were calculated for each subject area of the reading comprehension subtest: the un-weighted signed area (SA), un-weighted unsigned area (USA), weighted signed area (WSA) and weighted unsigned area (WUSA) indices. The signed area indices indicated the direction of DIF: a positive index was interpreted as DIF against the female group, and a negative index as DIF against the male group. In order to determine whether there were significant mean rank differences among the magnitudes of the area indices across the different subject areas of the items with respect to gender groups, the Kruskal-Wallis non-parametric test was carried out. The significance level for all tests was set at p = 0.05. The results of the analyses are presented separately for each subtest.
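The rescaling step can be sketched as follows; the θ values here are simulated for illustration, and only the group sizes match the study:

```python
import numpy as np

def rescale_common(theta_female, theta_male):
    """Rescale both groups' theta estimates so the pooled sample has
    mean 0 and standard deviation 1 (a common scale for ICC comparison)."""
    pooled = np.concatenate([theta_female, theta_male])
    mu, sd = pooled.mean(), pooled.std()
    return (theta_female - mu) / sd, (theta_male - mu) / sd

rng = np.random.default_rng(2)
f_raw = rng.normal(0.1, 1.1, size=1116)   # illustrative female thetas
m_raw = rng.normal(-0.1, 0.9, size=1011)  # illustrative male thetas
f_s, m_s = rescale_common(f_raw, m_raw)
pooled = np.concatenate([f_s, m_s])       # mean ~0, sd ~1 by construction
```

Note that the groups are standardized jointly, not separately: standardizing each group on its own would erase the very group difference in θ that the DIF analysis needs to preserve.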
After the first scaling, 40 reading comprehension items were rescaled for each sex, in order to compute the ICCs separately for each group on a common scale. Finally, indices were calculated indicating the degree of DIF on the items.
In the reading comprehension subtest, 4 items are in the syntax, 17 items are in the connecting and synthesizing, 13 items are in the vocabulary and 6 items are in the extracting explicit information sections with respect to subject areas. The area indices with respect to the gender groups are given in Table 1.
When the values of the un-weighted indices are evaluated, there are 4 indices (10.0%) identified as jump outs in the un-weighted signed area (SA), but 12 indices (30.0%) identified as jump outs in the un-weighted unsigned area (USA). However, when the weighted indices are considered, no jump out indices are observed in either the WSA or the WUSA.
On the other hand, when the un-weighted signed and unsigned area indices (SA and USA) were compared, the values of the unsigned area indices were larger than those of the signed area indices. This is evidence of non-uniformity in the ICCs across the gender groups.
As shown in Table 1, DIF was observed with respect to both the weighted and un-weighted area indices. There are 21 items (52.5%) that showed DIF against males according to the un-weighted signed area (SA) indices, whereas 13 of the 40 items (32.5%) showed DIF against males according to the weighted signed area (WSA) indices.
As indicated in Table 1, 19 of the 40 items (47.5%) showed DIF against females with respect to the un-weighted signed area (SA) indices, whereas 27 items (67.5%) did so with respect to the weighted signed area (WSA) indices. Thus, according to the weighted area indices, more items showed DIF in favor of males. As for the vocabulary section, 6 items (46.15%) showed DIF in favor of females with respect to the un-weighted signed area (SA) indices, while only 1 item (7.69%) did so with respect to the weighted signed area (WSA) indices (Table 1). On the other hand, in the extracting explicit information section, 5 items (83.33%) showed DIF in favor of females according to the un-weighted signed area (SA) indices. For the weighted signed area (WSA) indices, 13 items (50.0%) showed DIF in favor of the female group with respect to the vocabulary section (Table 2).

In order to compare the magnitudes of the area indices with respect to the subject areas of the reading comprehension subtest items, the non-parametric Kruskal-Wallis test was used.
According to the results, no significant mean rank differences were observed among the area indices across the items in the subject areas of syntax, connecting and synthesizing, vocabulary, and extracting explicit information.
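A Kruskal-Wallis comparison of index magnitudes across the four content areas can be sketched as follows. The index values below are hypothetical (only the group sizes, 4/17/13/6, match the subtest's content areas):

```python
from scipy.stats import kruskal

# Hypothetical |SA| magnitudes grouped by content area (illustrative
# numbers only, not the study's actual indices).
syntax   = [0.12, 0.30, 0.08, 0.21]
connect  = [0.15, 0.22, 0.09, 0.31, 0.11, 0.18, 0.25, 0.14, 0.20,
            0.07, 0.28, 0.16, 0.19, 0.10, 0.23, 0.13, 0.26]
vocab    = [0.17, 0.24, 0.08, 0.29, 0.12, 0.21, 0.15, 0.19, 0.11,
            0.27, 0.14, 0.22, 0.09]
explicit = [0.13, 0.20, 0.25, 0.10, 0.18, 0.16]

# H statistic and p value; p > 0.05 would mean no significant mean rank
# difference among the content areas, as reported in the study.
h, p = kruskal(syntax, connect, vocab, explicit)
```

The Kruskal-Wallis test is the appropriate choice here because the content-area groups are small and unequal in size, so the normality assumptions of a one-way ANOVA would be hard to defend.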
The correlations among the four area indices in the Reading Comprehension subtest comparisons indicated that there were two significant correlations (Table 3).
The first was observed between the un-weighted signed area (SA) indices and the weighted signed area (WSA) indices, with a correlation coefficient of r = .407 (p < .001). The second was observed between the weighted signed (WSA) indices and the weighted unsigned (WUSA) area indices, with r = .71 (p < .001).
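A correlation of this kind can be computed as below. The SA and WSA values are hypothetical, generated to mimic a moderate relationship over 40 items:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical index values for 40 items (illustrative only): the WSA
# values are built as a noisy linear function of the SA values.
rng = np.random.default_rng(3)
sa = rng.normal(0.0, 0.2, size=40)
wsa = 0.5 * sa + rng.normal(0.0, 0.2, size=40)

# Pearson correlation coefficient and its two-sided p value.
r, p = pearsonr(sa, wsa)
```

With n = 40 items, a coefficient of around .4 is modest but typically reaches significance, which is consistent with the r = .407 reported above being significant despite the relationship not being strong.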

DISCUSSION AND CONCLUSION
When the DIF analysis of the reading comprehension subtest is considered, the items do not function as intended, since they are not free from DIF. In the subtest, most of the items (21 items) favored the female group with respect to the un-weighted signed area indices. In contrast, with respect to the weighted signed area (WSA) indices, 27 of the 40 cases indicated DIF in favor of the male group. When the mean theta levels obtained for each group are compared, a difference is observed across the gender groups, which justifies weighting the ICCs in the DIF analysis. For this reason, considering the weighted area indices might be more reasonable for this subtest; if the weighted indices are considered, males have a greater advantage in responding to the reading comprehension items. The results also indicated that the differences between the magnitudes of the un-weighted SA and USA indices revealed non-uniform DIF in almost all of the cases in which ICCs were compared. This finding implies that students from different sub-groups with identical levels of ability have unequal probabilities of answering the items correctly. In this subtest, jump outs were observed only in the un-weighted indices.
The analysis also revealed a significant correlation between the un-weighted signed area (SA) indices and the weighted signed area (WSA) indices in the reading comprehension data set, although the relationship is not strong (r = .407, p < .001). It is not surprising to obtain low correlations among the different area indices, because there are differences in how the four indices are calculated. Furthermore, the weighted area indices (WSA and WUSA) have a significant and high correlation with each other. When the different DIF indices were considered, opposite results were obtained, so there was a problem in deciding which area indices were more meaningful in the DIF analysis. If the mean theta levels differ across the groups, the weighted indices should be preferred; otherwise, the un-weighted indices might be useful.
With respect to the un-weighted signed area (SA) indices, the female group had an advantage in responding to the reading comprehension items; with respect to the weighted signed area (WSA) indices, however, the male group seemed to have the greater advantage. In the subtest, the un-weighted indices indicated non-uniform DIF for almost all of the items in the ICC comparisons, and jump outs were again observed only in the un-weighted indices (SA and USA). This implies that students from different subgroups with identical levels of ability have unequal probabilities of answering the items correctly, as indicated by Adams and Rowe (1988), Lim and Drasgow (1990), Mellenbergh (1989), Hambleton et al. (1991) and Raju et al. (1993).
Furthermore, when assessing second language reading comprehension, instructors should be sensitive to gender differences in performance on specific reading comprehension items that favor one gender group over the other, which will contribute to a fair assessment of reading comprehension, as indicated by Pae (2004b). In addition, the design of a reading comprehension curriculum should take into account information about differential performance by gender, such as that reported in this study. In this way, teachers can take proactive steps to minimize gender differences in the higher-level thinking skills that are critical to efficient reading comprehension.

RECOMMENDATIONS FOR FURTHER RESEARCH
Further DIF studies can be conducted with examinees from different school types, branches and curricula. The statistical properties of the subtests should be examined in more detail to analyze the reasons for the DIF observed in the Reading Comprehension subtest, and special emphasis should be given to analyzing whether selection decisions are affected by DIF.