Open Access

Gender-related Differential Item Functioning Analysis on an ESL Test


DOI: 10.23977/langta.2020.030102


Don Yao 1, Kayla Chen 2


1 Language Assessment Seminar Research (LASeR) Group, Department of English, Faculty of Arts and Humanities, University of Macau, Macau, China
2 Xingang Middle School, Huangpu District, Guangzhou, China

Corresponding Author

Don Yao


Differential item functioning (DIF) analysis is a technique used to examine whether test items function differently across groups. It helps detect bias in an assessment and thereby supports its fairness. However, most previous research has focused on high-stakes assessments. There is a dearth of research on low-stakes assessments, which are also significant for test development and validation. Additionally, gender differences in test performance are a recurring concern for researchers evaluating whether a test is fair. The present study investigated whether items of the General English Proficiency Test for Kids (GEPT-Kids) are free of gender bias. A mixed-methods sequential explanatory research design with two phases was adopted. In Phase I, test performance data from 492 participants in five Chinese-speaking cities were analyzed with the Mantel-Haenszel (MH) method to detect gender DIF. In Phase II, items that manifested DIF were subjected to content analysis by three experienced reviewers to identify possible sources of DIF. The results showed that three items exhibited moderate gender DIF according to the statistical analysis, and three items were identified as possibly biased by expert judgment. The results provide a preliminary contribution to DIF analysis of low-stakes assessments in the field of language assessment. Moreover, young language learners, especially in the Chinese context, have drawn renewed attention. The results may therefore also add to the body of literature informing test development for young language learners.
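To illustrate the kind of computation the Mantel-Haenszel method involves, the following is a minimal sketch for a single dichotomously scored item. It is not the authors' implementation, and the abstract does not say which software was used. The function names `mh_d_dif` and `ets_category`, the record layout, and the toy data are all hypothetical. Test takers are matched on total score, a 2x2 table (group by right/wrong) is formed in each score stratum, and the common odds ratio is converted to the ETS delta scale with the factor -2.35. The A/B/C labels use the conventional magnitude thresholds only; the full ETS rules also require significance tests, which are omitted here as a simplification.

```python
import math
from collections import defaultdict


def mh_d_dif(records):
    """Return (common odds ratio, MH D-DIF) for one item.

    records: iterable of (group, total_score, correct), where group is
    "ref" or "focal" and correct is 0 or 1. This layout is an assumption
    made for the sketch.
    """
    # One 2x2 table per matching-score stratum:
    # [ref right, ref wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n  # ref right * focal wrong
        den += b * c / n  # ref wrong * focal right
    if num == 0 or den == 0:
        raise ValueError("odds ratio undefined for these data")

    alpha = num / den
    # Negative D-DIF values indicate an item that favors the reference group.
    return alpha, -2.35 * math.log(alpha)


def ets_category(d_dif):
    """Magnitude-only A/B/C label (significance tests omitted)."""
    size = abs(d_dif)
    if size < 1.0:
        return "A"  # negligible
    if size >= 1.5:
        return "C"  # large
    return "B"      # moderate


# Hypothetical toy data: two score strata, eight test takers in each.
toy = (
    [("ref", 1, 1)] * 3 + [("ref", 1, 0)] * 1
    + [("focal", 1, 1)] * 1 + [("focal", 1, 0)] * 3
    + [("ref", 2, 1)] * 2 + [("ref", 2, 0)] * 2
    + [("focal", 2, 1)] * 2 + [("focal", 2, 0)] * 2
)
alpha, d_dif = mh_d_dif(toy)
print(alpha, d_dif, ets_category(d_dif))
```

In a gender DIF analysis, one gender would serve as the reference group and the other as the focal group, and the function would be run once per item. Real analyses typically also report the MH chi-square statistic, which this sketch leaves out.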


Differential Item Functioning, Gender, Mantel-Haenszel Method, Young Language Learners, Bias


Don Yao, Kayla Chen, Gender-related Differential Item Functioning Analysis on an ESL Test. Journal of Language Testing & Assessment (2020) Vol. 3: 5-19. DOI: 10.23977/langta.2020.030102.



All published work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright © 2016 - 2031 Clausius Scientific Press Inc. All Rights Reserved.