Exploring Item Bank Stability through Live and Simulated Datasets
DOI: 10.23977/langta.2022.050102
Tony Lee 1, David Coniam 1, Michael Milanovic 1
1 LanguageCert, UK
Corresponding Author: Tony Lee
LanguageCert manages the construction of its tests, exams and assessments using a sophisticated item banking system containing large amounts of test material that is described, inter alia, in terms of content characteristics (such as macroskills and grammatical and lexical features) and measurement characteristics (such as Rasch difficulty estimates and fit statistics). In order to produce test forms that are equivalent in content and difficulty, it is vital that the items in any LanguageCert bank manifest stable measurement characteristics.
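To illustrate the kind of measurement characteristics mentioned above, the sketch below computes infit and outfit mean-square fit statistics under the dichotomous Rasch model. This is a minimal illustration, not LanguageCert's actual system: the function names are hypothetical, and it assumes person abilities and item difficulties (in logits) have already been estimated.

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct response) under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def fit_statistics(responses, theta, b):
    """Infit and outfit mean-square statistics for each item.

    responses: (persons x items) matrix of 0/1 scores
    theta:     person ability estimates (logits)
    b:         item difficulty estimates (logits)
    """
    p = rasch_prob(theta[:, None], b[None, :])    # expected score per response
    w = p * (1.0 - p)                             # model variance per response
    z2 = (responses - p) ** 2 / w                 # squared standardized residuals
    outfit = z2.mean(axis=0)                      # unweighted mean square
    infit = (z2 * w).sum(axis=0) / w.sum(axis=0)  # information-weighted mean square
    return infit, outfit
```

For data that fit the model, both statistics hover around 1.0; values well above 1 flag unexpected (misfitting) response patterns on an item.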
The current paper is one of two linked studies exploring the stability of one of the item banks developed by LanguageCert [Note 1]. This particular bank has been used as an adaptive test bank and comprises 820 calibrated items. It has been administered to over 13,000 test takers, each of whom has taken approximately 60 items. The purpose of these two exploratory studies is to examine the stability of this adaptive test item bank from both statistical and operational perspectives.
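In an adaptive test of the kind described above, each test taker sees only a subset of the bank because items are chosen to match the current ability estimate. The sketch below shows the core selection rule under the Rasch model, where an item's Fisher information peaks when its difficulty equals the examinee's ability. It is a simplified illustration under those assumptions, not the selection algorithm actually used by LanguageCert.

```python
import numpy as np

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def next_item(theta_hat, difficulties, administered):
    """Pick the unadministered item with maximum information at theta_hat."""
    info = item_information(theta_hat, difficulties)
    info[list(administered)] = -np.inf  # exclude items already given
    return int(np.argmax(info))
```

Because information under the Rasch model is a symmetric function of (theta - b), this rule simply selects the remaining item whose difficulty lies closest to the provisional ability estimate.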
The study compares test taker performance in the live dataset of over 13,000 test takers (each taking approximately 60 items) with a simulated ‘full’ dataset generated using model-based imputation. The regression lines for the two datasets showed a good match, and Rasch fit statistics were also good, indicating that the items comprising the adaptive item bank are of high quality in terms of both content and statistical stability. Potential future stability was confirmed by the results of a Bayesian ANOVA. As mentioned above, such item bank stability is important when item banks are used for multiple purposes: in this context, adaptive testing and the construction of linear tests. The current study therefore lays the groundwork for a follow-up study in which the utility of this adaptive test item bank is verified through the construction, administration and analysis of a number of linear tests.
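One common form of model-based imputation for sparse adaptive-test data is to fill each unadministered response with a draw from the fitted Rasch model. The sketch below illustrates that idea; it is an assumption about the general technique, not a description of the specific imputation procedure used in the study.

```python
import numpy as np

def impute_full_dataset(observed, theta, b, rng):
    """Fill missing (NaN) responses by drawing from Rasch model probabilities.

    observed: (persons x items) matrix with NaN where an item was not administered
    theta:    person ability estimates (logits)
    b:        item difficulty estimates (logits)
    """
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    draws = (rng.random(observed.shape) < p).astype(float)  # Bernoulli(p) draws
    return np.where(np.isnan(observed), draws, observed)    # keep observed scores
```

Because the imputed values are random draws, analyses are typically repeated over several imputed datasets (multiple imputation) so that conclusions do not hinge on a single realization.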
KEYWORDS: Item banks, stability, simulated dataset, Rasch, Bayesian ANOVA
CITE THIS PAPER
Tony Lee, David Coniam, Michael Milanovic, Exploring Item Bank Stability through Live and Simulated Datasets. Journal of Language Testing & Assessment (2022) Vol. 5: 13-21. DOI: http://dx.doi.org/10.23977/langta.2022.050102.
REFERENCES
Andraszewicz, S., Scheibehenne, B., Rieskamp, J., Grasman, R., Verhagen, J., & Wagenmakers, E. J. (2015). An introduction to Bayesian hypothesis testing for management research. Journal of Management, 41(2), 521-543. https://doi.org/10.1177/0149206314560412.
 Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: fundamental measurement in the human sciences (2nd ed.). Mahwah, N.J.: Erlbaum.
Choppin, B. (1968). Item bank using sample-free calibration. Nature, 219, 870-872. https://doi.org/10.1038/219870a0.
 Coniam, D., Lee, T., Milanovic, M. & Pike, N. (2021). Validating the LanguageCert Test of English scale: The adaptive test. LanguageCert: London, UK.
 Coniam, D., Lee, T., Milanovic, M. (2022). Exploring Item Bank Stability in the Creation of Multiple Test Forms. LanguageCert: London, UK.
 Derner, S., Klein, S., & Hilber, D. (2008). Assessing the Feasibility of a Test Item Bank and Assessment Clearinghouse: Strategies to Measure Technical Skill Attainment of Career and Technical Education Participants. MPR Associates, Inc.
Gao, F., & Chen, L. (2005). Bayesian or non-Bayesian: A comparison study of item parameter estimation in the three parameter logistic model. Applied Measurement in Education, 18(4), 351-380. https://doi.org/10.1207/s15324818ame1804_2.
 Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21, 1157-1164. https://doi.org/10.3758/s13423-013-0572-3
 Huisman, M., & Molenaar W. I. (2001). Imputation of missing scale data with item response models. In Boomsma, A., van Duijn, M., & Snijders, T. (Eds.). Essays on item response theory (pp. 221-244). New York: Springer-Verlag. https://doi.org/10.1007/978-1-4613-0169-1_13.
Jeffreys, H. (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
 Li, P., Stuart, E. A., & Allison, D. B. (2015). Multiple imputation: a flexible tool for handling missing data. Jama, 314(18), 1966-1967. https://doi.org/10.1001/jama.2015.15281.
 Linacre, J. M. (2012). A user's guide to WINSTEPS. Chicago, IL: Winsteps.com.
Linacre, J. M. (2018). Winsteps Rasch measurement computer program user's guide. Beaverton, OR: Winsteps.com.
 Lunz, M. & Stahl, J. (1990). Judge consistency and severity across grading periods. Evaluation and the Health Profession, 13, 425-444. https://doi.org/10.1177/016327879001300405.
 Meijer, R. R. (1996). Person-fit research: An introduction. Applied Measurement in Education, 9(1), 3-8. https://doi.org/10.1207/s15324818ame0901_2.
 Mills, C. N., & Steffen, M. (2000). The GRE computer adaptive test: Operational issues. In Computerized adaptive testing: Theory and practice (pp. 75-99). Dordrecht: Springer. https://doi.org/10.1007/0-306-47531-6_4.
 Mislevy, R., & Wu, P. (1988). Inferring examinee ability when some item responses are missing (RR-88-48-ONR). Princeton NJ: Educational Testing Service. https://doi.org/10.21236/ADA201421.
 Peugh, J. L., & Enders, C. K. (2004). Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research, 74, 525-556. https://doi.org/10.3102/00346543074004525.
 Ringle, C. M., & Sinkovics, R. R. (2009). The use of partial least squares path modeling in international marketing. Advances in International Marketing, 20, 277-319. https://doi.org/10.1108/S1474-7979(2009)0000020014.
 Roth, P. (1994). Missing data: A conceptual review for applied psychologists. Personnel Psychology, 47, 537-560. https://doi.org/10.1111/j.1744-6570.1994.tb01736.x.
 Rudner, L. M. (2009). Implementing the graduate management admission test computerized adaptive test. In Elements of adaptive testing (pp. 151-165). New York, NY: Springer. https://doi.org/10.1007/978-0-387-85461-8_8.
 Sahin, A., & Weiss, D. J. (2015). Effects of calibration sample size and item bank size on ability estimation in computerized adaptive testing. Educational Sciences: Theory & Practice, 15(6), 1585-1595.
 Schminkey, D. L., von Oertzen, T., & Bullock, L. (2016). Handling missing data with multilevel structural equation modeling and full information maximum likelihood techniques. Research in Nursing & Health, 39(4), 286-297. https://doi.org/10.1002/nur.21724.
 Voss, S., & Blumenthal, Y. (2020). Assessing the Word Recognition Skills of German Elementary Students in Silent Reading-Psychometric Properties of an Item Pool to Generate Curriculum-Based Measurements. Education Sciences, 10(2), 35. https://doi.org/10.3390/educsci10020035.
 Vriens, M., & Melton, E. (2002). Managing missing data. Marketing Research, 14(3), 12.
 Weiss, D. J., & von Minden, S. V. (2012). A comparison of item parameter estimates from Xcalibre 4.1 and Bilog-MG. St. Paul, MN: Assessment Systems Corporation.
 Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33-45. https://doi.org/10.1111/j.1745-3992.1997.tb00606.x.
Zhang, B., & Walker, C. M. (2008). Impact of missing data on person-model fit and person trait estimation. Applied Psychological Measurement, 32(6), 466-479. https://doi.org/10.1177/0146621607307692.