Item Response Theory-Based Methods for Estimating Classification Accuracy and Consistency

Authors

  • H. Diao, University of Massachusetts Amherst
  • S. G. Sireci, University of Massachusetts Amherst

Keywords

Criterion-Referenced Testing, Classification Consistency, Classification Accuracy, Item Response Theory, Psychometric Software, Reliability

Abstract

Whenever classification decisions are made on educational tests, such as pass/fail or basic/proficient/advanced, the consistency and accuracy of those decisions should be estimated and reported. Methods for estimating the reliability of classification decisions made on the basis of educational tests are well established (e.g., Rudner, 2001, 2005; Lee, 2010). However, they are not covered in most measurement textbooks, and so they are not widely known. Moreover, few practitioners are aware of freely available software that can be used to implement current methods for evaluating decision consistency and decision accuracy that are appropriate for contemporary educational assessments. In this article, we describe current methods for estimating decision consistency and decision accuracy and provide descriptions of “freeware” software packages that can estimate these statistics. Similarities and differences across these software packages are discussed. We focus on methods based on item response theory, which are particularly well suited to most 21st century assessments.
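
To make the flavor of these methods concrete, the sketch below implements Rudner's (2001, 2005) normal-approximation approach in Python. It is a minimal illustration under stated assumptions, not a reproduction of any software package reviewed in the article: the examinee ability estimates, conditional standard errors, and cut scores are hypothetical, and the consistency index shown (the probability that two parallel classifications agree) is a common companion to Rudner's accuracy index rather than part of his original papers.

    from math import erf, sqrt

    def phi(x):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def rudner_indices(theta_hat, se, cuts):
        """Expected classification accuracy and consistency (after Rudner, 2001, 2005).

        Each examinee's true ability is treated as normal with mean theta_hat
        and standard deviation se (the conditional standard error of the IRT
        ability estimate). The probability mass falling in each performance
        category is computed, then averaged over examinees.
        """
        bounds = [float("-inf")] + list(cuts) + [float("inf")]
        acc_sum, con_sum = 0.0, 0.0
        for t, s in zip(theta_hat, se):
            # Probability that the true ability lies in each category interval.
            p = [phi((bounds[k + 1] - t) / s) - phi((bounds[k] - t) / s)
                 for k in range(len(bounds) - 1)]
            obs = sum(t >= c for c in cuts)   # category of the point estimate
            acc_sum += p[obs]                 # P(true category == observed category)
            con_sum += sum(q * q for q in p)  # P(two parallel classifications agree)
        n = len(theta_hat)
        return acc_sum / n, con_sum / n

    # Hypothetical example: three examinees, two cut scores (three categories).
    accuracy, consistency = rudner_indices(
        theta_hat=[-0.8, 0.1, 1.2], se=[0.30, 0.25, 0.35], cuts=[-0.5, 0.5])
    print(f"accuracy = {accuracy:.3f}, consistency = {consistency:.3f}")

Lee's (2010) method works with summed scores rather than ability estimates: it builds the conditional summed-score distribution at each ability level via the Lord and Wingersky (1984) recursion and aggregates over the estimated ability distribution. The IRT-CLASS program and the R package cacIRT cited in the references implement that approach.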

Published

2018-09-28

How to Cite

Diao, H., & Sireci, S. G. (2018). Item Response Theory-Based Methods for Estimating Classification Accuracy and Consistency. Journal of Applied Testing Technology, 19(1), 20–25. Retrieved from http://www.jattjournal.net/index.php/atp/article/view/131016

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bourque, M. L., Goodman, D., Hambleton, R. K., & Han, N. (2004). Reliability estimates for the ABTE tests in elementary education, professional teaching knowledge, secondary mathematics and English/language arts (Final Report).

Connecticut State Department of Education. (2013). The Connecticut mastery test: Technical report.

Deng, N. (2011). Evaluating IRT- and CTT-based methods of estimating classification consistency and accuracy indices from single administrations (Unpublished doctoral dissertation). Amherst, MA: University of Massachusetts.

Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment, Research & Evaluation, 11(6). Available from http://pareonline.net/getvn.asp?v=11&n=6

Hambleton, R. K., & Novick, M. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10(3), 159-170. https://doi.org/10.1111/j.1745-3984.1973.tb00793.x

Lathrop, Q. N., & Cheng, Y. (2014). A nonparametric approach to estimate classification accuracy and consistency. Journal of Educational Measurement, 51(3), 318-334. https://doi.org/10.1111/jedm.12048

Lathrop, Q. N. (2015). Practical issues in estimating classification accuracy and consistency with R package cacIRT. Practical Assessment, Research & Evaluation, 20(18). Available from http://pareonline.net/getvn.asp?v=20&n=18

Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47(1), 1-17. https://doi.org/10.1111/j.1745-3984.2009.00096.x

Lee, W., & Kolen, M. J. (2008). IRT-CLASS: A computer program for item response theory classification consistency and accuracy (Version 2.0) [Computer software]. Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, University of Iowa. Available at http://www.education.uiowa.edu/casma

Liang, T., Han, K. T., & Hambleton, R. K. (2009). ResidPlots-2: Computer software for IRT graphical residual analyses. Applied Psychological Measurement, 33(5), 411-412. [Software package available at https://www.umass.edu/remp/main_software.html] https://doi.org/10.1177/0146621608329502

Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on test scores. Journal of Educational Measurement, 32, 179-197. https://doi.org/10.1111/j.1745-3984.1995.tb00462.x

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453-461. https://doi.org/10.1177/014662168400800409

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174. https://doi.org/10.1007/BF02296272

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176. https://doi.org/10.1177/014662169201600206

Pearson. (2012). The Puerto Rico Pruebas Puertorriqueñas de Aprovechamiento Académico (PPAA) technical manual. Austin, TX: Author.

Pearson. (2015). Technical manual for Minnesota's Title I and Title III assessments for the academic year 2014-2015. Roseville, MN: Minnesota Department of Education. Available from http://education.state.mn.us/MDE/SchSup/TestAdmin/MNTests/TechRep/

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika, 56(4), 611-630. https://doi.org/10.1007/BF02294494

Rasch, G. (1960). Probabilistic models for some intelligence and achievement tests. Copenhagen: Danish Institute for Educational Research.

Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14). Retrieved from http://PAREonline.net/getvn.asp?v=7&n=14

Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment, Research & Evaluation, 10(13). Available from http://pareonline.net/pdf/v10n13.pdf

Sireci, S. G., Baldwin, P., Martone, A., Zenisky, A. L., Kaira, L., Lam, W., Shea, C. L., Han, K. T., Deng, N., Delton, J., & Hambleton, R. K. (2008, April). Massachusetts Adult Proficiency Tests technical manual: Version 2. Amherst, MA: Center for Educational Assessment, University of Massachusetts Amherst.

Wheadon, C. (2014). Classification accuracy and consistency under item response theory models using the package classify. Journal of Statistical Software, 56(10), 1-14. https://doi.org/10.18637/jss.v056.i10
