Abstract
Natural language processing (NLP) is widely used to predict human scores for open-ended student assessment responses across content areas (Johnson et al., 2022). Ensuring that these algorithms are fair with respect to student demographic background is crucial (Madnani et al., 2017). This study presents a fairness analysis of six top-performing entries from a data challenge involving 20 NAEP reading comprehension items that were initially analyzed for fairness with respect to race/ethnicity and gender. It extends that evaluation to additional demographic factors: English Language Learner (ELL) status, Individualized Education Plans, and Free/Reduced-Price Lunch eligibility. Several items showed lower accuracy in predicted scores for these groups, particularly for ELLs. The study recommends that fairness evaluations of automated scoring consider a broader set of demographic factors and the contexts in which they interact.
Recommended Citation
Beiting-Parrish, Maggie and Whitmer, John (2023). "Lessons Learned about Evaluating Fairness from a Data Challenge to Automatically Score NAEP Reading Items," Chinese/English Journal of Educational Measurement and Evaluation | 教育测量与评估双语期刊: Vol. 4, Iss. 3, Article 5.
DOI: https://doi.org/10.59863/NKCJ9608
Available at: https://www.ce-jeme.org/journal/vol4/iss3/5