Evaluation of Open-ended Question’s Answers using Large Language Model (LLM): A Case Study of a Language Learning Center in University

Authors

Faisal Wilmar, Panca Oktavia Hadi Putra

Keywords:

MOOC, generative AI, artificial intelligence, large language model, prompt engineering technique, open-ended question, English course

Abstract

Massive open online courses (MOOCs) are a transformative tool in education, and their benefits have grown alongside the development of artificial intelligence (AI) in the form of large language models (LLMs). However, using LLMs in education raises ethical issues such as accuracy and fairness, especially when they are used for assessment or evaluation, because of the risk of bias, inaccuracy, and inconsistency in AI output. These risks can be reduced as LLM capabilities develop and through prompt engineering techniques, that is, structured ways of communicating with LLMs. This study evaluates the factors affecting the accuracy of LLM-generated evaluations of open-ended questions in an English MOOC at a university's language learning center. The findings are used to formulate recommendations for the consistent and accurate use of LLMs in evaluating open-ended questions in English courses delivered on a MOOC platform. In this quantitative quasi-experimental study, 580 participants divided into three proficiency groups answered open-ended questions. Their answers were scored by one human rater and by three LLMs, each applied with three prompt engineering techniques. The scores were analyzed using a three-way analysis of variance (ANOVA) to examine the factors influencing LLM output; the mean absolute error (MAE) between human and LLM scores was used to assess LLM accuracy; and quadratic weighted kappa was computed as an inter-rater reliability measure of rater consistency. The results show that the participant group, the LLM used, and the prompt engineering technique all influenced the assessment results, with the proficiency level of the evaluated participants having the greatest impact. The best-performing combination of LLM and prompt engineering technique, ChatGPT-4.1 with Chain-of-Thought, still should not be used for high-stakes assessments. LLMs can be used for initial assessment or as a complement to human assessment results, but should not replace human raters in high-stakes assessments.
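For readers who want to see how the two agreement metrics named in the abstract fit together, the sketch below computes MAE and quadratic weighted kappa (QWK) between a human rater's scores and an LLM's scores. This is a minimal illustration, not the authors' code: the score data, the 0-5 rubric scale, and the use of scikit-learn are assumptions for demonstration only.

```python
# Illustrative sketch (not from the paper): MAE measures the average
# absolute gap between human and LLM scores; QWK measures chance-corrected
# agreement, penalizing large disagreements quadratically more than small ones.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer rubric scores (0-5) for the same set of answers.
human_scores = np.array([4, 3, 5, 2, 4, 1, 3, 5])
llm_scores = np.array([4, 2, 5, 3, 4, 1, 2, 5])

# Mean absolute error between the two raters (lower is better).
mae = np.mean(np.abs(human_scores - llm_scores))

# Quadratic weighted kappa (1.0 = perfect agreement, 0 = chance level).
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")

print(f"MAE: {mae:.2f}")  # 0.38 for the toy data above
print(f"QWK: {qwk:.2f}")
```

In the study's design, this pair of metrics would be computed for each LLM-and-prompt-technique combination against the single human rater, which is what allows combinations such as ChatGPT-4.1 with Chain-of-Thought to be ranked.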

Published

2026-05-01

How to Cite

Faisal Wilmar, & Panca Oktavia Hadi Putra. (2026). Evaluation of Open-ended Question’s Answers using Large Language Model (LLM): A Case Study of a Language Learning Center in University. Jurnal Sistem Informasi, 22(1). Retrieved from https://jsi.cs.ui.ac.id/index.php/jsi/article/view/1550