Evaluation of Answers to Open-Ended Questions Using a Large Language Model (LLM): A Case Study of a University Language Learning Center
Keywords: MOOC, generative AI, artificial intelligence, large language model, prompt engineering technique, open-ended question, English course

Abstract
Massive open online courses (MOOCs) are a transformative tool in education, and their benefits are growing alongside the development of artificial intelligence (AI) in the form of large language models (LLMs). However, using LLMs in education raises ethical issues, particularly around accuracy and fairness when LLMs are used for assessment or evaluation, because AI output carries risks of bias, inaccuracy, and inconsistency. These risks can be reduced as LLM capabilities improve and through prompt engineering techniques, i.e., structured ways of instructing an LLM. This study evaluates the factors affecting the accuracy of LLM-generated evaluations of answers to open-ended questions in an English MOOC at a university language learning center. The findings inform recommendations for the consistent and accurate use of LLMs in evaluating open-ended questions in English courses delivered on a MOOC platform. This quantitative quasi-experimental study involved 580 participants, divided into three proficiency groups, who answered open-ended questions. Each answer was evaluated by one human rater and by three LLMs using three prompt engineering techniques. The assessment results were analyzed using a three-way analysis of variance (ANOVA) to examine the factors influencing the LLM output. The mean absolute error (MAE) between the human and LLM scores was used to assess LLM accuracy, and quadratic weighted kappa was computed as an inter-rater reliability measure of scoring consistency. The results showed that the participant group, the LLM model, and the prompt engineering technique all influenced the assessment results, with the proficiency level of the evaluated participants having the greatest impact.
The combination of ChatGPT-4.1 with the Chain-of-Thought prompting technique produced the best results, but it should not be used for high-stakes assessments. LLMs can serve for initial assessment or as a complement to human scoring, without replacing human raters in high-stakes assessments.
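The two agreement metrics named in the abstract, MAE and quadratic weighted kappa, can be sketched in plain Python. This is an illustrative implementation of the standard formulas, not the authors' code, and the example scores are hypothetical, not data from the study.

```python
def mean_absolute_error(human, llm):
    """Mean absolute difference between paired human and LLM scores."""
    return sum(abs(h, ) if False else abs(h - l) for h, l in zip(human, llm)) / len(human)

def quadratic_weighted_kappa(human, llm, min_rating, max_rating):
    """Cohen's kappa with quadratic disagreement weights w_ij = (i-j)^2 / (n-1)^2."""
    n = max_rating - min_rating + 1
    total = len(human)
    # Observed rating matrix: rows = human rating, columns = LLM rating
    observed = [[0.0] * n for _ in range(n)]
    for h, l in zip(human, llm):
        observed[h - min_rating][l - min_rating] += 1
    # Marginal histograms define the chance-expected matrix
    hist_h = [sum(row) for row in observed]
    hist_l = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_l[j] / total
    return 1.0 - num / den

# Hypothetical scores on a 1-4 scale, one minor disagreement
human_scores = [1, 2, 3, 4]
llm_scores   = [2, 2, 3, 4]
print(mean_absolute_error(human_scores, llm_scores))           # 0.25
print(quadratic_weighted_kappa(human_scores, llm_scores, 1, 4))  # ~0.875
```

Quadratic weighting penalizes large rating disagreements more heavily than adjacent ones, which is why it is the conventional reliability measure for ordinal rubric scores.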
License
Copyright (c) 2026 Jurnal Sistem Informasi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).