This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences, labeled with the six cognitive categories (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation), was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (e.g., synonym replacement and word embeddings). Among the traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94% accuracy, recall, and F1 score with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially resisted it but began to show signs of overfitting as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.
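The synonym-replacement augmentation credited with the SVM's strong result can be sketched as follows. This is a minimal illustration only: the abstract does not specify the synonym source or replacement policy, so the tiny synonym map and the function name `synonym_replace` here are assumptions, not the paper's implementation.

```python
import random

# Illustrative synonym map keyed on Bloom-style verbs; a real pipeline would
# typically draw synonyms from a lexical resource such as WordNet. This
# hardcoded table is a stand-in, since the paper does not detail its source.
SYNONYMS = {
    "define": ["state", "describe"],
    "compare": ["contrast", "differentiate"],
    "design": ["construct", "devise"],
}

def synonym_replace(sentence: str, n: int = 1, rng: random.Random = None) -> str:
    """Return a copy of `sentence` with up to `n` words swapped for synonyms."""
    rng = rng or random.Random(0)  # fixed seed for reproducible augmentation
    words = sentence.split()
    # Only words present in the synonym map are candidates for replacement.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

# Each original labeled sentence yields extra training variants with the
# same Bloom category, enlarging the 600-sentence dataset.
augmented = synonym_replace("Define the main stages of mitosis")
```

Augmented variants keep the original label, so the training set grows without new annotation effort, which is one plausible reason the augmented SVM generalized better than the deep models on this small corpus.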