Observation of classroom interactions can provide concrete feedback to teachers, but current methods rely on manual annotation, which is resource-intensive and hard to scale. This work explores AI-driven analysis of classroom recordings, focusing on multimodal instructional activity and discourse recognition as a foundation for actionable feedback. Using a densely annotated dataset of 164 hours of video and 68 lesson transcripts, we design parallel, modality-specific pipelines. For video, we evaluate zero-shot multimodal LLMs, fine-tuned vision-language models, and self-supervised video transformers on 24 activity labels. For transcripts, we fine-tune a transformer-based classifier with contextualized inputs and compare it against prompting-based LLMs on 19 discourse labels. To handle class imbalance and multi-label complexity, we apply per-label thresholding, context windows, and imbalance-aware loss functions. The results show that fine-tuned models consistently outperform prompting-based approaches, achieving macro-F1 scores of 0.577 for video and 0.460 for transcripts. These results demonstrate the feasibility of automated classroom analysis and establish a foundation for scalable teacher feedback systems.
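As an illustration of the evaluation setup described above, the following minimal sketch shows how per-label thresholding and macro-F1 scoring might be implemented for a multi-label classifier; it is not the authors' code, and the function and variable names (tune_per_label_thresholds, val_probs, etc.) are hypothetical, assuming the model outputs per-label probabilities and thresholds are tuned on a validation split.

```python
# Illustrative sketch only: per-label threshold tuning for multi-label
# classification, followed by macro-F1 evaluation as reported in the abstract.
# All names are hypothetical; assumes probabilities and binary label matrices.
import numpy as np
from sklearn.metrics import f1_score

def tune_per_label_thresholds(val_probs, val_labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick, for each label, the decision threshold that maximizes F1 on validation data."""
    n_labels = val_labels.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        f1s = [f1_score(val_labels[:, j], (val_probs[:, j] >= t).astype(int), zero_division=0)
               for t in grid]
        thresholds[j] = grid[int(np.argmax(f1s))]
    return thresholds

def macro_f1(test_probs, test_labels, thresholds):
    """Apply the tuned per-label thresholds, then average F1 across labels (macro-F1)."""
    preds = (test_probs >= thresholds).astype(int)
    return f1_score(test_labels, preds, average="macro", zero_division=0)
```

Per-label thresholds of this kind are one common way to handle class imbalance in multi-label settings, complementing imbalance-aware loss functions during training.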