We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various potentially useful applications, no established benchmark built on real-world data exists. Early studies of AMR trained models solely on synthetic datasets and evaluated them on an annotated set of fewer than 100 samples, which makes the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the train, validation, and test splits, respectively, 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperforms a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at https://h-munakata.github.io/CASTELLA-demo/.
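For readers unfamiliar with the metric, Recall1@0.7 follows the common moment-retrieval convention: the fraction of queries whose top-1 predicted moment overlaps the ground-truth moment with temporal IoU of at least 0.7. The sketch below illustrates this computation; the helper names (temporal_iou, recall1_at_iou) are hypothetical, it assumes a single annotated moment per query, and it is not the benchmark's official evaluation code.

```python
def temporal_iou(pred, gt):
    """Intersection-over-Union between two [start, end] moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall1_at_iou(predictions, ground_truths, threshold=0.7):
    """Recall1@threshold: fraction of queries whose top-1 predicted moment
    reaches IoU >= threshold with the ground-truth moment."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Toy example with two queries, one top-1 predicted moment each.
preds = [(3.0, 9.0), (12.0, 20.0)]   # top-1 moment per query
gts   = [(2.5, 9.5), (15.0, 30.0)]   # annotated ground-truth moments
print(recall1_at_iou(preds, gts))    # 0.5: only the first has IoU >= 0.7
```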