Bioacoustic data from passive acoustic monitoring (PAM) pose a unique set of challenges for classification: complete and reliable labels are scarce due to annotation uncertainty, cetacean vocalizations are biologically complex and heterogeneous in duration, and target sounds are masked by environmental and anthropogenic noise. As a result, data are often weakly labelled, with annotations indicating only the presence or absence of a species over several minutes. To capture the complex temporal patterns and key features of long continuous audio segments, we propose an interdisciplinary framework comprising dataset standardisation, feature extraction via Variational Autoencoders (VAEs), and classification via Temporal Convolutional Networks (TCNs). This approach removes the need for manual threshold setting or time-consuming strong labelling. To demonstrate its effectiveness, we use sperm whale (<i>Physeter macrocephalus</i>) click trains in 4-minute recordings as a case study, drawing on a dataset that spans diverse sources and deployment conditions to maximise generalisability. The value of VAE feature extraction is demonstrated by comparing classification performance against the traditional, explainable approach of expert handpicking of features. The TCN achieved robust classification, with AUC scores exceeding 0.9.
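The core operation that lets a TCN cover minutes-long input sequences is the dilated causal convolution: each output at time t depends only on inputs at t and earlier, and stacking layers with dilations 1, 2, 4, … grows the receptive field exponentially. The sketch below is a minimal, illustrative numpy implementation of a single such convolution, not the authors' implementation; the function name and kernel values are hypothetical.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1D convolution (illustrative sketch).

    Output at time t mixes inputs at t, t-d, t-2d, ... for dilation d,
    with zero left-padding so the output has the same length as x.
    """
    k = len(w)
    pad = (k - 1) * dilation                      # left padding preserves causality
    xp = np.concatenate([np.zeros(pad), x])       # zeros stand in for "before the recording"
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

# A kernel of ones with dilation 2 simply adds each sample to the one
# two steps earlier; early outputs see only zero padding.
x = np.arange(8, dtype=float)
y = causal_dilated_conv(x, np.array([1.0, 1.0]), dilation=2)
# y[t] = x[t] + x[t-2]  →  [0, 1, 2, 4, 6, 8, 10, 12]
```

In a full TCN these layers are stacked with residual connections and increasing dilation, so a network only a few layers deep can relate clicks separated by many seconds.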