This paper proposes using both audio input and subject information to predict a listener's personalized preference between two audio segments that have the same content but differ in quality. A Siamese network compares the two inputs and predicts the preference. Several structures for each side of the Siamese network are investigated. Among them, an LDNet with PANNs' CNN6 as the encoder and a multi-layer perceptron block as the decoder yields the largest improvement over a baseline model that uses only audio input, raising overall accuracy from 77.56% to 78.04%. Experimental results also show that using all of the subject information, including age, gender, and the specifications of the headphones or earphones, is more effective than using only part of it.
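The comparison described above can be illustrated with a minimal sketch. It is not the paper's implementation: the small random-weight MLP encoder is only a stand-in for PANNs' CNN6, the numeric encoding of the subject information is hypothetical, and turning the two branch scores into a preference with a sigmoid of their difference is an assumption made here for illustration. All function names are invented for this example.

```python
import math
import random


def make_mlp(sizes, seed=0):
    """Create random weights for a small MLP; sizes = [in, hidden, ..., out]."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(layers, x):
    """Apply the MLP, with ReLU on every layer except the last."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def branch_score(encoder, decoder, audio_feats, subject_feats):
    """One side of the Siamese network: encode the audio, append the
    subject information, and decode to a single scalar score."""
    embedding = mlp_forward(encoder, audio_feats)
    return mlp_forward(decoder, embedding + subject_feats)[0]


def preference(encoder, decoder, audio_a, audio_b, subject_feats):
    """Probability that the subject prefers segment A over segment B.
    Both sides share the same encoder and decoder weights.
    Assumption: sigmoid of the score difference."""
    s_a = branch_score(encoder, decoder, audio_a, subject_feats)
    s_b = branch_score(encoder, decoder, audio_b, subject_feats)
    return 1.0 / (1.0 + math.exp(-(s_a - s_b)))


# Hypothetical inputs: 4 audio features per segment, 3 subject features
# (for example scaled age, gender flag, and a headphone specification).
encoder = make_mlp([4, 8, 3], seed=1)
decoder = make_mlp([3 + 3, 8, 1], seed=2)
segment_a = [0.9, 0.1, 0.4, 0.7]
segment_b = [0.2, 0.8, 0.3, 0.1]
subject = [0.3, 1.0, 0.5]

p_ab = preference(encoder, decoder, segment_a, segment_b, subject)
p_ba = preference(encoder, decoder, segment_b, segment_a, subject)
print(f"P(A preferred) = {p_ab:.3f}, P(B preferred) = {p_ba:.3f}")
```

Because both sides share weights, swapping the two segments gives the complementary probability, and comparing a segment with itself gives exactly 0.5. In this sketch, a model using only audio input would correspond to dropping `subject_feats` from the decoder input.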