Deep generative models have recently achieved impressive performance in speech synthesis and music generation. However, compared to the generation of those domain-specific sounds, the generation of general sounds (such as car horns, dog barking, and gun shots) has received less attention, despite their wide range of potential applications. In our previous work, sounds were generated in the time domain using SampleRNN. However, it is difficult to capture long-range dependencies within sound recordings using this method. In this work, we propose to generate sounds conditioned on sound classes via neural discrete time-frequency representation learning. This offers an advantage in modelling long-range dependencies and retaining local fine-grained structure within a sound clip. We evaluate our proposed approach on the UrbanSound8K dataset, compared against a SampleRNN baseline, using performance metrics that measure the quality and diversity of the generated sound samples. Experimental results show that our proposed method offers significantly better performance in diversity and comparable performance in quality, as compared to the baseline method.
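To make the "neural discrete time-frequency representation learning" referred to above more concrete, the sketch below shows a minimal VQ-VAE-style vector-quantization layer applied to continuous encoder features of a spectrogram: each feature vector is snapped to its nearest codebook entry, yielding the discrete codes over which a conditional generative model can operate. This is an illustrative sketch only; the class name, codebook size, feature dimensions, and the encoder producing `z_e` are assumptions and do not reflect the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal vector-quantization layer (VQ-VAE style): maps each encoder
    output vector to its nearest codebook entry, producing discrete codes."""
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):
        # z_e: (batch, time, code_dim) continuous features from a
        # (hypothetical) spectrogram encoder
        flat = z_e.reshape(-1, z_e.shape[-1])                       # (B*T, D)
        # Squared Euclidean distance from each feature to every codebook vector
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))              # (B*T, K)
        indices = dists.argmin(dim=1)                               # discrete codes
        z_q = self.codebook(indices).view_as(z_e)                   # quantized vectors
        # Straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1])

# Usage: quantize (dummy) encoder features of a mel-spectrogram clip
quantizer = VectorQuantizer()
z_e = torch.randn(8, 100, 64)         # hypothetical encoder output
z_q, codes = quantizer(z_e)
print(z_q.shape, codes.shape)         # (8, 100, 64) and (8, 100)
```

Modelling the resulting code sequence (rather than raw waveform samples, as in SampleRNN) is what allows long-range dependencies to be captured while fine local structure is preserved by the decoder.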