SkipGram with negative sampling, or SGN for short, defines an elegant family of word embedding models. In this paper, we formulate a framework for word embedding, referred to as Word-Context Classification (WCC), that generalizes SGN to a wide family of models. The framework, which relies on ``noise examples'', is justified through theoretical analysis. We study experimentally the impact of the noise distribution on the learning of WCC embedding models; the results suggest that the best noise distribution is in fact the data distribution, in terms of both embedding performance and speed of convergence during training. Along the way, we discover several novel embedding models that outperform existing WCC models.
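For concreteness, a minimal sketch of the standard SGN objective that WCC generalizes (following the usual formulation of Mikolov et al.; the notation below is illustrative and not necessarily that of this paper): for a word $w$ with observed context $c$, and $k$ noise contexts drawn from a noise distribution $P_n$, SGN maximizes

\[
\ell(w, c) \;=\; \log \sigma\!\left(\mathbf{v}_c^{\top}\mathbf{u}_w\right) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\!\left[\log \sigma\!\left(-\mathbf{v}_{c_i}^{\top}\mathbf{u}_w\right)\right],
\]

where $\sigma$ is the logistic sigmoid and $\mathbf{u}_w$, $\mathbf{v}_c$ are the word and context embeddings. Read as a word-context classification problem, the observed pair $(w, c)$ serves as a positive example and the noise pairs as negatives; the experimental finding summarized above concerns the choice of $P_n$ in this setup.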