Sentence embedding models aim to provide general-purpose embeddings for sentences. Most of the models studied in this paper claim to perform well on STS tasks, but they do not report on their suitability for clustering. This paper looks at four recent sentence embedding models: Universal Sentence Encoder (Cer et al., 2018), Sentence-BERT (Reimers and Gurevych, 2019), LASER (Artetxe and Schwenk, 2019), and DeCLUTR (Giorgi et al., 2020). It gives a brief overview of the ideas behind their implementations. It then investigates how well topic classes in two text classification datasets (Amazon Reviews (Ni et al., 2019) and News Category Dataset (Misra, 2018)) map to clusters in their corresponding sentence embedding spaces. While the performance of the resulting classification model is far from perfect, it is better than random. This is interesting because the classification model has been constructed in an unsupervised way. The topic classes in these real-life topic classification datasets can thus be partly reconstructed by clustering the corresponding sentence embeddings.
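The evaluation idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the embeddings here are synthetic Gaussian stand-ins (a real run would embed sentences with one of the four models, e.g. Sentence-BERT), and names such as `cluster_to_class` are our own. Clusters found without labels are mapped to topic classes by majority vote, and the resulting accuracy is compared against the random baseline.

```python
# Sketch: cluster "sentence embeddings" without labels, map each cluster to a
# topic class by majority vote, and compare accuracy against random guessing.
# Embeddings are synthetic blobs standing in for a real embedding model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_classes, per_class, dim = 3, 100, 16

# One Gaussian blob per topic class, playing the role of an embedding space.
centers = rng.normal(size=(n_classes, dim)) * 3.0
X = np.vstack([centers[c] + rng.normal(size=(per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), per_class)

# Unsupervised step: cluster the embeddings, ignoring the true labels.
clusters = KMeans(n_clusters=n_classes, n_init=10, random_state=0).fit_predict(X)

# Majority vote: each cluster is assigned the most frequent true class in it.
cluster_to_class = {k: np.bincount(y[clusters == k]).argmax()
                    for k in np.unique(clusters)}
pred = np.array([cluster_to_class[k] for k in clusters])

accuracy = (pred == y).mean()
print(f"cluster-based accuracy: {accuracy:.2f} "
      f"(random baseline: {1 / n_classes:.2f})")
```

If the topic classes form (even loose) clusters in the embedding space, this unsupervised classifier beats the random baseline, which is exactly the effect the paper measures on real datasets.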