Building a benchmark dataset for hate speech detection presents several challenges. Firstly, hate speech is relatively rare -- e.g., less than 3\% of Twitter posts are hateful \citep{founta2018large} -- so randomly sampling tweets for annotation is an inefficient way to capture it. A common practice is to annotate only tweets containing known ``hate words'', but this risks yielding a biased benchmark that only partially captures the real-world phenomenon of interest. A second challenge is that definitions of hate speech tend to be highly variable and subjective. Annotators with diverse prior notions of hate speech may not only disagree with one another but also struggle to conform to specified labeling guidelines.

Our key insight is that the rarity and subjectivity of hate speech are akin to those of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections might also be usefully applied to create better benchmark datasets for hate speech detection. Firstly, to intelligently and efficiently select which tweets to annotate, we apply the established IR techniques of {\em pooling} and {\em active learning} (sketched below). Secondly, to improve both the consistency and the value of annotations, we apply {\em task decomposition} \citep{Zhang-sigir14} and {\em annotator rationale} \citep{mcdonnell16-hcomp} techniques.

Using the above techniques, we create and share a new benchmark dataset\footnote{We will release the dataset upon publication.} for hate speech detection with broader coverage than prior datasets. We also show a dramatic drop in the accuracy of existing detection models when tested on these broader forms of hate. The collected annotator rationales not only provide documented support for labeling decisions but also open exciting opportunities for future work on dual supervision and/or explanation generation in modeling.
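To make the selection strategies concrete, the following minimal Python sketch shows how pooling and uncertainty-based active learning can be combined to choose tweets for annotation. It is illustrative only: the function names, pool depth, and batch size are hypothetical and simplified rather than our exact implementation.

\begin{verbatim}
from typing import Callable

def pool_candidates(system_rankings: list[list[str]],
                    depth: int = 100) -> list[str]:
    """Pooling: union of the top-`depth` tweets returned by several
    retrieval systems (e.g., different hate-keyword queries)."""
    pooled: dict[str, None] = {}  # insertion-ordered dict used as a set
    for ranking in system_rankings:
        for tweet_id in ranking[:depth]:
            pooled.setdefault(tweet_id)
    return list(pooled)

def uncertainty_sample(candidates: list[str],
                       score: Callable[[str], float],
                       batch_size: int = 50) -> list[str]:
    """Active learning (uncertainty sampling): pick the tweets whose
    current classifier probability of being hateful is nearest 0.5."""
    return sorted(candidates, key=lambda t: abs(score(t) - 0.5))[:batch_size]

# Each annotation round: pool candidates, label the most uncertain batch,
# retrain the classifier on all labels collected so far, then repeat.
\end{verbatim}

In such a loop, each retraining step shifts subsequent batches toward harder and rarer examples, which is what makes annotation far more efficient than random sampling when the positive class is below 3\% of the data.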