ToxSyn: Reducing Bias in Hate Speech Detection via Synthetic Minority Data in Brazilian Portuguese

The development of robust hate speech detection systems remains limited by the lack of large-scale, fine-grained training data, especially for languages other than English. Existing corpora typically rely on coarse toxic/non-toxic labels, and the few that capture hate directed at specific minority groups lack the non-toxic counterexamples (i.e., benign text about minorities) needed to distinguish genuine hate from mere discussion. We introduce ToxSyn, the first large-scale Portuguese corpus explicitly designed for multi-label hate speech detection across nine protected minority groups. Generated via a controllable four-stage pipeline, ToxSyn includes discourse-type annotations that capture the rhetorical strategies of toxic language, such as sarcasm or dehumanization. Crucially, it systematically includes the non-toxic counterexamples absent from all other public datasets. Our experiments reveal a catastrophic, mutual generalization failure between social-media domains and ToxSyn: models trained on social media struggle to generalize to minority-specific contexts, and vice versa. This finding indicates that the two are distinct tasks and shows that summary metrics such as Macro F1 can be unreliable indicators of true model behavior, because they can completely mask model failure. We publicly release ToxSyn on Hugging Face to foster reproducible research on synthetic data generation and to benchmark progress in hate speech detection for low- and mid-resource languages.
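The point about summary metrics can be illustrated with a minimal sketch. It is not the paper's evaluation code: the nine label names (`group_1` to `group_9`) and the toy predictions are hypothetical, chosen only to show how a macro-averaged F1 over nine labels can stay high while one label fails completely.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def per_label_f1(true_by_label, pred_by_label):
    """F1 computed separately for each label of a multi-label task."""
    return {
        label: f1_score(true_by_label[label], pred_by_label[label])
        for label in true_by_label
    }


def macro_f1(scores):
    """Unweighted mean of the per-label F1 scores."""
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: nine labels, six examples each.
labels = [f"group_{i}" for i in range(1, 10)]
gold = [1, 0, 1, 0, 1, 0]

true_by_label = {label: list(gold) for label in labels}
# Eight labels are predicted perfectly ...
pred_by_label = {label: list(gold) for label in labels}
# ... while the model never fires for the ninth label.
pred_by_label["group_9"] = [0, 0, 0, 0, 0, 0]

scores = per_label_f1(true_by_label, pred_by_label)
macro = macro_f1(scores)

print(f"Macro F1: {macro:.3f}")          # 8/9, roughly 0.889
print(f"group_9 F1: {scores['group_9']:.3f}")  # 0.000
```

In this toy setting the aggregate looks healthy even though the model detects nothing for one group, which is why a per-label breakdown is worth reporting alongside any macro average.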
