Can Large Language Models Detect Misinformation in Scientific News Reporting?

Scientific facts are often spun in the popular press with the intent to influence public opinion and action, as was evident during the COVID-19 pandemic. Automatic detection of misinformation in the scientific domain is still in its infancy, and it is challenging because scientific literature and popular news reporting are written in distinct styles. Most research on the validity of scientific reporting treats the problem as claim verification, which requires significant expert human effort to generate appropriate claims. Our solution bypasses this step and addresses a more realistic scenario in which such explicit, labeled claims may not be available. The central research question of this paper is whether large language models (LLMs) can be used to detect misinformation in scientific reporting. To this end, we first present a new labeled dataset, SciNews, containing 2.4k scientific news stories drawn from trusted and untrustworthy sources, paired with related abstracts from the CORD-19 database. The dataset includes both human-written and LLM-generated news articles, so it also captures the growing trend of using LLMs to generate popular press articles. We then identify dimensions of scientific validity in science news articles and explore how these can be integrated into the automated detection of scientific misinformation. We propose several baseline architectures that use LLMs to automatically detect false representations of scientific findings in the popular press. For each architecture, we apply several prompt engineering strategies, including zero-shot, few-shot, and chain-of-thought prompting. We test these architectures and prompting strategies on GPT-3.5, GPT-4, Llama2-7B, and Llama2-13B.
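To make the three prompting strategies concrete, the following is a minimal sketch of how zero-shot, few-shot, and chain-of-thought prompts might be assembled for a news story paired with an abstract. It is an illustration under stated assumptions, not the prompts used in this work: the task wording, the label names `reliable` and `unreliable`, and the helpers `Example`, `zero_shot`, `few_shot`, and `chain_of_thought` are all hypothetical, and no model is called.

```python
from dataclasses import dataclass

# Hypothetical label names; the abstract does not give the exact label wording.
LABELS = ("reliable", "unreliable")

# Hypothetical task instruction shared by all three strategies.
TASK = (
    "Decide whether the news story faithfully represents the findings in the "
    "paired scientific abstract. Answer with one word: "
    + " or ".join(LABELS) + "."
)


@dataclass
class Example:
    news: str        # text of the news story
    abstract: str    # paired scientific abstract
    label: str = ""  # gold label for demonstrations; empty for the query


def _render(ex: Example, cue: str = "Answer:") -> str:
    """Format one abstract/news pair, ending with a cue for the model."""
    return f"Abstract: {ex.abstract}\nNews story: {ex.news}\n{cue}"


def zero_shot(query: Example) -> str:
    """Task instruction plus the query, with no demonstrations."""
    return f"{TASK}\n\n{_render(query)}"


def few_shot(query: Example, demos: list[Example]) -> str:
    """Task instruction, labeled demonstrations, then the query."""
    shots = "\n\n".join(f"{_render(d)} {d.label}" for d in demos)
    return f"{TASK}\n\n{shots}\n\n{_render(query)}"


def chain_of_thought(query: Example) -> str:
    """Ask for step-by-step reasoning before the final one-word answer."""
    instruction = (
        f"{TASK}\nFirst explain your reasoning step by step, "
        "then give the final answer on a new line."
    )
    return f"{instruction}\n\n{_render(query, cue='Reasoning:')}"
```

Each function returns a plain string, so the same prompt can be sent to any of the models under comparison through whatever client is in use; only the prompt construction differs between strategies.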
