LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models


May 25, 2023


Patrik Puchert, Poonam Poonam, Christian van Onzenoodt, Timo Ropinski

Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks. Unfortunately, they are prone to hallucinations, where the model presents incorrect or false information in its responses, which makes diligent evaluation approaches mandatory. While LLM performance in specific knowledge fields is often evaluated on question and answer (Q&A) datasets, such evaluations usually report only a single accuracy number for the entire field, a procedure that is problematic with respect to transparency and model improvement. A stratified evaluation could instead reveal subfields where hallucinations are more likely to occur, and thus help to better assess LLMs' risks and guide their further development. To support such stratified evaluations, we propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities in different subfields by transforming Q&A datasets as well as LLM responses into our internal knowledge structure. Furthermore, an extension for comparative visualization allows for the detailed comparison of multiple LLMs. To assess LLMMaps, we use them to conduct a comparative analysis of several state-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT and LLaMa-13B, and we conduct two qualitative user evaluations. All necessary source code and data for generating LLMMaps to be used in scientific publications and elsewhere will be available on GitHub.
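The contrast between a single accuracy number and a stratified evaluation can be illustrated with a minimal sketch. This is not the LLMMaps implementation or its knowledge structure; the function name `stratified_accuracy`, the subfield labels, and the graded answers below are all hypothetical and chosen only to show how a per-subfield breakdown can expose a weakness that the aggregate figure hides.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Return (overall_accuracy, {subfield: accuracy}) for graded Q&A records.

    `records` is an iterable of (subfield, is_correct) pairs. This layout is an
    assumption made for illustration, not a format defined by the paper.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subfield, is_correct in records:
        totals[subfield] += 1
        correct[subfield] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_subfield = {s: correct[s] / totals[s] for s in totals}
    return overall, per_subfield


# Hypothetical graded answers: (subfield, answered correctly?)
records = [
    ("anatomy", True), ("anatomy", True), ("anatomy", True), ("anatomy", True),
    ("pharmacology", True), ("pharmacology", False),
    ("pharmacology", False), ("pharmacology", False),
]

overall, per_subfield = stratified_accuracy(records)
print(overall)       # 0.625
print(per_subfield)  # {'anatomy': 1.0, 'pharmacology': 0.25}
```

In this made-up data the overall accuracy of 0.625 gives no hint that one subfield is answered perfectly while the other is mostly wrong, which is the kind of gap a stratified view is meant to surface.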
