This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The framework applies a CBF-based safety filter to the token predicted by the baseline LLM, intervening in the generated text when necessary. The safety filter offers two key advantages: it is an add-on, so it can be used for alignment without fine-tuning the baseline LLM; and if an evaluation model for the desired alignment exists, it can be applied directly to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
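The filtering idea can be sketched as follows. This is a minimal toy illustration, not the paper's actual method: it assumes a discrete-time CBF condition h(x_{k+1}) >= (1 - gamma) * h(x_k), where the barrier value of each candidate token comes from a hypothetical evaluation model (here, hard-coded positivity scores). The filter keeps the baseline LLM's preferred token whenever it satisfies the condition, and otherwise substitutes the most-preferred token that does, which captures the "minimal intervention" spirit of CBF safety filters.

```python
def cbf_filter(logits, scores, h_prev, gamma=0.5):
    """Select the baseline-preferred token that satisfies a toy
    discrete-time CBF condition: score >= (1 - gamma) * h_prev.
    Falls back to the highest-scoring token if none qualifies.

    logits : baseline LLM preferences per token (higher = preferred)
    scores : barrier values per token from a hypothetical evaluator
    h_prev : barrier value of the current text state
    """
    threshold = (1.0 - gamma) * h_prev
    # Iterate over tokens in order of baseline preference.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    for tok in order:
        if scores[tok] >= threshold:  # condition met: no further intervention
            return tok
    # No token is safe enough: pick the safest one available.
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy vocabulary of 3 tokens with hypothetical positivity scores.
logits = [3.0, 2.0, 1.0]    # the baseline LLM prefers token 0
scores = [-1.0, 0.8, 0.2]   # but token 0 scores as "negative" text
print(cbf_filter(logits, scores, h_prev=0.6))  # -> 1 (filter intervenes)
```

Because the filter only reorders the choice among existing candidates, it can wrap any baseline model as an add-on, which mirrors the fine-tuning-free property claimed in the abstract.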