Large language models (LLMs) increasingly serve as human-like decision-making agents in social science and applied settings. These LLM agents are typically assigned human-like characters and placed in real-life contexts, yet how those characters and contexts shape an LLM's behavior remains underexplored. This study proposes and tests methods for probing, quantifying, and modifying an LLM's internal representations in a Dictator Game -- a classic behavioral experiment on fairness and prosocial behavior. We extract "vectors of variable variations" (e.g., "male" to "female") from the LLM's internal state, and show that manipulating these vectors during inference can substantially alter how those variables relate to the model's decisions. This approach offers a principled way to study and regulate how social concepts are encoded and engineered within transformer-based models, with implications for alignment, debiasing, and the design of AI agents for social simulation in academic and commercial applications, as well as for strengthening sociological theory and measurement.
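As a hedged illustration of the mechanism the abstract describes, the sketch below extracts a difference vector between two persona prompts from an intermediate hidden state and adds it back during generation via a forward hook. It assumes a HuggingFace-style causal LM; the model name ("gpt2"), layer index, persona prompts, and scaling factor are placeholder assumptions for exposition, not the study's actual configuration.

```python
# A minimal sketch of persona-vector extraction and steering.
# Model, layer, prompts, and scale are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model for the sketch
LAYER = 6            # hypothetical transformer block to probe and steer
ALPHA = 4.0          # steering strength; a tunable assumption

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token after block LAYER.
    hidden_states[0] is the embedding output, so block LAYER's
    output sits at index LAYER + 1."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER + 1][0, -1, :]

# "Vector of variable variation": the difference between two persona
# encodings, here a hypothetical "male" -> "female" direction.
v = (last_token_state("You are a female dictator deciding how to split $100.")
     - last_token_state("You are a male dictator deciding how to split $100."))

def steer(module, inputs, output):
    # Add the variation vector to the block's residual-stream output
    # at every position; later blocks then see the shifted activations.
    if isinstance(output, tuple):
        return (output[0] + ALPHA * v,) + output[1:]
    return output + ALPHA * v

# .transformer.h is GPT-2's block list; other architectures name it differently.
handle = model.transformer.h[LAYER].register_forward_hook(steer)
try:
    prompt = "You are a dictator with $100. You give the other player $"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Steering a single middle layer with a scalar multiple of the difference vector is one common choice in the activation-steering literature; in practice the layer and strength would be swept empirically against the behavioral outcome of interest.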