Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies, such as byte-level representations and calendar tokens, have been proposed, yet the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents the first empirical study of temporal tokenization for event sequences, comparing five distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, with log-based strategies excelling on skewed distributions and human-centric formats proving robust for mixed modalities.
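To make the binning comparison concrete, the sketch below (a hypothetical illustration, not the paper's implementation) tokenizes inter-event gaps drawn from a log-normal distribution using uniform versus log-spaced bin edges; the bin names and helper functions are assumptions for exposition only.

```python
# Hypothetical sketch: discretizing inter-event times (in seconds) into
# token ids via uniform vs. log-spaced bins, illustrating why log-based
# strategies suit skewed (e.g. log-normal) gap distributions.
import numpy as np

def make_bins(gaps, n_bins=8, log=False):
    """Bin edges spanning the observed gaps; log=True spaces them geometrically."""
    lo, hi = gaps.min(), gaps.max()
    if log:
        return np.geomspace(lo, hi, n_bins + 1)
    return np.linspace(lo, hi, n_bins + 1)

def tokenize(gaps, edges):
    """Map each gap to a discrete token id (its bin index)."""
    return np.clip(np.digitize(gaps, edges) - 1, 0, len(edges) - 2)

rng = np.random.default_rng(0)
gaps = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)  # heavily skewed gaps

uni_tokens = tokenize(gaps, make_bins(gaps, log=False))
log_tokens = tokenize(gaps, make_bins(gaps, log=True))

# Uniform bins dump most gaps into the first bin and waste the rest on the
# long tail; log-spaced bins spread token usage far more evenly.
print("most common uniform bin holds:",
      np.bincount(uni_tokens).max() / len(gaps))
print("most common log bin holds:",
      np.bincount(log_tokens).max() / len(gaps))
```

On skewed data the uniform scheme collapses almost all events onto a single token, destroying temporal resolution exactly where most of the probability mass lives, which is the intuition behind the abstract's finding that log-based strategies excel on skewed distributions.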