Deploying Large Language Model (LLM) services at the edge benefits latency-sensitive and privacy-aware applications. However, the stateless nature of LLM inference makes managing user context (e.g., sessions, preferences) across geo-distributed edge nodes challenging. Existing solutions, such as client-side context storage, often introduce network latency and bandwidth overhead, undermining the advantages of edge deployment. We propose DisCEdge, a distributed context management system that stores and replicates user context in tokenized form across edge nodes. By maintaining context as token sequences rather than raw text, our system avoids redundant computation and enables efficient data replication. We implement and evaluate an open-source prototype in a realistic edge environment with commodity hardware. We show that DisCEdge reduces median response times by up to 14.46% and lowers median inter-node synchronization overhead by up to 15% compared to a raw-text-based system. It also reduces client request sizes by a median of 90% compared to client-side context management, while guaranteeing data consistency.