Chipmink: Efficient Delta Identification for Massive Object Graph

Modern data science tools, ranging from batch scripts to computational notebooks, rely on massive and evolving object graphs that represent structured data, models, plots, and more. Persisting these objects is critical, not only to make systems robust against unexpected failures but also to support continuous, non-linear data exploration via versioning. Existing object persistence mechanisms (e.g., Pickle, Dill) rely on complete snapshotting, which often stores unchanged objects redundantly during execution and exploration and wastes both time and storage. Unlike DBMSs, data science systems lack centralized buffer managers that track dirty objects. Worse, object states span various locations such as memory heaps, shared memory, GPUs, and remote machines, making dirty object identification fundamentally more challenging. In this work, we propose a graph-based object store, named Chipmink, that acts like a centralized buffer manager. Unlike static pages in DBMSs, persistence units in Chipmink are dynamically induced by partitioning objects into appropriate subgroups (called pods) so as to minimize expected persistence costs based on object sizes and reference structure. These pods effectively isolate dirty objects, enabling efficient partial persistence. Our experiments show that Chipmink is general, supporting libraries that rely on shared memory, GPUs, and remote objects. Moreover, Chipmink achieves up to 36.5x smaller storage sizes and 12.4x faster persistence than the best baselines in real-world notebooks and scripts.
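To make the contrast between complete snapshotting and pod-level partial persistence concrete, here is a minimal Python sketch. It is not Chipmink's algorithm or API: the `PodStore` class, its methods, and the pod names are hypothetical, the pods are fixed by hand rather than chosen by a cost-minimizing partitioner, and dirty pods are detected by hashing their pickled bytes, which is an assumption made only for illustration.

```python
import hashlib
import pickle


class PodStore:
    """Toy store (hypothetical): writes only pods whose pickled bytes changed."""

    def __init__(self):
        self.blobs = {}     # digest -> pickled bytes (stand-in for disk storage)
        self.versions = []  # one {pod_name: digest} manifest per save

    def save(self, pods):
        """Persist `pods` (name -> object); return the names actually written."""
        manifest, written = {}, []
        for name, obj in pods.items():
            data = pickle.dumps(obj)
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:  # new or dirty pod
                self.blobs[digest] = data
                written.append(name)
            manifest[name] = digest
        self.versions.append(manifest)
        return written

    def load(self, version):
        """Rebuild every pod as it was at the given version index."""
        manifest = self.versions[version]
        return {name: pickle.loads(self.blobs[d]) for name, d in manifest.items()}


store = PodStore()
state = {"data": list(range(1000)), "model": {"w": [0.1, 0.2]}}
first = store.save(state)      # both pods are new, so both are written
state["model"]["w"][0] = 0.5   # mutate only the small pod
second = store.save(state)     # only the dirty "model" pod is written again
```

In this toy, the large `data` pod is stored once and shared by both versions, while a full snapshot would have stored it twice. The sketch still serializes every pod on each save in order to detect changes, so it illustrates only the storage saving, not the faster persistence reported in the abstract.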
