Recently, offline reinforcement learning (RL) has become a popular RL paradigm. In offline RL, data providers share pre-collected datasets -- either as individual transitions or sequences of transitions forming trajectories -- to enable the training of RL models (also called agents) without direct interaction with the environment. Offline RL thus reduces costly environment interactions compared to traditional RL and has proven effective in critical domains such as navigation. Meanwhile, concerns about privacy leakage from offline RL datasets have emerged. To safeguard private information in offline RL datasets, we propose the first differentially private (DP) offline dataset synthesis method, PrivORL, which leverages a diffusion model and a diffusion transformer to synthesize transitions and trajectories, respectively, under DP. The synthetic dataset can then be securely released for downstream analysis and research. PrivORL adopts the popular approach of pre-training a synthesizer on public datasets and then fine-tuning it on the sensitive dataset using DP Stochastic Gradient Descent (DP-SGD). Additionally, PrivORL introduces curiosity-driven pre-training, which uses feedback from a curiosity module to diversify the synthesizer's outputs, enabling it to generate diverse synthetic transitions and trajectories that closely resemble the sensitive dataset. Extensive experiments on five sensitive offline RL datasets show that our method achieves better utility and fidelity in both DP transition and trajectory synthesis compared to baselines. The replication package is available at the GitHub repository.
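The following is a minimal sketch (not the authors' code) of the pre-train-then-DP-fine-tune recipe described above: a small denoising network for transition tuples, assumed to be pre-trained on public data, is fine-tuned on a sensitive dataset with DP-SGD via Opacus. The MLP denoiser, transition layout, dimensions, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

STATE_DIM, ACTION_DIM = 17, 6                       # assumed transition layout: (s, a, r, s')
TRANSITION_DIM = 2 * STATE_DIM + ACTION_DIM + 1


class Denoiser(nn.Module):
    """Toy stand-in for the diffusion model's noise predictor."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t):
        # Condition the denoiser on the (scalar) diffusion timestep by concatenation.
        return self.net(torch.cat([x_t, t], dim=-1))


model = Denoiser(TRANSITION_DIM)
# model.load_state_dict(torch.load("public_pretrained.pt"))  # hypothetical public-data checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Placeholder for the sensitive transition dataset.
sensitive = TensorDataset(torch.randn(1024, TRANSITION_DIM))
loader = DataLoader(sensitive, batch_size=64, shuffle=True)

# Wrap model/optimizer/loader so every gradient step is per-sample clipped and noised (DP-SGD).
engine = PrivacyEngine()
model, optimizer, loader = engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=10.0,        # illustrative privacy budget
    target_delta=1e-5,
    epochs=5,
    max_grad_norm=1.0,
)

for epoch in range(5):
    for (x0,) in loader:
        if x0.numel() == 0:                          # Poisson sampling can yield empty batches
            continue
        t = torch.rand(x0.size(0), 1)                # diffusion timestep in [0, 1)
        noise = torch.randn_like(x0)
        x_t = torch.sqrt(1 - t) * x0 + torch.sqrt(t) * noise   # simplified forward noising
        loss = ((model(x_t, t) - noise) ** 2).mean() # standard noise-prediction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

After fine-tuning, the denoiser can be sampled to produce synthetic transitions; by post-processing, releasing such samples preserves the DP guarantee accounted for by the privacy engine.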