Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data

AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited because they rely on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. Yet this federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated Locality-Sensitive Hashing (Fed-LSH), against their centralized counterparts on eight diverse molecular datasets. Our evaluation uses both standard mathematical metrics and a chemistry-informed evaluation metric, SF-ICF, which we introduce in this work. The large-scale benchmarking, combined with an in-depth explainability analysis, shows the importance of incorporating domain knowledge through chemistry-informed metrics and on-client explainability analyses for federated diversity analysis on molecular data.
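To make the idea of federated clustering concrete, the following is a minimal sketch of one common formulation of federated k-means. It is not necessarily the Fed-kMeans variant benchmarked in this work. Each client assigns its local points to the current global centroids and shares only per-cluster sums and counts; the server aggregates these into new centroids. The function names (`local_update`, `server_aggregate`, `fed_kmeans`) and the use of plain NumPy feature vectors (for example, molecular fingerprints) are assumptions made for illustration.

```python
import numpy as np

def local_update(X, centroids):
    """Client side: assign local points to the nearest centroid and
    return per-cluster feature sums and point counts (no raw data)."""
    # Squared Euclidean distances, shape (n_points, k).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, X)    # accumulate features per cluster
    np.add.at(counts, labels, 1)  # count points per cluster
    return sums, counts

def server_aggregate(centroids, client_stats):
    """Server side: combine client statistics into new global centroids.
    Clusters that received no points keep their previous centroid."""
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    new_centroids = centroids.copy()
    nonempty = total_counts > 0
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids

def fed_kmeans(client_data, init_centroids, rounds=10):
    """Run a fixed number of federated k-means rounds (hypothetical driver)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(rounds):
        stats = [local_update(X, centroids) for X in client_data]
        centroids = server_aggregate(centroids, stats)
    return centroids

# Toy usage: two clients, each holding one well-separated group of points.
clients = [
    np.array([[0.0, 0.0], [0.0, 1.0]]),
    np.array([[10.0, 10.0], [10.0, 11.0]]),
]
result = fed_kmeans(clients, init_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

Because the aggregated sums and counts equal those computed on the pooled data, this sketch reproduces a centralized Lloyd iteration given the same initial centroids. A real deployment would likely add protections such as secure aggregation, which are omitted here.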
