We present repliclust (from repli-cate and clust-er), a Python package for generating synthetic data sets with clusters. Our approach is based on data set archetypes, high-level geometric descriptions from which the user can create many different data sets, each possessing the desired geometric characteristics. The architecture of our software is modular and object-oriented, decomposing data generation into algorithms for placing cluster centers, sampling cluster shapes, selecting the number of data points for each cluster, and assigning probability distributions to clusters. The project webpage, repliclust.org, provides a concise user guide and thorough documentation.
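The four-stage decomposition described above can be illustrated with a minimal NumPy sketch. This is not repliclust's API: the `ToyArchetype` class, its fields, and the `synthesize` function are hypothetical names introduced only for illustration, and the specific choices (uniform centers, random ellipsoidal shapes, Dirichlet-weighted cluster sizes, Gaussian clusters) are assumptions standing in for the package's actual algorithms.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyArchetype:
    """Hypothetical high-level description of a family of data sets."""
    n_clusters: int = 4
    dim: int = 2
    n_samples: int = 400
    spread: float = 10.0      # half-width of the box in which centers are placed
    max_axis_len: float = 2.0  # longest allowed cluster axis


def synthesize(archetype, seed=0):
    """Draw one data set (X, y) matching the archetype's description."""
    rng = np.random.default_rng(seed)
    k, d = archetype.n_clusters, archetype.dim

    # Stage 1: place cluster centers (here: uniformly in a box).
    centers = rng.uniform(-archetype.spread, archetype.spread, size=(k, d))

    # Stage 2: sample cluster shapes as random ellipsoids,
    # each given by orthonormal axes and per-axis lengths.
    axes = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(k)]
    lengths = rng.uniform(0.5, archetype.max_axis_len, size=(k, d))

    # Stage 3: select the number of data points for each cluster.
    weights = rng.dirichlet(np.full(k, 5.0))
    sizes = rng.multinomial(archetype.n_samples, weights)

    # Stage 4: assign a probability distribution to each cluster
    # (here: every cluster is Gaussian) and sample from it.
    blocks = []
    for i in range(k):
        z = rng.standard_normal((sizes[i], d))
        blocks.append(centers[i] + (z * lengths[i]) @ axes[i].T)

    X = np.vstack(blocks)
    y = np.repeat(np.arange(k), sizes)
    return X, y
```

Calling `synthesize(ToyArchetype(), seed=s)` with different values of `s` yields different data sets that share the same high-level description, which is the role an archetype plays in the text above.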