Datasets may contain observations with multiple labels. When the labels are not mutually exclusive and vary greatly in frequency, it is challenging to obtain a sample that both includes enough observations with the scarcer labels to support inferences about them and deviates from the population frequencies in a known manner. In this paper, we model the multi-label problem with a multivariate Bernoulli distribution. We present a novel sampling algorithm that takes label dependencies into account. It uses observed label frequencies to estimate the parameters of the multivariate Bernoulli distribution and to calculate a weight for each label combination. The resulting weighted sampling attains the characteristics of the target distribution while accounting for label dependencies. We applied this approach to a sample of research articles from Web of Science labeled with 64 biomedical topic categories. We aimed to preserve the frequency order of the categories, reduce the frequency differences between the most and least common categories, and account for category dependencies. The approach produced a more balanced sub-sample with improved representation of minority categories.
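The abstract does not spell out how the weights are computed, so the following is only a minimal sketch of one way weights per label combination could be obtained, not the authors' algorithm. It assumes the empirical frequencies of observed label combinations serve as the multivariate Bernoulli estimate, and it swaps in raking (iterative proportional fitting) to reach chosen target label frequencies. Raking multiplies each combination's probability by per-label factors only, so the pairwise and higher-order odds ratios between labels carry over from the data. The function name `combination_weights`, the toy data, and the target frequencies are all hypothetical.

```python
import numpy as np

def combination_weights(Y, target_marginals, n_iter=200, tol=1e-10):
    """Rake empirical label-combination frequencies to target label marginals.

    Hypothetical helper. Assumes every label is both present and absent
    somewhere in Y, and that the targets lie strictly between 0 and 1.
    """
    Y = np.asarray(Y, dtype=int)
    target = np.asarray(target_marginals, dtype=float)

    # Empirical pmf over the observed label combinations.
    combos, inverse, counts = np.unique(
        Y, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()  # the shape differs across NumPy versions
    p_hat = counts / counts.sum()

    q = p_hat.copy()
    for _ in range(n_iter):
        for k in range(combos.shape[1]):
            has_k = combos[:, k] == 1
            current = q[has_k].sum()
            # Rescale so that label k has exactly its target frequency.
            q[has_k] *= target[k] / current
            q[~has_k] *= (1.0 - target[k]) / (1.0 - current)
        if np.max(np.abs(combos.T @ q - target)) < tol:
            break

    # Weight of a combination = target probability / empirical probability.
    return combos, (q / p_hat)[inverse]

# Toy data (an assumption, not from the paper): three dependent labels
# with very different frequencies.
rng = np.random.default_rng(0)
n = 5000
common = rng.random(n) < 0.6
mid = rng.random(n) < np.where(common, 0.3, 0.1)
rare = rng.random(n) < np.where(mid, 0.10, 0.02)
Y = np.column_stack([common, mid, rare]).astype(int)

# Hypothetical targets: same frequency order, smaller gap between labels.
target = np.array([0.5, 0.3, 0.15])
combos, row_w = combination_weights(Y, target)

# Draw a weighted sub-sample (with replacement, for simplicity).
idx = rng.choice(n, size=1000, replace=True, p=row_w / row_w.sum())
print("population frequencies:", Y.mean(axis=0))
print("sub-sample frequencies:", Y[idx].mean(axis=0))
```

Because each row's weight is a known ratio of target to empirical probability, the sub-sample deviates from the population frequencies in a known manner, and estimates can be reweighted back to the population if needed.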