Data-driven methods for electrocardiogram (ECG) interpretation are progressing rapidly. Large datasets have enabled advances in artificial intelligence (AI)-based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide CODE-II-open, an openly available subset of 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.