Diabetes is a serious global health problem, and successful intervention depends on early detection. However, overlapping risk factors and imbalanced data make prediction difficult. This study aims to build a machine learning framework for diabetes classification that is both accurate and interpretable, using large-scale health survey data, so that its results can support clinical decision-making. We evaluated several supervised learning techniques on the 2015 BRFSS dataset, which contains roughly 253,680 records with 22 numerical features. SMOTE and Tomek Links were used to correct class imbalance. To improve predictive performance, we investigated both individual models and ensemble techniques such as stacking. The individual models Random Forest, XGBoost, CatBoost, and LightGBM each attained strong ROC-AUC performance of approximately 0.96. The stacking ensemble of XGBoost and KNN yielded the best overall results, with 94.82% accuracy, a ROC-AUC of 0.989, and a PR-AUC of 0.991, indicating a favourable balance between recall and precision. We also developed a React Native application with a Python Flask backend to support early diabetes prediction, providing users with an accessible and efficient health monitoring tool.
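The class-imbalance step can be illustrated with a minimal from-scratch sketch. It is not the authors' implementation, and the abstract does not say which library or settings they used. The function names (`smote`, `tomek_links`, `smote_tomek`), the neighbor count `k=3`, the choice to oversample until the classes are equal, and the choice to drop both points of every Tomek link are all assumptions made for illustration. The sketch also assumes at least two minority samples.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k closest other minority samples (Euclidean distance).
        neighbors = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class samples that are
    each other's nearest neighbor."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links


def smote_tomek(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity, then remove Tomek links."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)  # majority count minus minority count
    new_points = smote(minority, n_new, k, seed)
    X_res = list(X) + new_points
    y_res = list(y) + [minority_label] * len(new_points)
    # Assumption: drop both members of each link, not only the majority one.
    drop = {idx for pair in tomek_links(X_res, y_res) for idx in pair}
    keep = [idx for idx in range(len(X_res)) if idx not in drop]
    return [X_res[idx] for idx in keep], [y_res[idx] for idx in keep]
```

The brute-force neighbor search here is quadratic in the number of samples, so it only suits toy data. For a dataset of the size described, a library implementation such as `SMOTETomek` from `imblearn.combine` would be the more practical choice.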