Development of machine learning-based models to predict 10-year risk of cardiovascular disease: a prospective cohort study

oleh: Yu Guo, Jin-Tai Yu, Ming Yang, Wei Cheng, Hui-Fu Wang, Jia You, Ju-Jiao Kang, Jian-Feng Feng

Format: Article
Diterbitkan: BMJ Publishing Group

Deskripsi

Background Previous prediction algorithms for cardiovascular diseases (CVD) were established using risk factors retrieved largely based on empirical clinical knowledge. This study sought to identify predictors among a comprehensive variable space, and then employ machine learning (ML) algorithms to develop a novel CVD risk prediction model.Methods From a longitudinal population-based cohort of UK Biobank, this study included 473 611 CVD-free participants aged between 37 and 73 years old. We implemented an ML-based data-driven pipeline to identify predictors from 645 candidate variables covering a comprehensive range of health-related factors and assessed multiple ML classifiers to establish a risk prediction model on 10-year incident CVD. The model was validated through a leave-one-center-out cross-validation.Results During a median follow-up of 12.2 years, 31 466 participants developed CVD within 10 years after baseline visits. A novel UK Biobank CVD risk prediction (UKCRP) model was established that comprised 10 predictors including age, sex, medication of cholesterol and blood pressure, cholesterol ratio (total/high-density lipoprotein), systolic blood pressure, previous angina or heart disease, number of medications taken, cystatin C, chest pain and pack-years of smoking. Our model obtained satisfied discriminative performance with an area under the receiver operating characteristic curve (AUC) of 0.762±0.010 that outperformed multiple existing clinical models, and it was well-calibrated with a Brier Score of 0.057±0.006. Further, the UKCRP can obtain comparable performance for myocardial infarction (AUC 0.774±0.011) and ischaemic stroke (AUC 0.730±0.020), but inferior performance for haemorrhagic stroke (AUC 0.644±0.026).Conclusion ML-based classification models can learn expressive representations from potential high-risked CVD participants who may benefit from earlier clinical decisions.