Project overview

Cardiovascular disease (CVD) poses a significant global health challenge, leading to millions of annual fatalities and imposing a substantial burden on healthcare systems worldwide.

Diabetes mellitus is a major risk factor for cardiovascular disease.
Evaluating cardiovascular health precisely requires the best-known algorithms and biomarkers. This matters not only for cost-effectiveness but also for timely, accurate diagnosis and optimal patient care.

Data set

The data set in this project includes cardiovascular risk factors such as age, gender, blood pressure, cholesterol, smoking status, and patient treatment status.

First, we developed and optimized a model based on expert opinion to predict the likelihood of diabetic patients developing cardiovascular disease within the next ten years. We then evaluated the model against the Framingham calculator, the ASCVD calculator, and the biomarker HS-CRP.

The Framingham calculator provides a set of algorithms that predict the likelihood of cardiovascular disease over the next 10 years. The ASCVD risk score and HS-CRP serve as comparable reference points, since they too rest on extensive studies conducted by various institutions.

In the Framingham calculator, the cardiovascular age may differ from the chronological age of the patient. For example, if someone is 50 years old, their cardiovascular age could be 70 years, indicating that the condition of their heart and arteries is worse than their chronological age suggests.

The ASCVD Risk Estimator also incorporates race into its calculations.

Data cleaning and preprocessing

Data Gathering

The medical records of 2153 patients who regularly visit a diabetes management center in Isfahan, Iran, were gathered. Subsequently, by applying specific inclusion and exclusion criteria, the dataset was streamlined to 418 entries.

Importing data from risk score calculators into the dataset

  • The Framingham Risk Score (FPR), ASCVD Risk Score, and heart/vascular age are calculated for each patient.

  • HS-CRP values are obtained from laboratory tests.

Encoding categorical variables

Categorical variables such as Sex, Smoking, and HTN_treat are encoded as binary numerical values.
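The encoding step can be sketched as follows. This is a minimal illustration, not the project's actual code: the raw category labels ("male", "yes", and so on) and the choice of which label maps to 1 are assumptions, since the source only names the columns.

```python
# Hypothetical raw labels; the source does not show the dataset's actual values.
BINARY_CODES = {
    "Sex": {"female": 0, "male": 1},
    "Smoking": {"no": 0, "yes": 1},
    "HTN_treat": {"no": 0, "yes": 1},
}

def encode_record(record):
    """Return a copy of the record with the categorical fields mapped to 0/1."""
    encoded = dict(record)
    for column, codes in BINARY_CODES.items():
        raw = str(record[column]).strip().lower()
        if raw not in codes:
            # Fail loudly instead of silently guessing a code.
            raise ValueError(f"Unexpected value {record[column]!r} in column {column}")
        encoded[column] = codes[raw]
    return encoded

# Made-up patient record, for illustration only.
patient = {"Age": 58, "Sex": "Male", "Smoking": "no", "HTN_treat": "Yes"}
encoded_patient = encode_record(patient)
print(encoded_patient)
```

Raising an error on unknown labels keeps data-entry variants from being coded as 0 by accident.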

Feature selection

  • Removing irrelevant features to increase the accuracy of our model.

  • Removing missing and null values
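As a rough sketch of the missing-value step, the snippet below drops every record that lacks a usable value in a required column. Dropping whole records (rather than imputing) and the column names are assumptions made for illustration; the source does not say how missing values were handled in detail.

```python
import math

def is_missing(value):
    """Treat None, blank strings, and NaN as missing."""
    if value is None:
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return isinstance(value, float) and math.isnan(value)

def drop_incomplete(records, required_columns):
    """Keep only records with a usable value in every required column."""
    return [
        r for r in records
        if not any(is_missing(r.get(col)) for col in required_columns)
    ]

# Made-up records with hypothetical column names.
records = [
    {"Age": 61, "Cholesterol": 210.0},
    {"Age": 55, "Cholesterol": None},
    {"Age": 47, "Cholesterol": float("nan")},
]
complete = drop_incomplete(records, ["Age", "Cholesterol"])
print(len(complete), "of", len(records), "records kept")
```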

Data transformation

Converting numerical data into binary variables for comparison and clustering:

  • High Risk

  • Non-High Risk

Data splitting

Creating training and test sets, with blocking based on expert opinion.

Data set after cleaning and preprocessing

Logistic regression

Predicting the risk of cardiovascular disease development in diabetic patients using expert opinions.

  • Choosing one variable from each group of highly correlated variables
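One way to flag highly correlated variables is a pairwise Pearson correlation filter, sketched below. In the project the choice among correlated variables was guided by expert opinion; here, as a stand-in, the first-listed variable of each correlated pair is kept. The 0.8 threshold, the column names, and the data are all assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)  # assumes neither column is constant

def drop_correlated(columns, threshold=0.8):
    """Greedy filter: keep a column only if |r| with every kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up values; SBP and DBP move together, Age does not.
columns = {
    "SBP": [120, 135, 150, 160, 142],
    "DBP": [80, 88, 95, 100, 90],
    "Age": [62, 45, 58, 50, 71],
}
selected = drop_correlated(columns)
print(selected)
```

Because the filter is order-dependent, listing the expert-preferred variable first is a simple way to make it the one that survives.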

Logistic regression: model 1

Creating multiple models from different combinations of variables and comparing them.

Logistic regression: model 2

Logistic regression: model 3

Logistic regression: model 4

Logistic regression: model 5 (final model)

Best model: model 5, with the classification threshold set to 0.4

We aimed for a test with high sensitivity, meaning that few false negative results would occur, resulting in a lower chance of overlooking cases involving high-risk patients.
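The effect of the threshold on sensitivity can be illustrated with a small sketch. The labels and predicted probabilities below are made up, and treating a probability equal to the threshold as high risk is an assumption; only the 0.4 threshold comes from the project.

```python
def classify(probabilities, threshold=0.4):
    """Label a patient high risk (1) when the predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def sensitivity(y_true, y_pred):
    """True positives divided by all actual positives (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

# Made-up outcomes (1 = high risk) and model probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.45, 0.70, 0.30, 0.42, 0.10, 0.55]

sens_at_04 = sensitivity(y_true, classify(y_prob, 0.4))
sens_at_05 = sensitivity(y_true, classify(y_prob, 0.5))
print(sens_at_04, sens_at_05)
```

On this toy data, lowering the threshold from 0.5 to 0.4 catches one more high-risk patient, at the cost of one additional false positive.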

Clustering

We clustered the data into high-risk and non-high-risk groups for patients aged 60, according to the cut-off points of the different calculators:

  1. Framingham Risk Score: 20%

  2. ASCVD Risk Score: 7.5%

  3. HS-CRP: 3 mg/L
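Applying the cut-off points listed above to turn each score into a High Risk / Non-High Risk flag might look like the sketch below. The field names are hypothetical, and treating a value exactly at the cut-off as high risk is an assumption the source does not confirm.

```python
# Cut-off points from the list above. Field names are hypothetical.
CUTOFFS = {
    "framingham_pct": 20.0,  # Framingham Risk Score, percent
    "ascvd_pct": 7.5,        # ASCVD Risk Score, percent
    "hs_crp_mg_l": 3.0,      # HS-CRP, mg/L
}

def to_high_risk_flags(patient):
    """Return 1 (High Risk) or 0 (Non-High Risk) for each score.

    Assumption: a value at or above the cut-off counts as high risk.
    """
    return {name: int(patient[name] >= cutoff) for name, cutoff in CUTOFFS.items()}

# Made-up patient values, for illustration only.
patient = {"framingham_pct": 23.5, "ascvd_pct": 6.1, "hs_crp_mg_l": 3.0}
flags = to_high_risk_flags(patient)
print(flags)
```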

Hierarchical clustering was conducted using the following variables:

  • H_D_Prev

  • New_Hs_CRP

  • New_ASCVD_rscore 10

  • New_FPR_Score

  • New_Patient_H_V_age

  • Model Result

d(Model, H_D_Pred) = 0.099

d(Model, New_ASCVD_rscore 10) = 0.19

d(Model, New_hs_CRP) = 0.33

d(Model, New_FPR_score) = 0.68

d(Model, New_H_V_age) = 0.092

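By these distances, our model is closest to H_V_age (0.092) and H_D_Pred (0.099), followed by ASCVD_rscore (0.19). HS-CRP (0.33) and the Framingham score (0.68) are farther away.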
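The source does not state which distance metric produced these values. As one plausible reading, the sketch below computes the mismatch (Hamming) distance between two binary label vectors, that is, the fraction of patients on which they disagree. The metric choice, the shortened variable names, and the data are all assumptions for illustration.

```python
def mismatch_distance(a, b):
    """Fraction of patients on which two binary label vectors disagree."""
    if len(a) != len(b):
        raise ValueError("Label vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Made-up High Risk (1) / Non-High Risk (0) labels for ten patients.
labels = {
    "Model":  [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
    "ASCVD":  [1, 0, 1, 0, 0, 0, 1, 0, 1, 0],
    "HS_CRP": [0, 0, 1, 1, 1, 0, 0, 1, 1, 0],
}

# Distance from the model's labels to every other variable.
distances = {
    name: mismatch_distance(labels["Model"], values)
    for name, values in labels.items()
    if name != "Model"
}
closest = min(distances, key=distances.get)
print(distances, closest)
```

A smaller distance means the variable assigns patients to the same risk group as the model more often, which is how the values above are read.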

Tableau visualization

Conclusion

  • The model, built on expert opinion, effectively predicts patients at high risk of CVD.

  • Compared against the Framingham Risk Score and the ASCVD Risk Score, the model's predictions are more similar to the ASCVD Risk Score than to the Framingham Risk Score.

  • The heart/vascular age variable has strong potential for predicting CVD, showing high resemblance to the model.

  • By contrast, HS-CRP shows no significant correlation with the model in predicting high-risk patients.