Khan QW. Evaluating Machine Learning Models for Early Diabetes Prediction: A Comparative Study

Qazi Waqas Khan

Research Article

Austin Diabetes Res. 2024; 9(1): 1032.

Evaluating Machine Learning Models for Early Diabetes Prediction: A Comparative Study

Qazi Waqas Khan*

Department of Computer Engineering, Jeju National University, Jejusi 63243, Jeju Special Self-Governing Province, Republic of Korea

*Corresponding author: Qazi Waqas Khan, Department of Computer Engineering, Jeju National University, Jejusi 63243, Jeju Special Self-Governing Province, Republic of Korea. Email: waqasqazi19@stu.jejunu.ac.kr

Received: July 01, 2024 Accepted: July 18, 2024 Published: July 25, 2024

Abstract

High blood glucose levels can damage the body’s organs, causing blindness, kidney disease, and heart disease. Globally, diabetic patients experience a mortality rate of 38% yearly. Machine learning methods have been used in the literature to predict diabetes, and their predictions can assist doctors in making early decisions. This study employed the Neural Oblivious Decision Ensembles (NODE), eXtreme Gradient Boosting (XGB), AdaBoost, and Support Vector Machine (SVM) models to diagnose diabetes. The experiments were conducted on an early diabetes risk dataset. Principal Component Analysis (PCA) is employed to extract the features. The classifiers are evaluated using accuracy, precision, recall, and F-score. The experimental results show that the XGB model achieved higher prediction performance than the SVM, AdaBoost, and NODE models. These findings suggest that this approach can assist stakeholders in the early diagnosis of diabetes.

Introduction

High blood sugar is a leading cause of death, which makes diabetes a destructive chronic illness and an alarming public health concern. According to the WHO, the number of diabetic patients increased from 108 million in 1980 to 422 million in 2014 [1]. About 8.5% of adults worldwide and 30.3 million people in the U.S. are affected by diabetes [2]. China and India, the two most populous countries, have the highest numbers of diabetes cases, with 98 million and 65.1 million, respectively [3]. Both types of diabetes are serious conditions. In type 1 diabetes, the immune system attacks the pancreas and impairs the production of insulin; type 2 diabetes involves insulin resistance, which prevents the body from using insulin effectively and causes high blood glucose levels [4]. Diabetes cannot be cured, but it can be treated, and early diagnosis can minimize the risk of complications [5]. A balanced diet and early detection can increase an individual’s lifespan. Detecting diabetes at an early stage based on a doctor’s assessment alone can be inaccurate because of gaps in understanding the related patterns [6]. However, predictive analytics can improve the identification of at-risk individuals, anticipate complications, and enhance treatment results [7]. With this support, a doctor can determine the most effective treatment course for each patient affected by diabetes, leading to better outcomes [8]. Therefore, a Computer-Aided Diagnosis (CAD) system can help physicians make better decisions when diagnosing diabetes at an early stage [9]. The CAD system analyzes blood sugar levels, haemoglobin A1C levels, and other useful clinical data to detect diabetes and suggest necessary actions based on the information obtained.

The Neural Oblivious Decision Ensembles (NODE), eXtreme Gradient Boosting (XGB), AdaBoost, and Support Vector Machine (SVM) models are used in this study to predict diabetes. A label encoder converts the text categories into numeric values, and the standard scaler rescales the feature values to a common scale.

The rest of the paper is structured as follows: Section 2 explains the proposed method for diabetes prediction, and Sections 3 and 4 present the experimental results and the conclusion, respectively.

Proposed Methods

This section briefly describes the proposed machine learning models used for diabetes prediction. Figure 1 shows the architecture diagram, in which the early diabetes risk dataset is the input to the data pre-processing module. In data pre-processing, the label encoding and standard scaler methods are applied for data preparation. The prepared data is then passed as input to the proposed models for diabetes classification. The performance of each machine learning model is analyzed using accuracy, precision, recall, and F-score.

Figure 1: Comparison graph of Machine Learning Model for Diabetes Classification (with PCA feature extraction) based on accuracy.

Figure 2: Comparison graph of Machine Learning Model for Diabetes Classification (with PCA feature extraction) based on F-score.

Figure 3: Comparison graph of Machine Learning Model for Diabetes Classification (without PCA feature extraction) based on accuracy.

Figure 4: Comparison graph of Machine Learning Model for Diabetes Classification (without PCA feature extraction) based on F-score.

Data Preprocessing

Data preprocessing cleans and prepares the raw data for the learning models. In this study, the scikit-learn label encoder converts the text categories into numeric values, and the z-score method normalizes the features to a common scale.
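The two preprocessing steps can be illustrated with a minimal sketch. It uses only the Python standard library rather than the scikit-learn classes used in the study, and the toy records (`answers`, `ages`) are hypothetical values, not rows from the dataset. The helper names `label_encode` and `z_score` are also illustrative. Like scikit-learn's LabelEncoder, the sketch assigns codes in sorted class order, and like StandardScaler it divides by the population standard deviation.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Map each distinct category to an integer code, in sorted class order."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


def z_score(column):
    """Standardize a numeric column to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column carries no spread
    return [(x - mu) / sigma for x in column]


# Hypothetical toy records, not taken from the dataset used in the study.
answers = ["Yes", "No", "No", "Yes"]
ages = [40, 58, 41, 45]

codes, classes = label_encode(answers)  # "No" -> 0, "Yes" -> 1
scaled_ages = z_score(ages)
```

Note that z-score standardization centers each feature at zero with unit variance; it does not confine the values to a fixed interval.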

Machine Learning Models

This study utilized the NODE, XGB, AdaBoost, and SVM models to classify diabetes. These models are used with and without Principal Component Analysis (PCA). The PCA method extracts features by transforming the inputs into n components, where n is the dimensionality of the target feature space.
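As a rough illustration of what PCA computes, the sketch below centers the data, builds the covariance matrix, and finds the first principal component by power iteration. This is a simplification under stated assumptions: it recovers only the leading component (further components would need deflation or a full eigendecomposition), the data points are hypothetical, and the study does not say how its PCA was implemented or how many components were kept.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_component(cov, iterations=200):
    """Leading eigenvector of a covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Coordinates of each centered row along one component."""
    return [sum(a * b for a, b in zip(r, component)) for r in rows]


# Hypothetical 2-D points lying close to the line y = 2x.
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
centered = center(points)
pc1 = top_component(covariance(centered))
scores = project(centered, pc1)  # one extracted feature per sample
```

Here the first component points roughly along the direction (1, 2), so a single extracted feature retains almost all of the variance in the two original features.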

Neural Oblivious Decision Ensembles (NODE)

NODE [11] is a machine-learning architecture that merges decision trees with neural network architectures to improve predictive accuracy. By utilizing ensembles of oblivious decision trees, NODE captures complex patterns in the data, making it suitable for classification and regression tasks.
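To make the term concrete: in an oblivious decision tree, every node at the same depth applies the same feature and threshold, so a tree of depth d has 2^d leaves and the leaf is selected by the d comparison results read as a binary number. The sketch below evaluates a hard (non-differentiable) oblivious tree. It is only an illustration of the tree structure, not the NODE model itself, which relaxes these hard choices so that they can be trained by gradient descent. The splits and leaf values are hypothetical.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate a hard oblivious decision tree on one sample.

    splits: list of (feature_index, threshold), one pair per tree level.
    leaf_values: list of 2 ** len(splits) responses, indexed by the bit
                 pattern of the comparison results (first level = high bit).
    """
    leaf = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit
    return leaf_values[leaf]


# Hypothetical depth-2 tree: level 0 tests feature 0, level 1 tests feature 1.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.7, 0.9]

prediction = oblivious_tree_predict([0.8, 3.0], splits, leaf_values)
```

Because all nodes at a level share one test, the whole tree reduces to d comparisons and a table lookup, which is what makes this structure convenient to embed in a neural network layer.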

eXtreme Gradient Boosting

eXtreme Gradient Boosting (XGB) [12] is a powerful and effective gradient-boosting framework that performs well on classification and regression tasks. It employs parallel processing and advanced regularization techniques to boost model robustness and prevent overfitting.
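The role of regularization can be sketched with the leaf-weight and split-gain formulas from the XGBoost objective, where G and H are the sums of first and second derivatives of the loss over the samples in a leaf. This is a minimal sketch of those two formulas only, not the library's implementation; the parameter names `reg_lambda` and `gamma`, their default values here, and the three toy samples are assumptions made for illustration, and the study does not report its hyperparameters.

```python
def leaf_weight(grads, hessians, reg_lambda=1.0):
    """Optimal leaf value -G / (H + lambda) under L2 regularization."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a leaf; gamma penalizes each new leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma


# Hypothetical samples under logistic loss with current probability p = 0.5:
# gradient g = p - y and hessian h = p * (1 - p).
labels = [1, 1, 0]
grads = [0.5 - y for y in labels]
hessians = [0.25 for _ in labels]

regularized = leaf_weight(grads, hessians, reg_lambda=1.0)
unregularized = leaf_weight(grads, hessians, reg_lambda=0.0)
```

A larger `reg_lambda` shrinks each leaf value toward zero, and a larger `gamma` rejects splits whose gain is small; both limit how closely individual trees fit the training data.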

AdaBoost

Adaptive boosting, known as AdaBoost [13], is a machine-learning algorithm that combines multiple weak classifiers into a strong classifier. In each subsequent iteration, it increases the weights of misclassified instances, focusing more on challenging cases to enhance performance in both classification and regression.
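The reweighting step can be sketched for the classic binary form of AdaBoost with labels in {-1, +1}. This is an assumption about the variant: the study does not state which AdaBoost formulation or base learner it used, and the labels and predictions below are hypothetical. The function name `adaboost_round` is illustrative, and the input weights are assumed to sum to 1.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: classifier weight alpha and updated sample weights.

    Labels and predictions are in {-1, +1}; weights are assumed to sum to 1.
    """
    # Weighted error of the current weak classifier.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * p = -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Hypothetical round: five equally weighted samples, one misclassified.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]
weights = [0.2] * 5

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get that sample right.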

Support Vector Machine

The SVM [14] is a supervised machine-learning algorithm that classifies cases by finding the optimal hyperplane, the one that maximizes the margin between the classes. It uses a kernel function to map the data into a higher-dimensional feature space for better class separation.
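As an illustration of how a kernel enters the prediction, the sketch below evaluates the decision function of an already-trained kernel SVM: a weighted sum of kernel similarities between the input and the support vectors, plus a bias. The radial basis function (RBF) kernel is an assumption, since the study does not name the kernel it used, and the support vectors, coefficients, bias, and `gamma` value are hypothetical rather than learned from data.

```python
import math


def rbf_kernel(a, b, gamma=0.5):
    """RBF similarity exp(-gamma * ||a - b||^2) between two points."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def decision_function(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Signed score of x; the predicted class is the sign of the score.

    dual_coefs holds alpha_i * y_i for each support vector.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical trained model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [-1.0, 1.0]
bias = 0.0

near_positive = decision_function([1.9, 2.1], support_vectors, dual_coefs, bias)
near_negative = decision_function([0.1, -0.1], support_vectors, dual_coefs, bias)
```

The kernel is evaluated directly on pairs of input points, so the higher-dimensional feature space is never constructed explicitly.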

Evaluation Metrics

This study validated the performance of each machine learning model using the accuracy, precision, recall, and F-score evaluation metrics.
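The four metrics can be computed as in the sketch below, which averages the per-class precision, recall, and F-score weighted by class support. The averaging scheme is an assumption: the paper does not state how the per-class values were combined. It is at least consistent with Tables 1 and 2, where recall equals accuracy in every row, a property that support-weighted recall always has. The labels and predictions are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F-score."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    totals = {"precision": 0.0, "recall": 0.0, "f_score": 0.0}
    for label, support in Counter(y_true).items():
        tp = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        weight = support / n  # share of samples that belong to this class
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f_score"] += weight * f_score
    return {"accuracy": accuracy, **totals}


# Hypothetical test labels: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

metrics = weighted_metrics(y_true, y_pred)
```

In this toy case the accuracy and weighted recall are both 0.8, the weighted precision is 0.85, and the weighted F-score is about 0.781.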

Results and Discussion

Table 1 presents the experimental outcomes of the proposed classifiers for classifying diabetes with PCA feature extraction. It shows four classifiers evaluated using accuracy, precision, recall, and F-score. The results show that XGB is highly effective, surpassing all the other models with the highest prediction performance. AdaBoost and NODE also performed well, with high scores across all metrics, while SVM showed lower performance than all the other proposed models, with moderate scores across all metrics.

Table 1: Experimental Results of a Machine Learning Model for Diabetes Classification (with PCA feature extraction).

Model Name   Accuracy   Precision   Recall   F-score
AdaBoost     0.9423     0.9469      0.9423   0.9431
SVM          0.8654     0.8654      0.8654   0.8654
XGB          0.9615     0.9629      0.9615   0.9618
NODE         0.9327     0.9334      0.9327   0.9330

Table 2: Experimental Results of a Machine Learning Model for Diabetes Classification (without PCA feature extraction).

Model Name   Accuracy   Precision   Recall   F-score
AdaBoost     0.9135     0.9143      0.9135   0.9138
SVM          0.8942     0.8952      0.8942   0.8946
XGB          0.9808     0.9819      0.9808   0.9809
NODE         0.9423     0.9439      0.9423   0.9427

Figure 2 compares the F-score of the proposed machine-learning models used for diabetes classification with PCA feature extraction. The chart shows that XGB achieves the highest F-score of 96.18%, with AdaBoost and NODE slightly lower, while SVM has the lowest F-score among all the proposed classifiers.

Table 2 presents the experimental outcomes of the four proposed machine-learning classifiers for classifying diabetes without PCA feature extraction. The results show that XGB is a highly effective classifier in this case as well, as it achieves the highest score across all the metrics; NODE and AdaBoost also performed well, while SVM shows the lowest performance of the four classifiers. The results also indicate that XGB outperforms the other models with or without PCA extraction, that XGB, SVM, and NODE perform better without PCA extraction, and that AdaBoost performs slightly worse when PCA is not used.
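The effect of PCA on each model can be read off directly from the accuracy columns of Tables 1 and 2. The short sketch below computes the change in percentage points when PCA is removed; the accuracy values are copied from the two tables, and only the variable names are introduced here.

```python
# Accuracy values copied from Table 1 (with PCA) and Table 2 (without PCA).
with_pca = {"AdaBoost": 0.9423, "SVM": 0.8654, "XGB": 0.9615, "NODE": 0.9327}
without_pca = {"AdaBoost": 0.9135, "SVM": 0.8942, "XGB": 0.9808, "NODE": 0.9423}

# Change in accuracy, in percentage points, when PCA is not applied.
delta_points = {
    model: round((without_pca[model] - with_pca[model]) * 100, 2)
    for model in with_pca
}

for model, delta in delta_points.items():
    print(f"{model}: {delta:+.2f} points without PCA")
```

Removing PCA changes accuracy by about -2.88 points for AdaBoost, +2.88 for SVM, +1.93 for XGB, and +0.96 for NODE.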

Figure 3 compares the accuracy of the proposed machine-learning models used for diabetes classification without PCA feature extraction. The chart shows that XGB achieves the highest accuracy of 98.08%, demonstrating its effectiveness in diabetes classification; NODE also performs well but is slightly lower than XGB. AdaBoost shows moderate performance, and SVM has the lowest accuracy of all the classifiers.

Figure 4 compares the F-score of the proposed machine-learning models for diabetes classification without PCA feature extraction. The chart shows that XGB leads in performance with the highest F-score of 98.09%. NODE also performs well but is slightly lower than XGB. AdaBoost shows moderate performance, and SVM has the lowest F-score of all the classifiers.

Conclusion

Computer-aided diagnosis systems are crucial for diagnosing diseases, as they assist doctors in detecting a disease. Diabetes is a chronic disease, and detecting it at the earliest stage is crucial to saving a patient's life. This study employed machine learning-based methods for diagnosing diabetes. The experiments show that the XGB model achieved higher prediction performance than the other models. In contrast, the SVM model achieved lower prediction accuracy than the other proposed models. The experimental findings suggest that this method can help stakeholders in the early diagnosis of diabetes. In the future, we plan to use wrapper-based feature selection methods to select the optimal features from the dataset. Further, we will use an ensemble model for the prediction of diabetes.

References


Citation: Khan QW. Evaluating Machine Learning Models for Early Diabetes Prediction: A Comparative Study. Austin Diabetes Res. 2024; 9(1): 1032.
