Using Three Machine Learning Techniques for Predicting Breast Cancer Recurrence
Mohammad Mehdipour,1,*
1. ORCID ID: 0009-0005-4794-2040 / Comprehensive Health Research Center, Bab.Co, Islamic Azad University, Babol, Iran
Introduction: Despite advances in breast cancer diagnosis and treatment, recurrence remains a serious concern for many patients. Early identification of individuals at high risk of recurrence can lead to more targeted monitoring and improved survival outcomes. Machine learning (ML) algorithms offer significant potential in healthcare analytics, particularly for discovering complex patterns and improving predictive accuracy. This study evaluates the effectiveness of three widely used ML algorithms in predicting two-year recurrence of breast cancer from historical clinical data.
Methods: Data Source and Study Population:
The data used in this study were collected from the Iranian Center for Breast Cancer (ICBC), covering patient records from 1997 to 2008. From an initial dataset of 1,189 records, data cleaning and preprocessing steps refined the dataset to 547 complete records suitable for model development and evaluation.
Predictor Variables:
A total of 22 clinical variables were used as predictors, including patient demographics, tumor pathology, and treatment parameters. The target variable was binary, indicating recurrence or non-recurrence within a two-year follow-up period.
Machine Learning Algorithms:
Three ML algorithms were used to construct and compare predictive models:
1. Decision Tree (C4.5): A rule-based algorithm that splits the data on the attribute with the highest information gain ratio. It is highly interpretable and suitable for clinical settings.
2. Support Vector Machine (SVM): A margin-based classifier that often performs well in high-dimensional spaces; maximizing the margin helps limit overfitting.
3. Artificial Neural Network (ANN): A multilayer feed-forward model capable of capturing complex nonlinear relationships within data.
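To illustrate the split criterion behind the decision tree, the sketch below computes the information gain of a categorical attribute and normalizes it by the split information, which is the gain ratio C4.5 uses to rank candidate splits. It is a simplified illustration only: the attribute name `node_status` and the toy values are hypothetical, not taken from the ICBC data, and the handling of continuous attributes and missing values in full C4.5 is omitted.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of splitting on a categorical feature, divided by split information."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one categorical predictor and a binary recurrence label.
node_status = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "neg"]
recurrence = [1, 1, 0, 0, 0, 0, 0, 0]
print(f"gain ratio: {gain_ratio(node_status, recurrence):.3f}")
```

An attribute that separates the classes perfectly yields a gain ratio of 1, while an attribute with a single value yields 0.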
Model Evaluation Strategy:
Ten-fold cross-validation was used to estimate how well each model generalizes to unseen records and to avoid the optimistic bias of evaluating a model on the data it was trained on.
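A minimal sketch of such a 10-fold comparison is given below using scikit-learn. Several things here are assumptions rather than details from the study: the data are synthetic (generated only to match the stated 547 records and 22 predictors, with an arbitrary class balance), the SVM kernel and the hidden-layer size are illustrative choices, and scikit-learn's `DecisionTreeClassifier` implements CART with an entropy criterion, which stands in for C4.5 but is not the same algorithm.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data (assumption): 547 records, 22 predictors.
X, y = make_classification(
    n_samples=547, n_features=22, n_informative=8,
    weights=[0.7, 0.3], random_state=0,
)

# Illustrative model settings (assumptions, not the study's configuration).
models = {
    "Decision tree (CART with entropy, not C4.5)": DecisionTreeClassifier(
        criterion="entropy", random_state=0
    ),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "ANN (one hidden layer)": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

# Stratified folds keep the recurrence rate similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = {}
for name, model in models.items():
    fold_accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    scores[name] = fold_accuracy
    print(f"{name}: mean accuracy {fold_accuracy.mean():.3f} over {len(fold_accuracy)} folds")
```

Placing the scaler inside each pipeline means it is refit on the training folds only, so no information from the held-out fold leaks into preprocessing.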
Models were evaluated based on three key performance metrics:
• Accuracy: Overall correctness of predictions.
• Sensitivity (Recall): Ability to correctly identify true recurrence cases.
• Specificity: Ability to correctly identify non-recurrence cases.
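The three metrics above follow directly from the confusion-matrix counts. The sketch below shows one way to compute them, treating recurrence as the positive class; the function name and the example labels are hypothetical and serve only to illustrate the definitions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # recurrence correctly predicted
        elif truth == positive:
            fn += 1  # recurrence missed
        elif pred == positive:
            fp += 1  # non-recurrence flagged as recurrence
        else:
            tn += 1  # non-recurrence correctly predicted

    def ratio(numerator, denominator):
        # Undefined when a class is absent from y_true.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical example: 3 recurrence cases and 7 non-recurrence cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

When recurrence is the minority class, accuracy alone can look high even if many recurrence cases are missed, which is why sensitivity and specificity are reported alongside it.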
Results: The findings revealed that the Support Vector Machine (SVM) model achieved the highest predictive accuracy of 95.7%, along with the strongest sensitivity and specificity values.
The Artificial Neural Network (ANN) model followed closely with an accuracy of 94.7%, demonstrating its capability to effectively model nonlinear patterns in clinical data.
The Decision Tree (C4.5) algorithm achieved a slightly lower accuracy of 93.6%, but its transparency and ease of interpretation make it a practical tool for clinicians.
Model Comparison Summary:
Model | Accuracy | Sensitivity | Specificity
SVM | 95.7% | Highest | Highest
Neural Network (ANN) | 94.7% | High | High
Decision Tree (C4.5) | 93.6% | Moderate | Moderate
Conclusion: This study indicates that machine learning algorithms, especially SVM and ANN, can predict breast cancer recurrence within two years after treatment with high accuracy on this dataset (93.6% to 95.7% under 10-fold cross-validation). These models can serve as valuable tools to support clinical decision-making by identifying high-risk patients early, allowing for personalized follow-up and intervention strategies. Despite their slightly lower accuracy, decision trees remain useful because of their interpretability, which makes them suitable for clinical environments where transparency is essential.
Keywords: Breast cancer, recurrence prediction, machine learning, data mining, support vector machine, neural