Predicting Cervical Cancer from Risk Factors
Like any cancer, cervical cancer has risk factors that increase a person's chances of developing it. Some cervical cancer risk factors are HPV infection, smoking, sexual history, a weakened immune system, infection with other STDs, long-term use of oral contraceptives, and having many full-term pregnancies.
A dataset collected by Hospital Universitario de Caracas in Caracas, Venezuela, contains data on these risk factors from 858 women. I used this dataset to try to predict cervical cancer. Biopsy results determined whether a patient had cancer, so I built a model that predicts the biopsy result from the risk factors.
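Loading the data is straightforward with pandas. This is a minimal sketch assuming the UCI copy of the dataset, where missing values are written as "?" and the file carries the name below; the column names are the ones UCI distributes:

```python
import pandas as pd

# UCI's copy of the dataset marks missing values with "?" (file name assumed).
df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")

# Biopsy is the target; the file also carries three other screening results
# (Hinselmann, Schiller, Citology) that I drop so they can't leak the answer.
X = df.drop(columns=["Hinselmann", "Schiller", "Citology", "Biopsy"])
y = df["Biopsy"]
print(X.shape, y.mean())  # 858 rows; y.mean() is the fraction of positive biopsies
```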
Similar to other medical datasets, this one is imbalanced: a large majority of the women did not have cancer, while a very small minority did. This makes modeling more difficult, but I decided to give it a try anyway.
Because of the imbalanced data, my baseline accuracy was pretty high: 93.63%. I also used precision and recall as metrics; at baseline, both were 0. My goal was to build a model with higher accuracy, precision, and recall than the baseline.
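The baseline is just a majority-class guess. A sketch of reproducing it with scikit-learn's dummy classifier, where the split settings are assumptions:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Stratify so the small positive class is split proportionally.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Always predicting "no cancer" reproduces the baseline numbers.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))                    # the majority-class rate
print(precision_score(y_train, pred, zero_division=0))  # 0.0: no positives predicted
print(recall_score(y_train, pred, zero_division=0))     # 0.0: no positives caught
```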
I created three different models: a logistic regression, a RandomForest classifier, and an XGB classifier. I tuned the hyperparameters of each one, leaving me with a total of six models to compare. Checking the metrics, two models seemed to be doing a bit better than the rest: the tuned RandomForest classifier and the tuned XGB classifier. The tuned RandomForest classifier had an accuracy of 93.92%, precision of 40%, and recall of 11.5%. The tuned XGB classifier had 93.18% accuracy, 33% precision, and 12% recall. Up to this point, I had worked with one subset of the original dataset, the training set. The next step was to compare how each model performed on a subset it had never seen before: the test set.
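For reference, the tuning step for one of these models can be sketched with scikit-learn's RandomizedSearchCV; the imputation strategy, parameter ranges, and search settings below are illustrative assumptions, not my exact setup:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Impute the gaps left by the "?" entries, then fit the forest.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=42),
)
param_dist = {
    "randomforestclassifier__n_estimators": [100, 300, 500],
    "randomforestclassifier__max_depth": [3, 5, 10, None],
    "randomforestclassifier__min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```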
I created an ROC curve to visualize the performance of each model on the test set. As seen in Figure 1, the best-performing model is the RandomForest model, with a ROC AUC score of 0.645. The closer the ROC AUC score is to 1, the better the model performs: a score of 1 means there is a threshold at which the true positive rate is 1 and the false positive rate is 0, i.e., the model classifies perfectly.
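Roughly how this test-set comparison can be produced, reusing the tuned search object from the sketch above; repeating the display call for each of the six models on the same axes yields a plot like Figure 1:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Probability of the positive class on the held-out test set.
proba = search.best_estimator_.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, proba))

# Draw one curve per model on shared axes; shown here for the tuned forest.
ax = plt.gca()
RocCurveDisplay.from_estimator(
    search.best_estimator_, X_test, y_test, name="tuned RandomForest", ax=ax
)
plt.show()
```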
Once I determined that the RandomForest classifier was the best model, I decided to take a closer look at its performance.
I calculated the metric scores for the RandomForest classifier at various cutoff thresholds. Higher cutoff thresholds favor precision and accuracy, while lower ones favor recall. This trend is not very evident in Figure 2, which is a sign that the model has little predictive power. The accuracy never rises above the 93.63% baseline; the highest it reaches is 92.89%. Precision and recall beat their baselines, but only until the cutoff passes 0.35.
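The sweep itself is simple: score the predicted probabilities against a moving cutoff instead of the default 0.5. A sketch, where the 0.05 step size is an assumption:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

proba = search.best_estimator_.predict_proba(X_test)[:, 1]
for cutoff in np.arange(0.05, 1.0, 0.05):
    pred = (proba >= cutoff).astype(int)  # classify positive above the cutoff
    print(f"{cutoff:.2f}  acc={accuracy_score(y_test, pred):.3f}  "
          f"prec={precision_score(y_test, pred, zero_division=0):.3f}  "
          f"rec={recall_score(y_test, pred, zero_division=0):.3f}")
```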
Even though the RandomForest classifier has very little predictive power, I took a look at which features (variables) in the dataset played the largest role in its predictions.
As seen in Figure 3, the RandomForest classifier relies most on age, years sexually active, and years of taking hormonal contraceptives when classifying a case as negative or positive. Using age as a likely indicator of cancer is not far off: according to Cancer Research UK, age is the biggest risk factor.
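The ranking behind Figure 3 can be read straight off the fitted forest's impurity-based importances. A sketch, reusing the tuned pipeline from above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# The forest is the last step of the tuned pipeline.
forest = search.best_estimator_.named_steps["randomforestclassifier"]
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances.nlargest(10).sort_values().plot.barh()
plt.xlabel("feature importance")
plt.show()
```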
I decided to take an even closer look at the inner workings of my model by seeing how it classified two specific cases in the test set. One was a negative case and the other was positive.
Force plots show which features push the classification, how strongly they push it, and in which direction. The more blue there is in a force plot, the more the case is being pushed toward a negative classification. Red is the opposite: the more red, the more likely a positive classification.
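Force plots come from the shap library. A sketch of producing one for a single test case; the return shape of shap_values varies across shap versions, so treat the indexing here as an assumption:

```python
import pandas as pd
import shap

# The explainer needs the raw forest, so apply the pipeline's imputer first.
imputer = search.best_estimator_.named_steps["simpleimputer"]
X_test_imp = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X_test_imp)

# For a binary sklearn forest, older shap returns one array per class;
# index [1] is the positive class, and row 0 is an arbitrary test case.
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X_test_imp.iloc[0])
```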
In Figure 4, there is a lot more blue, making it more likely that the model will classify this case as negative. The case is indeed negative. The number of pregnancies and the number of sexual partners cause the biggest pushes toward a negative classification. Based on what is known about cervical cancer risk, the model is making the right assumptions: fewer pregnancies and fewer sexual partners lower risk.
Figure 5 is predominantly red, meaning the case would be classified as positive. This case is indeed positive. Years sexually active and age provide the biggest red pushes in this force plot: the patient has a long sexual history and is over 50, and cancer becomes increasingly probable from age 50 onward. Again, the model weighs the main pushes well: a longer sexual history and an older age increase cervical cancer risk.
Although my model did not have much predictive power, it is a small step in the right direction. In the future, I would like to try building a more successful model. One idea is to implement the Synthetic Minority Over-sampling Technique (SMOTE), which can be very useful for classification problems with imbalanced data, like the one I tackled in this project.
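A sketch of how SMOTE could slot in, using imbalanced-learn's pipeline so the oversampling happens only during training and never touches the test set:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# SMOTE interpolates between minority-class neighbors to synthesize new
# positive cases; the imblearn pipeline applies it at fit time only.
smote_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    SMOTE(random_state=42),
    RandomForestClassifier(random_state=42),
)
smote_pipe.fit(X_train, y_train)
```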
If you would like to view my model and data, you can find that here.