This project is a part of the ADS-503 course in the Applied Data Science Program at the University of San Diego.
Project status: Active
This project was completed in RStudio using .Rmd and .R files. To reproduce or run this project:
- Clone this repository from GitHub.
- Open the R project or .Rmd file in RStudio.
- Install the required packages (e.g., tidyverse, caret, corrplot, pROC).
- Run the script section by section to explore the data, preprocess, split, and train models.
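As a convenience for the package step above, here is a minimal sketch that installs only the packages not already present. The package vector copies the examples named in this README; the project scripts may load others (for instance an XGBoost package), so treat the list as an assumption and extend it as needed. It is intended for the RStudio console, where a CRAN mirror is already configured.

```r
# Packages named in this README; the scripts may require more (assumption).
required <- c("tidyverse", "caret", "corrplot", "pROC")

# Compare against what is already installed and keep only the gaps.
to_install <- setdiff(required, rownames(installed.packages()))

# Install the missing packages, if any.
if (length(to_install) > 0) {
  install.packages(to_install)
}
```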
The primary objective of this project is to develop a model that predicts the risk of cervical cancer from behavioral, demographic, and clinical risk factors. The work aims to support early identification and preventive care by flagging individuals who may benefit from timely screening.
We use the Cervical Cancer Risk Factors dataset from the UCI Machine Learning Repository, whose features include age, number of sexual partners, contraceptive use, STD history, and smoking.
- Tanya Ortega
- Cynthia Portales-Loebell
- Lei Lin
Each member contributed to data cleaning, modeling, evaluation, and documentation. Final responsibilities will include presenting the results and submitting a technical report and summary slide deck.
- Data Cleaning & Wrangling
- Exploratory Data Analysis (EDA)
- Predictive Modeling
  - Logistic Regression
  - Decision Tree
  - Random Forest
  - XGBoost
- Model Evaluation (AUC, Accuracy, Recall, etc.)
- Data Visualization
- Data Splitting (Train/Test)

- R
- RStudio
- Tidyverse
- ggplot2
- caret
- pROC
We are working with the Cervical Cancer (Risk Factors) Data Set, which contains 858 records and 36 variables. The dataset includes a variety of binary, categorical, and numerical predictors tied to cervical cancer risk.
- Source: UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors
- Target variable: biopsy (1 = positive diagnosis, 0 = negative)
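For readers loading the data themselves, the following is a rough sketch rather than the project's actual loading code. It assumes the raw CSV from the UCI page marks missing entries with `?`, and the helper names `load_risk_factors` and `missing_share` are hypothetical. The file path is left as an argument because the repository's file layout is not described here.

```r
# Hypothetical helper: read the CSV and treat "?" (assumed missing-value
# marker in the raw file) as NA so numeric columns parse as numbers.
load_risk_factors <- function(path) {
  read.csv(path, na.strings = c("?", "NA", ""), check.names = TRUE)
}

# Hypothetical helper: fraction of missing values per column, largest first.
missing_share <- function(df) {
  sort(colMeans(is.na(df)), decreasing = TRUE)
}
```

Calling `missing_share(load_risk_factors(path))` gives a quick view of which columns need imputation or removal. Note that `check.names = TRUE` rewrites headers containing spaces or punctuation into syntactic R names.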
- Perform EDA to understand distributions and variable importance
- Handle missing values and impute where appropriate
- Train classification models to predict positive cervical cancer diagnosis
- Compare model performance using common evaluation metrics
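The objectives above can be sketched end to end in base R. This is an illustrative stand-in, not the project's pipeline: it uses median imputation, a hand-rolled stratified split, `glm` for logistic regression, and a rank-based AUC in place of the caret and pROC calls the project lists. All function names are hypothetical, and the target column name must be supplied by the caller.

```r
# Median imputation for numeric columns (a simple stand-in; the project's
# actual imputation strategy is not described in this README).
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Stratified split: sample the same fraction of row indices within each
# class so the rare positive class appears in both train and test sets.
stratified_split <- function(y, train_frac = 0.8, seed = 503) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = floor(train_frac * length(i)))]
  }))
  sort(unname(idx))
}

# AUC via the rank-sum (Mann-Whitney) identity; y_true must be coded 0/1.
auc_rank <- function(y_true, score) {
  r <- rank(score)
  n_pos <- sum(y_true == 1)
  n_neg <- sum(y_true == 0)
  (sum(r[y_true == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Recall (sensitivity) at a probability threshold.
recall_at <- function(y_true, score, threshold = 0.5) {
  pred <- as.integer(score >= threshold)
  sum(pred == 1 & y_true == 1) / sum(y_true == 1)
}

# Impute, split, fit logistic regression on all predictors, and score the
# held-out rows. `target` is the name of the 0/1 outcome column.
evaluate_logistic <- function(df, target) {
  df <- impute_median(df)
  train_idx <- stratified_split(df[[target]])
  train <- df[train_idx, ]
  test <- df[-train_idx, ]
  model <- glm(as.formula(paste(target, "~ .")), data = train,
               family = binomial())
  prob <- predict(model, newdata = test, type = "response")
  list(model = model,
       auc = auc_rank(test[[target]], prob),
       recall = recall_at(test[[target]], prob))
}
```

A call might look like `evaluate_logistic(df, target = "Biopsy")`, where the column name is an assumption about how the loaded data frame is labeled. Because positive diagnoses are typically a small minority in screening data, recall and AUC are usually more informative than accuracy, and constant or mostly missing columns may need to be dropped before fitting.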
Special thanks to Professor An Tran for guidance and support throughout ADS-503, and to our classmates for their collaboration. Dataset courtesy of the UCI Machine Learning Repository, originally compiled by Hospital Universitario de Caracas.