tanyaort/503Project

503Project

Cervical Cancer Risk Prediction

This project is a part of the ADS-503 course in the Applied Data Science Program at the University of San Diego.

Project Status

Active

Installation

This project was completed in RStudio using .Rmd and .R files. To reproduce or run this project:

Clone this repository (tanyaort/503Project) from GitHub

Open the R project or .Rmd file in RStudio

Install the required packages (e.g., tidyverse, caret, corrplot, pROC)

Run the script section by section to explore the data, preprocess, split, and train models
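The package step above can be checked from the R console before running anything else. This is a minimal sketch that only covers the four packages named in the list; the modeling sections may need others (for example, packages for decision trees, random forests, or XGBoost), which is an assumption since the README does not list them. The install line is left commented out so the snippet has no side effects.

```r
# Packages named in the installation steps above.
required <- c("tidyverse", "caret", "corrplot", "pROC")

# Compare against what is already installed in the current library paths.
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All listed packages are installed.")
}
```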

Project Intro/Objective

The primary objective of this project is to develop a model that predicts cervical cancer risk from behavioral, demographic, and clinical risk factors. The aim is to support early identification and preventive care by flagging individuals who may benefit from timely screening.

We use the Cervical Cancer Risk Factors dataset from the UCI Machine Learning Repository. Its features include age, number of sexual partners, contraceptive use, STD history, and smoking.

Partner(s)/Contributor(s)

  • Tanya Ortega

  • Cynthia Portales-Loebell

  • Lei Lin

Each member contributed to data cleaning, modeling, evaluation, and documentation. Final responsibilities will include presenting the results and submitting a technical report and summary slide deck.

Methods Used

  • Data Cleaning & Wrangling

  • Exploratory Data Analysis (EDA)

  • Predictive Modeling

      • Logistic Regression

      • Decision Tree

      • Random Forest

      • XGBoost

  • Model Evaluation (AUC, Accuracy, Recall, etc.)

  • Data Visualization

  • Data Splitting (Train/Test)
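To make the evaluation metrics in the list above concrete, here is a small base R sketch of what accuracy, recall, and AUC compute for a binary classifier. The labels and probabilities are made-up toy values, and `evaluate_binary` is a hypothetical helper, not a function from this repository. In the project itself, packages such as caret and pROC would normally produce these numbers.

```r
# Toy labels and predicted probabilities (made up for illustration only).
actual <- c(0, 0, 1, 0, 1, 0, 0, 1)
prob   <- c(0.10, 0.40, 0.80, 0.20, 0.35, 0.05, 0.60, 0.90)

# Hypothetical helper: accuracy and recall at a threshold, plus AUC.
evaluate_binary <- function(actual, prob, threshold = 0.5) {
  predicted <- as.integer(prob >= threshold)

  tp <- sum(predicted == 1 & actual == 1)  # true positives
  fn <- sum(predicted == 0 & actual == 1)  # false negatives

  accuracy <- mean(predicted == actual)
  recall   <- tp / (tp + fn)

  # AUC via the rank-sum (Mann-Whitney) identity: the share of
  # positive/negative pairs in which the positive case scores higher.
  r     <- rank(prob)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  c(accuracy = accuracy, recall = recall, auc = auc)
}

metrics <- evaluate_binary(actual, prob)
print(round(metrics, 3))
```

Recall deserves particular attention in a screening setting, because a missed positive case (a false negative) is usually more costly than a false alarm.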

Technologies

  • R

  • RStudio

  • Tidyverse

  • ggplot2

  • caret

  • pROC

Project Description

We are working with the Cervical Cancer (Risk Factors) Data Set, which contains 858 records and 36 variables. The dataset includes a mix of binary, categorical, and numerical predictors tied to cervical cancer risk.

Dataset Source:

Target Variable:

  • biopsy (1 = positive diagnosis, 0 = negative)

Analysis Focus:

  • Perform EDA to understand distributions and variable importance

  • Handle missing values and impute where appropriate

  • Train classification models to predict positive cervical cancer diagnosis

  • Compare model performance using common evaluation metrics
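The analysis steps above can be sketched end to end on toy data. The following base R example is an illustration under stated assumptions, not the project's actual code: the data frame is synthetic, the column names `age` and `smokes` are hypothetical stand-ins, missing values are assumed to be encoded as "?", and mode imputation is just one simple option. The split is written by hand here so the snippet runs without extra packages; caret's `createDataPartition` offers a comparable stratified split.

```r
set.seed(503)  # arbitrary seed so the sketch is reproducible

# Synthetic stand-in for the real data; column names are hypothetical.
n <- 200
toy <- data.frame(
  age    = sample(18:60, n, replace = TRUE),
  smokes = sample(c("0", "1", "?"), n, replace = TRUE, prob = c(0.7, 0.2, 0.1)),
  biopsy = rbinom(n, 1, 0.15),
  stringsAsFactors = FALSE
)

# 1. Treat "?" as missing and convert the column to numeric.
toy$smokes[toy$smokes == "?"] <- NA
toy$smokes <- as.numeric(toy$smokes)

# 2. Impute the binary column with its most frequent value (the mode).
mode_val <- as.numeric(names(which.max(table(toy$smokes))))
toy$smokes[is.na(toy$smokes)] <- mode_val

# 3. Stratified 80/20 split: sample within each class of the target.
pos <- which(toy$biopsy == 1)
neg <- which(toy$biopsy == 0)
train_idx <- c(sample(pos, floor(0.8 * length(pos))),
               sample(neg, floor(0.8 * length(neg))))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# 4. Fit a logistic regression and score the held-out rows.
fit <- glm(biopsy ~ age + smokes, data = train, family = binomial)
test$prob <- predict(fit, newdata = test, type = "response")

# Class balance in each partition should be roughly equal.
print(round(c(train = mean(train$biopsy), test = mean(test$biopsy)), 3))
```

Stratifying on the target keeps the share of positive cases similar in both partitions, which matters when positives are rare.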

License

Acknowledgments

Special thanks to Professor An Tran for guidance and support throughout ADS-503, and to our classmates for their collaboration. Dataset courtesy of the UCI Machine Learning Repository, originally compiled by Hospital Universitario de Caracas.
