
End-to-end applied data science project integrating multiple public education datasets to model and interpret factors influencing student graduation outcomes.

junclemente/ca-early-warning-system

🚀 California School-Level Early Warning System (EWS) for Predicting Graduation Outcomes

A Machine Learning Approach to Identifying At-Risk California Public High Schools Using Public, Non-PII Data

This project is a part of the ADS-599 course in the Applied Data Science Program at the University of San Diego.

Project Status: Completed


🎥 Project Presentation (YouTube)

Project Presentation

📄 Whitepaper / Final Report

You can read the full capstone whitepaper here:

➡️ California Early Warning System – Final Whitepaper


📦 Installation

To use this project, first clone the repository to your machine:

git clone https://github.com/junclemente/msads_capstone.git

🧪 Environment Setup

This project uses a conda environment specified in a YAML file for reproducibility and consistent development. Ensure you have Anaconda or Miniconda installed.

Create the Environment
Run the following:

conda env create -f environment.yml

Update the Environment (if needed)
If environment.yml changes, update the existing environment with:

conda env update -f environment.yml --prune

The --prune option removes installed packages that are no longer listed in environment.yml.
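
After creating or updating the environment, a quick import check can confirm that the main libraries are available. The sketch below is illustrative only: the package list is taken from the Technologies section of this README rather than from environment.yml itself, and `missing_packages` is a hypothetical helper, not part of the project.

```python
from importlib.util import find_spec

# Package names taken from the Technologies list in this README.
# The actual contents of environment.yml are not shown here, so this
# list is an assumption; adjust it to match the file.
EXPECTED_PACKAGES = ["pandas", "numpy", "matplotlib", "seaborn", "streamlit"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(EXPECTED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages are importable.")
```

Run it with the conda environment activated; an empty result suggests the environment was built as expected.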

๐Ÿ“ Project Structure

msads_capstone/
โ”œโ”€โ”€ .github/
โ”œโ”€โ”€ app/
โ”œโ”€โ”€ code_library/
โ”œโ”€โ”€ data/
โ”œโ”€โ”€ docs/
โ”œโ”€โ”€ media/
โ”œโ”€โ”€ models/
โ”œโ”€โ”€ CONTRIBUTING.md
โ”œโ”€โ”€ environment.yml
โ”œโ”€โ”€ LICENSE
โ”œโ”€โ”€ main_notebook.md
โ””โ”€โ”€ README.md

📌 Notes on Project Organization

All Jupyter notebooks are located inside the code_library/ folder. This layout was kept intentionally to preserve stable import paths and ensure all notebooks run without modification.

The code_library/ folder contains both reusable Python utilities (helper.py) and all project notebooks for data collection, preparation, exploration, and modeling.

โ–ถ๏ธ How to Run the Streamlit App

๐ŸŒ Run the Web Version

Click the following link to run the web version: http://ca-early-warning-system.streamlit.app

Webapp Key Features

Feature Inputs by ABCS Categories

🎯 School-Level Graduation Risk Prediction
Users can interactively adjust key predictors to instantly estimate whether a school is At Risk or On Track.

📊 Real-Time Model Output
The app automatically displays:

  • Predicted risk category
  • Model confidence / probability

🧮 Interactive Scenario Exploration
Users can simulate “what-if” scenarios such as:

  • What if chronic absenteeism decreases?
  • What if FRPM eligibility drops by 10%?
  • How does the student-to-support-staff ratio impact graduation outcomes?
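
The idea behind a what-if scenario can be sketched in a few lines: score a school once with its current inputs, override one input, and score it again. The example below is a toy stand-in, not the app's model. The logistic form, the coefficient values, and the feature names `chronic_absenteeism_pct` and `frpm_eligible_pct` are all hypothetical and chosen only to make the mechanics visible.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT the
# fitted values from this project's model.
INTERCEPT = -3.0
WEIGHTS = {
    "chronic_absenteeism_pct": 0.06,
    "frpm_eligible_pct": 0.02,
}


def at_risk_probability(features):
    """Toy logistic score: sigmoid of the intercept plus a weighted sum."""
    z = INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def what_if(features, **changes):
    """Return the probability before and after overriding some inputs."""
    scenario = {**features, **changes}
    return at_risk_probability(features), at_risk_probability(scenario)


# Made-up baseline school, then the same school with lower absenteeism.
baseline = {"chronic_absenteeism_pct": 35.0, "frpm_eligible_pct": 70.0}
before, after = what_if(baseline, chronic_absenteeism_pct=25.0)
print(f"before={before:.3f}, after={after:.3f}")
```

Because the toy weight on absenteeism is positive, lowering that input lowers the estimated probability. With the real model the direction and size of the change depend on the fitted model and the other inputs.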

💻 Run locally

  1. Clone this repository.
  2. Create the conda environment.
  3. Activate the conda environment and run the streamlit application:
    conda activate capstone
    streamlit run app/Home.py

🎯 Project Intro / Objective

The main purpose of this project is to develop a school-level Early Warning System (EWS) that identifies California public high schools at risk of low graduation outcomes using only public, non-PII datasets. By leveraging statewide indicators aligned with the ABC framework (Attendance, Behavior, and Course performance), this project demonstrates that actionable early-warning signals can be generated without relying on restricted student-level records.

The goal is to provide California educators, policymakers, and district leaders with a scalable, transparent, and privacy-preserving tool for monitoring emerging risk, understanding systemic inequities, and supporting data-informed planning and resource allocation.

👥 Partner(s)/Contributor(s)

  • Amayrani Balbuena
  • Tanya Ortega
  • Jun Clemente

๐Ÿ› ๏ธ Methods Used

  • Data Cleaning & Preprocessing
  • Exploratory Data Analysis (EDA)
  • Feature Engineering
  • Predictive Modeling & ML Classification
  • Feature Selection using Random Forest
  • Data Visualization

🧰 Technologies

  • Python
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • Jupyter Notebook
  • VS Code
  • Streamlit
  • Conda
  • Git / GitHub

📘 Project Description

This project develops a simplified Early Warning System (EWS) that predicts high-school graduation outcomes using only publicly accessible datasets from the California Department of Education (CDE). Using 2021–22 school-level indicators (graduation rates, chronic absenteeism, FRPM eligibility, teacher experience, and school characteristics) combined with county-level climate data, we built a cleaned modeling dataset of 958 schools and 25 predictors. The binary target labels a school “At Risk” when its graduation rate is below 90%, which yields a 26.3% minority class. Exploratory analysis confirmed strong ABC-aligned relationships between absenteeism, socioeconomic disadvantage, teacher experience, and graduation outcomes. Because of the class imbalance, models were evaluated with PR-AUC, Precision, Recall, and F1. Random Forest and Logistic Regression performed best, with top predictors including chronic absenteeism, unexcused absences, FRPM eligibility, still-enrolled rate, and A–G completion rate. Key challenges included inconsistent county reporting, FERPA-related suppression, and missing climate indicators, although all data were aggregate and fully public.

📊 Dataset Summary

This project integrates multiple publicly accessible, non-PII datasets from the California Department of Education (CDE) and CalSCHLS to build a unified school-level dataset for modeling graduation outcomes. All data represent the 2021–22 school year, except for the CalSCHLS climate data (2017–19), which is the most recent available.

Final Modeling Dataset

  • Total schools: 958 California public high schools
  • Predictor variables: 25 engineered and cleaned features
  • Target variable:
    • At Risk (1): Graduation rate < 90%
    • On Track (0): Graduation rate ≥ 90%
  • Class balance:
    • On Track: 73.7%
    • At Risk: 26.3%
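
The target definition above amounts to a single threshold rule. The sketch below shows that rule and a class-balance calculation in plain Python. The function names and the example graduation rates are made up for illustration; the project's own labeling code may differ.

```python
AT_RISK_THRESHOLD = 90.0  # graduation rate in percent, as defined above


def label_school(graduation_rate):
    """Return 1 (At Risk) if the rate is below 90%, else 0 (On Track)."""
    return 1 if graduation_rate < AT_RISK_THRESHOLD else 0


def class_balance(labels):
    """Return each class's share of all labels, in percent."""
    total = len(labels)
    at_risk = sum(labels)
    return {
        "At Risk": 100.0 * at_risk / total,
        "On Track": 100.0 * (total - at_risk) / total,
    }


# Made-up graduation rates, used only to exercise the functions.
example_rates = [95.2, 88.1, 90.0, 72.4]
example_labels = [label_school(rate) for rate in example_rates]
print(example_labels)
print(class_balance(example_labels))
```

Note that a school at exactly 90% is labeled On Track, matching the ≥ 90% rule. As a rough check, 26.3% of 958 schools corresponds to about 252 At Risk schools.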

๐ŸŒ Raw Data Sources

Below are the official public websites where all raw datasets used in this project can be downloaded:

๐Ÿ† Results

Using the top-15 features selected through Random Forest feature importance, seven classification models were evaluated on a stratified test set (20% split). Performance was compared using the PR-AUC as the primary metric due to class imbalance, with Precision, Recall, and F1-Score also included.
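
A stratified split keeps the At Risk share roughly the same in the training and test sets, which matters when one class is only about a quarter of the data. The README does not show the project's splitting code, so the following is a dependency-free sketch of the idea. The function name `stratified_split`, its parameters, and the toy labels are assumptions for illustration.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Split row indices into train/test lists, preserving class shares."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # Shuffle within each class and take the same fraction from each.
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 30 At Risk (1) and 70 On Track (0).
labels = [1] * 30 + [0] * 70
train_idx, test_idx = stratified_split(labels)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share)
```

In this toy case the test set holds 20 rows and keeps the 30% At Risk share of the full list.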

Model Comparison (PR-AUC)

The highest performing models were:

  • Random Forest - PR-AUC 0.775
  • Logistic Regression - PR-AUC 0.763
  • Naive Bayes - PR-AUC 0.755

Overall Finding

Interpretability of the chosen model was not the highest priority because the predictions do not directly affect individual students. The Random Forest model delivered the strongest balance of precision and recall (F1-Score 0.713) and the highest PR-AUC (0.775), making it the most reliable classifier for identifying schools At Risk of low graduation rates.

Model Comparison Chart

📈 Models Compared

The following models were tested to compare predictive effectiveness under class imbalance, using PR-AUC as the primary evaluation metric.

  • Logistic Regression
  • Naive Bayes
  • Random Forest
  • XGBoost
  • SVM
  • Decision Tree
  • KNN
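
PR-AUC is commonly estimated with average precision: rank the schools by predicted risk, then average the precision observed at each true At Risk school. The README does not say which estimator the project used, so the function below is an assumed, minimal version. It does not treat tied scores specially.

```python
def average_precision(y_true, scores):
    """Average precision: mean of precision at each true-positive rank."""
    # Sort examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positives")

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / total_positives


# Made-up labels and scores: positives land at ranks 1 and 3.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Unlike accuracy, this metric only rewards ranking the minority class well. A ranker with no skill scores roughly the At Risk share of the data (about 0.263 here), which gives a baseline for reading the PR-AUC values in this section.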

Summary of model performance:

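| Model               | Precision | Recall | F1-Score | PR-AUC |
|---------------------|-----------|--------|----------|--------|
| Random Forest       | 0.720     | 0.706  | 0.713    | 0.775  |
| Logistic Regression | 0.549     | 0.765  | 0.639    | 0.763  |
| Naive Bayes         | 0.547     | 0.686  | 0.609    | 0.755  |
| XGBoost             | 0.702     | 0.647  | 0.673    | 0.707  |
| Decision Tree       | 0.500     | 0.725  | 0.592    | 0.548  |
| SVM                 | 1.000     | 0.059  | 0.111    | 0.533  |
| KNN                 | 0.517     | 0.294  | 0.375    | 0.397  |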
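
The F1-Score column is the harmonic mean of precision and recall, so it can be recomputed from the other two columns as a consistency check. The short script below does that using the values from the table above; `f1_score` here is a local helper written for this example, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Random Forest": (0.720, 0.706, 0.713),
    "Logistic Regression": (0.549, 0.765, 0.639),
    "Naive Bayes": (0.547, 0.686, 0.609),
    "XGBoost": (0.702, 0.647, 0.673),
    "Decision Tree": (0.500, 0.725, 0.592),
    "SVM": (1.000, 0.059, 0.111),
    "KNN": (0.517, 0.294, 0.375),
}

for model, (precision, recall, f1) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{model}: recomputed={recomputed:.3f}, reported={f1:.3f}")
```

The SVM row shows why precision alone is misleading here: its precision is perfect, but with recall of 0.059 the harmonic mean drops to about 0.111.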

📄 License

This project is licensed under the MIT License.

๐Ÿ™ Acknowledgments

We thank the University of San Diegoโ€™s Applied Data Science faculty for their support and feedback throughout the ADS-599 Capstone. We also acknowledge the California Department of Education (CDE) for providing publicly accessible datasets on graduation outcomes, absenteeism, staffing, and school demographics, as well as the CalSCHLS/WestEd teams for making county-level school climate data publicly available. Their commitment to open data enabled us to build a fully reproducible, school-level Early Warning System.

We also appreciate the collaborative contributions of our teammatesโ€”Amayrani Balbuena, Tanya Ortega, and Jun Clementeโ€”in data collection, analysis, modeling, and application development.

๐Ÿค– AI Assistance Disclosure

Portions of this project, including selected code snippets, debugging suggestions, and explanatory text, were developed with the assistance of ChatGPT by OpenAI. The authors used AI tools to accelerate brainstorming, refine documentation, and troubleshoot code behavior.

All AI-generated material was manually reviewed, tested, and edited by the authors to ensure correctness, accuracy, and alignment with the project requirements.

📚 References

Austin, G., Hanson, T., Bala, N., & Zheng, C. (2023). Student engagement and well-being in California, 2019-21: Results of the Eighteenth Biennial State California Healthy Kids Survey, Grades 7, 9, and 11. WestEd. https://data.calschls.org/resources/18th_Biennial_State_1921.pdf

California Department of Education. (n.d.). Retrieved October 26, 2025, from https://www.cde.ca.gov/

Chen, T., Wanberg, R. C., Gouioa, E. T., Brown, M. J. S., Chen, J. C.-Y., & Kurt Kraiger, J. J. (2019). Engaging parents Involvement in K – 12 Online Learning Settings: Are We Meeting the Needs of Underserved Students? Journal of E-Learning and Knowledge Society, Vol 15 No 2 (2019): Journal of eLearning and Knowledge Society. https://doi.org/10.20368/1971-8829/1563

Cobb, C. D. (2020). Geospatial Analysis: A New Window Into Educational Equity, Access, and Opportunity. Review of Research in Education, 44(1), 97–129. https://doi.org/10.3102/0091732X20907362

Rumberger, R., Addis, H., Allensworth, E., Balfanz, R., Bruch, J., Dillon, E., Duardo, D., Dynarski, M., Furgeson, J., Jayanthi, M., Newman-Gonchar, R., Place, K., & Tuttle, C. (2017). Preventing Dropout in Secondary Schools (No. NCEE 2017-4028). National Center for Education Evaluation and Regional Assistance (NCEE), Institute of Education Sciences, U.S. Department of Education. https://whatworks.ed.gov

Sava, S., Bunoiu, M., & Malita, L. (2017). Ways to Improve Students’ Decision for Academic Studies. Acta Didactica Napocensia, 10(4), 109–120. https://doi.org/10.24193/adn.10.4.11

Siegle, D., Gubbins, E. J., O’Rourke, P., Langley, S. D., Mun, R. U., Luria, S. R., Little, C. A., McCoach, D. B., Knupp, T., Callahan, C. M., & Plucker, J. A. (2016). Barriers to Underserved Students’ Participation in Gifted Programs and Possible Solutions. Journal for the Education of the Gifted, 39(2), 103–131. https://doi.org/10.1177/0162353216640930

The California School Climate, Health, and Learning Survey (CalSCHLS) System: Home. (n.d.). Retrieved October 26, 2025, from https://calschls.org/
