A Machine Learning Approach to Identifying At-Risk California Public High Schools Using Public, Non-PII Data
This project is a part of the ADS-599 course in the Applied Data Science Program at the University of San Diego.
-- Project Status: Completed
You can read the full capstone whitepaper here:
โก๏ธ California Early Warning System โ Final Whitepaper
To use this project, first clone the repo on your device using the command below:
git init
git clone https://github.com/junclemente/msads_capstone.gitThis project uses a conda environment specified in a YAML file for reproducibility and consistent development. Ensure you have Anaconda or Miniconda installed.
Create the Environment
Run the following:
conda env create -f environment.ymlUpdate the Environment (if needed)
If there are any updates to the environment, you can update the environment with the following:
conda env update -f environment.yml --pruneThe --prune option cleans the environment by removing packages that are no longer required.
msads_capstone/
โโโ .github/
โโโ app/
โโโ code_library/
โโโ data/
โโโ docs/
โโโ media/
โโโ models/
โโโ CONTRIBUTING.md
โโโ environment.yml
โโโ LICENSE
โโโ main_notebook.md
โโโ README.md
All Jupyter notebooks are located inside the
code_library/folder. This was intentionally kept in place to preserve stable import paths and ensure all notebooks run without modification.The
code_library/folder contains both reusable Python utilities (helper.py) and all project notebooks for data collection, preparation, exploration, and modeling.
Click the following link to run the web version: http://ca-early-warning-system.streamlit.app
๐ฏ School-Level Graduation Risk Prediction Users can interactively adjust key predictors to instantly estimate whether a school is At Risk or On Track.
๐ Real-Time Model Output
The app automatically displays:
- Predicted risk category
- Model confidence / probability
๐งฎ Interactive Scenario Exploration Users can simulate โwhat-ifโ scenarios such as:
- What if chronic absenteeism decreases?
- What if FRPM eligibility drops by 10%?
- How does the student-to-support-staff ratio impact graduation outcomes?
- Clone this repository.
- Create the conda environment.
- Activate the conda environment and run the streamlit application:
conda activate capstone streamlit run app/Home.py
The main purpose of this project is to develop a school-level Early Warning System (EWS) that identifies California public high schools at risk of low graduation outcomes using only public, non-PII datasets. By leveraging statewide indicators aligned with the ABC frameworkโAttendance, Behavior, and Course performanceโthis project demonstrates that actionable early-warning signals can be generated without relying on restricted student-level records.
The goal is to provide California educators, policymakers, and district leaders with a scalable, transparent, and privacy-preserving tool for monitoring emerging risk, understanding systemic inequities, and supporting data-informed planning and resource allocation.
|
|
|
|
This project develops a simplified Early Warning System (EWS) that predicts high-school graduation outcomes using only publicly accessible datasets from the California Department of Education (CDE). Using 2021โ22 school-level indicatorsโsuch as graduation rates, chronic absenteeism, FRPM eligibility, teacher experience, and school characteristicsโcombined with county-level climate data, we built a cleaned modeling dataset of 958 schools and 25 predictors. A binary target (โAt Riskโ < 90% graduation rate) highlighted a 26.3% minority class, and analysis confirmed strong ABC-aligned patterns between absenteeism, socioeconomic disadvantage, teacher experience, and graduation outcomes. Multiple machine learning models were evaluated with PR-AUC, Precision, Recall, and F1 due to class imbalance; Random Forest and Logistic Regression performed best, with top predictors including chronic absenteeism, unexcused absences, FRPM eligibility, still-enrolled rate, and AโG completion rate. Key challenges included inconsistent county reporting, FERPA-related suppression, and missing climate indicators, though all data were aggregate and fully public.
This project integrates multiple publicly accessible, non-PII datasets from the California Department of Education (CDE) and CalSCHLS to build a unified school-level dataset for modeling graduation outcomes. All data represent the 2021โ22 school year, except for the CalSCHLS climate data (2017โ19), which is the most recent available.
- Total schools: 958 California public high schools
- Predictor variables: 25 engineered and cleaned features
- Target variable:
- At Risk (1): Graduation rate < 90%
- On Track (0): Graduation rate โฅ 90%
- Class balance:
- On Track: 73.7%
- At Risk: 26.3%
Below are the official public websites where all raw datasets used in this project can be downloaded:
|
|
Using the top-15 features selected through Random Forest feature importance, seven classification models were evaluated on a stratified test set (20% split). Performance was compared using the PR-AUC as the primary metric due to class imbalance, with Precision, Recall, and F1-Score also included.
Model Comparison (PR-AUC)
The highest performing models were:
- Random Forest - PR-AUC 0.775
- Logistic Regression - PR-AUC 0.763
- Naive Bayes - PR-AUC 0.755
Overall Finding
Interpretability of the chosen model was not the highest priority since the predictions do not directly impact individual students. The Random Forest model delivered the strongest balance of precision and recall and the highest PR-AUC, making it the most reliable classifier for identifying schools At Risk of low graduation rates.
The following models were tested to compare predictive effectiveness under class imbalance conditions using PR-AUC as the primary evaluatoin metric.
- Logistic Regression
- Naive Bayes
- Random Forest
- XGBoost
- SVM
- Decision Tree
- KNN
Summary of model performance:
| Model | Precision | Recall | F1-Score | PR-AUC |
|---|---|---|---|---|
| Random Forest | 0.720 | 0.706 | 0.713 | 0.775 |
| Logistic Regression | 0.549 | 0.765 | 0.639 | 0.763 |
| Naive Bayes | 0.547 | 0.686 | 0.609 | 0.755 |
| XGBoost | 0.702 | 0.647 | 0.673 | 0.707 |
| Decision Tree | 0.500 | 0.725 | 0.592 | 0.548 |
| SVM | 1.000 | 0.059 | 0.111 | 0.533 |
| KNN | 0.517 | 0.294 | 0.375 | 0.397 |
This project is licensed under the MIT License.
We thank the University of San Diegoโs Applied Data Science faculty for their support and feedback throughout the ADS-599 Capstone. We also acknowledge the California Department of Education (CDE) for providing publicly accessible datasets on graduation outcomes, absenteeism, staffing, and school demographics, as well as the CalSCHLS/WestEd teams for making county-level school climate data publicly available. Their commitment to open data enabled us to build a fully reproducible, school-level Early Warning System.
We also appreciate the collaborative contributions of our teammatesโAmayrani Balbuena, Tanya Ortega, and Jun Clementeโin data collection, analysis, modeling, and application development.
Portions of this project, including selected code snippets, debugging suggestions, and explanatory text, were developed with the assistance of ChatGPT by OpenAI. The authors used AI tools to accelerate brainstorming, refine documentation, and troubleshoot code behavior.
All AI-generated material was manually reviewed, tested, and edited by the authors to ensure correctness, accuracy, and alignment with the project requirements.
Austin, G., Hanson, T., Bala, N., & Zheng, C. (2023). Student engagement and well-being in California, 2019-21: Results of the Eighteenth Biennial State California Healthy Kids Survey, Grades 7, 9, and 11. WestEd. https://data.calschls.org/resources/18th_Biennial_State_1921.pdf
California Department of Education. (n.d.). Retrieved October 26, 2025, from https://www.cde.ca.gov/
Chen, T., Wanberg, R. C., Gouioa, E. T., Brown, M. J. S., Chen, J. C.-Y., & Kurt Kraiger, J. J. (2019). Engaging parents Involvement in K โ 12 Online Learning Settings: Are We Meeting the Needs of Underserved Students? Journal of E-Learning and Knowledge Society, Vol 15 No 2 (2019): Journal of eLearning and Knowledge Society. https://doi.org/10.20368/1971-8829/1563
Cobb, C. D. (2020). Geospatial Analysis: A New Window Into Educational Equity, Access, and Opportunity. Review of Research in Education, 44(1), 97โ129. https://doi.org/10.3102/0091732X20907362
Rumberger, R., Addis, H., Allensworth, E., Balfanz, R., Bruch, J., Dillon, E., Duardo, D., Dynarski, M., Furgeson, J., Jayanthi, M., Newman-Gonchar, R., Place, K., & Tuttle, C. (2017). Preventing Dropout in Secondary Schools (No. NCEE 2017-4028). National Center for Education Evaluation and Regional Assistance (NCEE), Institute of Education Sciences, U.S. Department of Education. https://whatworks.ed.gov
Sava, S., Bunoiu, M., & Malita, L. (2017). Ways to Improve Studentsโ Decision for Academic Studies. Acta Didactica Napocensia, 10(4), 109โ120. https://doi.org/10.24193/adn.10.4.11
Siegle, D., Gubbins, E. J., OโRourke, P., Langley, S. D., Mun, R. U., Luria, S. R., Little, C. A., McCoach, D. B., Knupp, T., Callahan, C. M., & Plucker, J. A. (2016). Barriers to Underserved Studentsโ Participation in Gifted Programs and Possible Solutions. Journal for the Education of the Gifted, 39(2), 103โ131. https://doi.org/10.1177/0162353216640930
The California School Climate, Health, and Learning Survey (CalSCHLS) SystemโHome. (n.d.). Retrieved October 26, 2025, from https://calschls.org/


