A machine learning pipeline for predicting student math performance using PISA (Programme for International Student Assessment) survey data across multiple years.
This group project was developed during HI!CKATHON, the Hi! Paris hackathon. It predicts student math scores from PISA international survey data covering 1.7 million students and 250+ variables. The pipeline handles data from three survey years (2015, 2018, 2022) with year-specific preprocessing and model training.
- Multi-year data processing (2015, 2018, 2022)
- Target encoding for country-level features with K-Fold cross-validation (sketched after this list)
- KNN imputation for age-related missing values
- Automated feature selection based on model importance
- Support for FLAML AutoML and manual hyperparameter tuning
- Feature importance analysis across years
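A minimal sketch of the out-of-fold target encoding idea. The column names (`CNT`, `math_score`), fold count, and smoothing constant are assumptions for illustration, not the project's exact settings:

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col="CNT", target="math_score",
                        n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold target encoding with additive smoothing.

    Each row is encoded from statistics computed on the other folds,
    so the encoding never sees the row's own target value.
    """
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        stats = fit.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean (smoothing)
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        encoded.iloc[enc_idx] = (df.iloc[enc_idx][col].map(smooth)
                                 .fillna(global_mean).values)
    return encoded

# df["CNT_encoded"] = kfold_target_encode(df)
```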
├── src/
│   ├── config.py                  # Paths, constants, column definitions
│   ├── data_cleaning/
│   │   ├── cleaning.py            # NaN handling, median/mode imputation
│   │   ├── imputers.py            # KNN imputation, effort-grade imputation
│   │   ├── feature_engineering.py # Score aggregates, ESCS proxy, weighted averages
│   │   └── preprocessing.py       # YearPreprocessor class, full pipeline
│   └── training/
│       ├── models.py              # Model definitions (XGBoost, LightGBM)
│       └── train.py               # Training loop, AutoML, evaluation
├── data/
│   ├── raw/                       # Original CSV files
│   └── processed/                 # Parquet files, preprocessors, top features
├── models/                        # Saved model files (.pkl)
├── main_new.py                    # Entry point with CLI arguments (usage below)
├── feature_selection.py           # Feature importance analysis script
└── predict.py                     # Prediction script for test data
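A hypothetical invocation of the entry points; the flag names below are illustrative and should be checked against the arguments actually defined in `main_new.py`:

```bash
# Train on one survey year, then analyze importances and predict
# (--year / --model are assumed flag names, not verified against the CLI)
python main_new.py --year 2022 --model xgboost
python feature_selection.py --year 2022
python predict.py --year 2022
```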
- Filter data by survey year
- Drop math-related features (to avoid data leakage)
- Binary encoding for categorical variables (ADMINMODE, OECD, gender)
- Target encoding for country (CNT) with smoothing
- KNN imputation for student age (sketched after this list)
- Median imputation for numeric columns, mode for categorical
- Feature engineering: score aggregates, ESCS socioeconomic proxy, weighted question averages
- Drop low-variance and timing columns
- Feature selection using top-N important features
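A sketch of the age-imputation step, assuming scikit-learn's `KNNImputer`; the column names and neighbor count are illustrative placeholders, not the pipeline's actual configuration:

```python
import pandas as pd
from sklearn.impute import KNNImputer

def impute_age(df: pd.DataFrame, age_col: str = "AGE",
               helper_cols: tuple = ("GRADE",), k: int = 5) -> pd.DataFrame:
    """Fill missing ages using the k most similar students.

    Similarity is computed over numeric helper columns (e.g. grade);
    only the age column's NaNs are overwritten.
    """
    cols = [age_col, *helper_cols]
    imputer = KNNImputer(n_neighbors=k)
    imputed = pd.DataFrame(imputer.fit_transform(df[cols]),
                           columns=cols, index=df.index)
    df[age_col] = imputed[age_col]
    return df
```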
- XGBoost Regressor (primary model)
- LightGBM Regressor
- FLAML AutoML (optional, searches across multiple estimators)
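A minimal sketch of the optional FLAML path; the time budget, estimator list, and synthetic data are assumptions for illustration:

```python
from flaml import AutoML
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed PISA features
X, y = make_regression(n_samples=1_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

automl = AutoML()
automl.fit(X_train, y_train,
           task="regression",
           metric="rmse",                       # optimize RMSE directly
           estimator_list=["xgboost", "lgbm"],  # restrict the search space
           time_budget=60)                      # seconds (assumed value)
print(automl.best_estimator, automl.best_config)
y_pred = automl.predict(X_val)
```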
- RMSE (Root Mean Squared Error)
- R-squared score
- 80/20 train-validation split
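How the evaluation might look with scikit-learn; the synthetic data and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)
# 80/20 train-validation split, as used by the pipeline
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

model = XGBRegressor(n_estimators=200)  # illustrative hyperparameters
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE: {rmse:.2f}  R²: {r2_score(y_val, y_pred):.3f}")
```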
- Python
- XGBoost / LightGBM
- Scikit-learn
- FLAML (AutoML)
- Pandas / NumPy