A machine learning pipeline for predicting student math performance using PISA (Programme for International Student Assessment) survey data across multiple years.
This group project was developed during HI!CKATHON, the Hi! Paris hackathon. It predicts student math scores from PISA international survey data covering 1.7 million students and 250+ variables. The pipeline handles data from three survey years (2015, 2018, 2022) with year-specific preprocessing and model training.
- Multi-year data processing (2015, 2018, 2022)
- Target encoding for country-level features with K-Fold cross-validation (sketched after this list)
- KNN imputation for age-related missing values
- Automated feature selection based on model importance
- Support for FLAML AutoML and manual hyperparameter tuning
- Feature importance analysis across years
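A minimal sketch of the out-of-fold target encoding idea. The column names (`CNT`, `math_score`), fold count, and smoothing constant are assumptions for illustration, not the project's exact settings:

```python
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, col="CNT", target="math_score",
                        n_splits=5, smoothing=10.0, seed=42):
    """Out-of-fold target encoding with additive smoothing.

    Each row is encoded from statistics computed on the other folds,
    so the encoding never sees the row's own target value.
    """
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit = df.iloc[fit_idx]
        stats = fit.groupby(col)[target].agg(["mean", "count"])
        # Shrink rare categories toward the global mean (smoothing)
        smooth = ((stats["mean"] * stats["count"] + global_mean * smoothing)
                  / (stats["count"] + smoothing))
        encoded.iloc[enc_idx] = (df.iloc[enc_idx][col].map(smooth)
                                 .fillna(global_mean).values)
    return encoded

# df["CNT_encoded"] = kfold_target_encode(df)
```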
├── src/
│   ├── config.py                  # Paths, constants, column definitions
│   ├── data_cleaning/
│   │   ├── cleaning.py            # NaN handling, median/mode imputation
│   │   ├── imputers.py            # KNN imputation, effort-grade imputation
│   │   ├── feature_engineering.py # Score aggregates, ESCS proxy, weighted averages
│   │   └── preprocessing.py       # YearPreprocessor class, full pipeline
│   └── training/
│       ├── models.py              # Model definitions (XGBoost, LightGBM)
│       └── train.py               # Training loop, AutoML, evaluation
├── data/
│   ├── raw/                       # Original CSV files
│   └── processed/                 # Parquet files, preprocessors, top features
├── models/                        # Saved model files (.pkl)
├── main_new.py                    # Entry point with CLI arguments (usage below)
├── feature_selection.py           # Feature importance analysis script
└── predict.py                     # Prediction script for test data
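A hypothetical invocation of the entry points; the flag names below are illustrative and should be checked against the arguments actually defined in `main_new.py`:

```bash
# Train on one survey year, then analyze importances and predict
# (--year / --model are assumed flag names, not verified against the CLI)
python main_new.py --year 2022 --model xgboost
python feature_selection.py --year 2022
python predict.py --year 2022
```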
- Filter data by survey year
- Drop math-related features (to avoid data leakage)
- Binary encoding for categorical variables (ADMINMODE, OECD, gender)
- Target encoding for country (CNT) with smoothing
- KNN imputation for student age (sketched after this list)
- Median imputation for numeric columns, mode for categorical
- Feature engineering: score aggregates, ESCS socioeconomic proxy, weighted question averages
- Drop low-variance and timing columns
- Feature selection using top-N important features
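A sketch of the age-imputation step, assuming scikit-learn's `KNNImputer`; the column names and neighbor count are illustrative placeholders, not the pipeline's actual configuration:

```python
import pandas as pd
from sklearn.impute import KNNImputer

def impute_age(df: pd.DataFrame, age_col: str = "AGE",
               helper_cols: tuple = ("GRADE",), k: int = 5) -> pd.DataFrame:
    """Fill missing ages using the k most similar students.

    Similarity is computed over numeric helper columns (e.g. grade);
    only the age column's NaNs are overwritten.
    """
    cols = [age_col, *helper_cols]
    imputer = KNNImputer(n_neighbors=k)
    imputed = pd.DataFrame(imputer.fit_transform(df[cols]),
                           columns=cols, index=df.index)
    df[age_col] = imputed[age_col]
    return df
```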
- XGBoost Regressor (primary model)
- LightGBM Regressor
- FLAML AutoML (optional, searches across multiple estimators)
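A minimal sketch of the optional FLAML path; the time budget, estimator list, and synthetic data are assumptions for illustration:

```python
from flaml import AutoML
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed PISA features
X, y = make_regression(n_samples=1_000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42)

automl = AutoML()
automl.fit(X_train, y_train,
           task="regression",
           metric="rmse",                       # optimize RMSE directly
           estimator_list=["xgboost", "lgbm"],  # restrict the search space
           time_budget=60)                      # seconds (assumed value)
print(automl.best_estimator, automl.best_config)
y_pred = automl.predict(X_val)
```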
- RMSE (Root Mean Squared Error)
- R-squared score
- 80/20 train-validation split
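How the evaluation might look with scikit-learn; the synthetic data and hyperparameters are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)
# 80/20 train-validation split, as used by the pipeline
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

model = XGBRegressor(n_estimators=200)  # illustrative hyperparameters
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(f"RMSE: {rmse:.2f}  R²: {r2_score(y_val, y_pred):.3f}")
```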
- Python
- XGBoost / LightGBM
- Scikit-learn
- FLAML (AutoML)
- Pandas / NumPy