danisala664/Hi-ckathon

PISA Student Math Score Prediction

A machine learning pipeline for predicting student math performance using PISA (Programme for International Student Assessment) survey data across multiple years.

Overview

This group project was developed during the Hi! Paris HI!CKATHON hackathon. It predicts student math scores from PISA international survey data covering 1.7 million students and more than 250 variables. The pipeline handles three survey years (2015, 2018, 2022), each with its own preprocessing and model training.

Key Features

  • Multi-year data processing (2015, 2018, 2022)
  • Target encoding for country-level features with K-Fold cross-validation
  • KNN imputation for age-related missing values
  • Automated feature selection based on model importance
  • Support for FLAML AutoML and manual hyperparameter tuning
  • Feature importance analysis across years
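
As an illustration of the K-Fold target encoding feature, here is a minimal sketch, not the repository's actual code. It assumes the encoding is computed out-of-fold (each row is encoded from the other folds only) and uses additive smoothing toward the global mean. The function name `kfold_target_encode`, the `smoothing` parameter, the `math_score` column, and the toy data are hypothetical; only the `CNT` column name comes from this README.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def kfold_target_encode(df, col, target, n_splits=5, smoothing=10.0, seed=0):
    """Return an out-of-fold, smoothed target encoding of df[col].

    Each row is encoded with statistics computed on the other folds only,
    so a row's own target never leaks into its encoded value.
    """
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        prior = fit_part[target].mean()
        stats = fit_part.groupby(col)[target].agg(["mean", "count"])
        # Shrink each category mean toward the prior; small groups shrink more.
        smoothed = (stats["mean"] * stats["count"] + prior * smoothing) / (
            stats["count"] + smoothing
        )
        # Categories unseen in the fitting folds fall back to the prior.
        values = df.iloc[enc_idx][col].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Toy data (hypothetical values); CNT is the country column named in this README.
toy = pd.DataFrame(
    {
        "CNT": ["FRA", "FRA", "FRA", "DEU", "DEU", "DEU", "ITA", "ITA", "ITA", "ITA"],
        "math_score": [500.0, 520.0, 480.0, 510.0, 530.0, 490.0, 470.0, 460.0, 495.0, 505.0],
    }
)
toy["CNT_encoded"] = kfold_target_encode(toy, "CNT", "math_score", n_splits=5)
```

At prediction time there is no target to leak, so a pipeline like this would typically fit one encoding on the full training set and apply it to the test data.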

Project Structure

├── src/
│   ├── config.py                    # Paths, constants, column definitions
│   ├── data_cleaning/
│   │   ├── cleaning.py              # NaN handling, median/mode imputation
│   │   ├── imputers.py              # KNN imputation, effort-grade imputation
│   │   ├── feature_engineering.py   # Score aggregates, ESCS proxy, weighted averages
│   │   └── preprocessing.py         # YearPreprocessor class, full pipeline
│   └── training/
│       ├── models.py                # Model definitions (XGBoost, LightGBM)
│       └── train.py                 # Training loop, AutoML, evaluation
├── data/
│   ├── raw/                         # Original CSV files
│   └── processed/                   # Parquet files, preprocessors, top features
├── models/                          # Saved model files (.pkl)
├── main_new.py                      # Entry point with CLI arguments
├── feature_selection.py             # Feature importance analysis script
└── predict.py                       # Prediction script for test data

Methods

Preprocessing Pipeline

  1. Filter data by survey year
  2. Drop math-related features (to avoid data leakage)
  3. Binary encoding for categorical variables (ADMINMODE, OECD, gender)
  4. Target encoding for country (CNT) with smoothing
  5. KNN imputation for student age
  6. Median imputation for numeric columns, mode for categorical
  7. Feature engineering: score aggregates, ESCS socioeconomic proxy, weighted question averages
  8. Drop low-variance and timing columns
  9. Feature selection using top-N important features
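
Steps 5 and 6 above can be sketched with scikit-learn's `KNNImputer` and pandas. This is an assumption-laden illustration rather than the code in `imputers.py` or `cleaning.py`: the helper `impute_frame`, the column names `AGE`, `GRADE`, `HOURS`, `SCHOOL_TYPE`, and the toy values are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_frame(df, knn_cols, n_neighbors=5):
    """Fill knn_cols with KNN imputation, then median/mode for the rest."""
    out = df.copy()
    # KNN imputation: a missing value is the mean of that column over the
    # nearest rows, with distance measured on the non-missing knn_cols.
    imputer = KNNImputer(n_neighbors=n_neighbors)
    out[knn_cols] = imputer.fit_transform(out[knn_cols])
    for name in out.columns.difference(knn_cols):
        if pd.api.types.is_numeric_dtype(out[name]):
            # Numeric columns: fill with the column median.
            out[name] = out[name].fillna(out[name].median())
        else:
            # Categorical columns: fill with the most frequent value.
            out[name] = out[name].fillna(out[name].mode().iloc[0])
    return out


# Hypothetical toy frame; none of these column names come from the PISA data.
toy = pd.DataFrame(
    {
        "AGE": [15.3, 15.8, np.nan, 16.1, 15.5],
        "GRADE": [9.0, 10.0, 10.0, 10.0, 9.0],
        "HOURS": [2.0, np.nan, 4.0, 6.0, 8.0],
        "SCHOOL_TYPE": ["public", "public", None, "private", "public"],
    }
)
clean = impute_frame(toy, knn_cols=["AGE", "GRADE"], n_neighbors=2)
```

In this toy frame the missing age is filled from the two rows that share the same grade, while `HOURS` receives the column median and `SCHOOL_TYPE` the most frequent category.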

Models

  • XGBoost Regressor (primary model)
  • LightGBM Regressor
  • FLAML AutoML (optional, searches across multiple estimators)
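
Feature selection from model importance can be sketched as follows. To keep the example dependency-light it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for the XGBoost and LightGBM regressors listed above; their scikit-learn wrappers expose the same `feature_importances_` attribute after fitting. The helper `top_n_features` and the synthetic data are hypothetical and are not taken from `feature_selection.py`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def top_n_features(model, feature_names, n):
    """Return the n feature names with the highest fitted importance."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(n).index.tolist()


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 5)), columns=[f"f{i}" for i in range(5)])
# Only f0 and f1 drive the synthetic target; f2..f4 are pure noise.
y = 3.0 * X["f0"] - 2.0 * X["f1"] + rng.normal(scale=0.1, size=400)

# Stand-in model; an XGBoost or LightGBM regressor could be dropped in here.
model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)
selected = top_n_features(model, X.columns, n=2)
```

The selected names can then be used to subset the training frame before the final model is fitted.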

Evaluation

  • RMSE (Root Mean Squared Error)
  • R-squared score
  • 80/20 train-validation split
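
The evaluation protocol above (80/20 split, RMSE, R-squared) can be sketched as below. The helper `evaluate_holdout`, the synthetic data, and the use of `GradientBoostingRegressor` as a stand-in for the project's models are assumptions made for illustration, not the contents of `train.py`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_holdout(model, X, y, seed=0):
    """Fit on 80% of the rows and report RMSE and R-squared on the other 20%."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    # RMSE is the square root of the mean squared error.
    rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
    r2 = float(r2_score(y_val, pred))
    return {"rmse": rmse, "r2": r2, "n_val": len(y_val)}


# Synthetic scores on a PISA-like scale (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 500 + 40 * X[:, 0] - 25 * X[:, 1] + rng.normal(scale=5, size=500)

metrics = evaluate_holdout(GradientBoostingRegressor(random_state=0), X, y)
```

RMSE is expressed in score points, which makes it easy to read against the score scale, while R-squared gives the share of variance explained on the held-out rows.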

Technologies

  • Python
  • XGBoost / LightGBM
  • Scikit-learn
  • FLAML (AutoML)
  • Pandas / NumPy
