This repository documents an academic project focusing on the classification of pulsar candidates from astronomical survey data. The goal is to differentiate true pulsars from spurious signals (Radio Frequency Interference/Noise).
The objective is to accurately perform binary classification on highly imbalanced astronomical data. The methodology includes:
- Z-Normalization: All eight input features are standardized to handle differing scales, means, and variances.
- Comparative Modeling: Four classifier families (MVG, logistic regression, SVM, and GMM) were implemented and compared under the same evaluation protocol.
- Cost-Sensitive Evaluation: The models are evaluated primarily using the Minimum Detection Cost Function (minDCF) across various effective prior probabilities (π), recognizing the high cost associated with missing a true, rare pulsar (a false negative).
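As an illustration of the minDCF metric named in the last bullet, here is a minimal sketch using the usual normalized detection cost definition. The function name `min_dcf`, the unit costs, and the convention that higher scores mean "more pulsar-like" are assumptions made for this example; the project's actual implementation is described in the presentation and may differ.

```python
import numpy as np

def min_dcf(scores, labels, prior, c_fn=1.0, c_fp=1.0):
    """Minimum normalized detection cost over all score thresholds.

    scores: higher means "more pulsar-like" (assumed convention).
    labels: 1 for pulsar, 0 for spurious; both classes must be present.
    prior:  effective prior probability of the pulsar class.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)

    order = np.argsort(scores)  # ascending scores
    s = scores[order]
    y = labels[order]

    # Cut k rejects the k lowest-scoring samples and accepts the rest.
    fn = np.concatenate(([0], np.cumsum(y == 1)))          # pulsars rejected
    fp = n_neg - np.concatenate(([0], np.cumsum(y == 0)))  # spurious accepted

    # A cut between two equal scores is not a real threshold, so skip it.
    valid = np.concatenate(([True], s[1:] > s[:-1], [True]))

    p_fn = fn / n_pos
    p_fp = fp / n_neg
    dcf = prior * c_fn * p_fn + (1.0 - prior) * c_fp * p_fp

    # Normalize by the cost of the best system that ignores the scores.
    dummy = min(prior * c_fn, (1.0 - prior) * c_fp)
    return float(np.min(dcf[valid]) / dummy)
```

With this normalization, a value of 0 means the scores separate the classes perfectly, and a value of 1 means they are no more useful than always predicting the same class.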
The project uses the HTRU2 dataset, which contains statistical features derived from pulsar candidates collected during the High Time Resolution Universe Survey.
| Feature | Description |
|---|---|
| Profile Statistics (4) | Mean, Standard Deviation, Excess Kurtosis, and Skewness of the integrated pulse profile. |
| DM-SNR Curve Statistics (4) | Mean, Standard Deviation, Excess Kurtosis, and Skewness of the DM-SNR curve. |
The dataset is highly imbalanced, with 16,259 spurious examples (RFI/noise) and 1,639 real pulsar examples.
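Because the eight features sit on different scales, the Z-normalization step from the methodology could look roughly like the sketch below. The function name `z_normalize`, the row-per-sample layout, and the choice to estimate the statistics on the training split only are assumptions for illustration, not a description of the project's code. The class counts are the ones quoted above.

```python
import numpy as np

# Class counts quoted in this README.
n_pulsar, n_spurious = 1639, 16259
empirical_prior = n_pulsar / (n_pulsar + n_spurious)  # roughly 0.09

def z_normalize(train, test):
    """Standardize each feature to zero mean and unit variance.

    train, test: arrays of shape (n_samples, n_features).
    Statistics come from the training split only (an assumption here),
    and every feature is assumed to have non-zero variance.
    """
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (test - mean) / std
```

Reusing the training statistics on the test split keeps information from the evaluation data out of the preprocessing.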
The following models were developed and analyzed for their performance, particularly focusing on their efficacy in low-prior conditions (π=0.1):
- Multivariate Gaussian (MVG) Classifiers: Tested various covariance types, including the robust Tied Full Covariance model.
- Linear Logistic Regression (LR): Implemented with different regularization strengths (λ) to control model complexity and overfitting.
- Support Vector Machines (SVM): Utilized a Linear Kernel with balancing techniques and hyperparameter tuning (C) to handle the skewed dataset.
- Gaussian Mixture Models (GMM): Explored models with complex structures (e.g., Full Covariance with 8 components) to capture non-linear class boundaries.
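To make the first model in the list concrete, here is a minimal NumPy sketch of a two-class Gaussian classifier with a tied (shared) full covariance matrix. The function names (`fit_tied_mvg`, `tied_mvg_llr`, `bayes_threshold`) and the row-per-sample data layout are hypothetical choices for this example and are not taken from the project's code.

```python
import numpy as np

def fit_tied_mvg(x, y):
    """Estimate per-class means and one covariance shared by both classes.

    x: (n_samples, n_features) array, y: array of 0/1 labels.
    """
    n_features = x.shape[1]
    means = {}
    cov = np.zeros((n_features, n_features))
    for c in (0, 1):
        xc = x[y == c]
        means[c] = xc.mean(axis=0)
        centered = xc - means[c]
        cov += centered.T @ centered  # within-class scatter of class c
    cov /= x.shape[0]                 # pooled over all samples
    return means, cov

def tied_mvg_llr(x, means, cov):
    """Log-likelihood ratio log p(x | pulsar) - log p(x | spurious).

    With a shared covariance the log-determinant and constant terms
    cancel, leaving only the two Mahalanobis distances.
    """
    prec = np.linalg.inv(cov)
    d1 = x - means[1]
    d0 = x - means[0]
    q1 = np.einsum("ij,jk,ik->i", d1, prec, d1)
    q0 = np.einsum("ij,jk,ik->i", d0, prec, d0)
    return -0.5 * (q1 - q0)

def bayes_threshold(prior):
    """LLR threshold for an effective prior, assuming unit costs."""
    return -np.log(prior / (1.0 - prior))
```

A sample would be labeled a pulsar when its log-likelihood ratio exceeds `bayes_threshold(prior)`. Since the shared covariance makes the ratio linear in the features, this model produces a linear decision boundary, like logistic regression and the linear SVM.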
The table below reports the minDCF of selected configurations of each model at three effective priors (lower is better).
| Model | minDCF (π=0.5) | minDCF (π=0.1) | minDCF (π=0.9) |
|---|---|---|---|
| MVG Tied Full Cov | 0.109 | 0.207 | 0.590 |
| Linear LR (λ=0) | 0.107 | 0.198 | 0.542 |
| **Linear SVM (C=0 bal.)** | 0.104 | 0.197 | 0.530 |
| GMM Full Cov 8 | 0.105 | 0.197 | 0.535 |
- Overall Best Performance: Linear SVM with C=0 (balanced) achieved the lowest minDCF at every tested prior (0.104, 0.197, and 0.530 for π=0.5, 0.1, and 0.9). GMM Full Covariance (8 components) matched it at π=0.1 and stayed within 0.005 at the other two priors.
- Language: Python 3.x
- Core Libraries: numpy, pandas, matplotlib
A complete explanation of the mathematical models, parameter choices, experimental procedures, and conclusions is available in the project's presentation:
pulsar.pdf
This is a completed academic project; however, feedback on the methodology or analysis is welcome.
Project by Gabriele Cassetta

