merhametsize/pulsar


🌌 Pulsar Candidate Classification from Scratch

img1.png img2.png

This repository documents an academic project focusing on the classification of pulsar candidates from astronomical survey data. The goal is to differentiate true pulsars from spurious signals (Radio Frequency Interference/Noise).

🌟 Project Goal & Methodology

The objective is to accurately perform binary classification on highly imbalanced astronomical data. The methodology includes:

  1. Z-Normalization: All eight input features are standardized to handle differing scales, means, and variances.
  2. Comparative Modeling: A wide range of statistical and machine learning classifiers were implemented and rigorously tested.
  3. Cost-Sensitive Evaluation: The models are evaluated primarily using the Minimum Detection Cost Function (minDCF) across various effective prior probabilities (π), recognizing the high cost associated with missing a true, rare pulsar (a false negative).
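
The Z-normalization step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes samples are stored as rows, uses synthetic data in place of HTRU2, and the function name `z_normalize` is hypothetical. Estimating the mean and standard deviation on the training split only, then reusing them on the test split, is a common convention assumed here.

```python
import numpy as np

def z_normalize(train, test):
    """Standardize each feature with statistics estimated on the training split."""
    mu = train.mean(axis=0)       # per-feature mean
    sigma = train.std(axis=0)     # per-feature standard deviation
    return (train - mu) / sigma, (test - mu) / sigma

# Synthetic stand-in for eight features on deliberately different scales.
rng = np.random.default_rng(0)
scales = [1, 5, 10, 0.5, 2, 20, 3, 100]
X_train = rng.normal(loc=50.0, scale=scales, size=(200, 8))
X_test = rng.normal(loc=50.0, scale=scales, size=(50, 8))

Z_train, Z_test = z_normalize(X_train, X_test)
# After normalization, every training feature has zero mean and unit variance.
```

The test split is only approximately standardized, because it is transformed with the training statistics rather than its own.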

📊 Dataset Overview

The project uses the HTRU2 dataset, which contains statistical features derived from pulsar candidates collected during the High Time Resolution Universe Survey.

| Feature | Description |
| --- | --- |
| Profile Statistics (4) | Mean, Standard Deviation, Excess Kurtosis, and Skewness of the integrated pulse profile. |
| DM-SNR Curve Statistics (4) | Mean, Standard Deviation, Excess Kurtosis, and Skewness of the DM-SNR curve. |

The dataset is highly imbalanced, with 16,259 spurious examples (RFI/noise) and 1,639 real pulsar examples.
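
To make the imbalance concrete, the class counts above give the empirical pulsar prior directly. The short calculation below uses only those two counts; the variable names are illustrative.

```python
n_spurious = 16_259   # RFI/noise examples
n_pulsar = 1_639      # real pulsar examples

total = n_spurious + n_pulsar          # 17,898 candidates in total
empirical_prior = n_pulsar / total     # about 0.0916
imbalance_ratio = n_spurious / n_pulsar  # about 9.92 spurious per pulsar

print(f"total={total}, prior={empirical_prior:.4f}, ratio={imbalance_ratio:.2f}")
```

Roughly 9% of the candidates are pulsars, so of the three priors evaluated in this project, π=0.1 is the one closest to the dataset's own class balance.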

🧠 Implemented Classification Models

The following models were developed and analyzed for their performance, particularly focusing on their efficacy in low-prior conditions (π=0.1):

  1. Multi-Variate Gaussian (MVG) Classifiers: Tested various covariance types, including the robust Tied Full Covariance model.
  2. Linear Logistic Regression (LR): Implemented with different regularization strengths (λ) to control model complexity and overfitting.
  3. Support Vector Machines (SVM): Utilized a Linear Kernel with balancing techniques and hyperparameter tuning (C) to handle the skewed dataset.
  4. Gaussian Mixture Models (GMM): Explored models with complex structures (e.g., Full Covariance with 8 components) to capture non-linear class boundaries.
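
As an illustration of the first model family, here is a minimal sketch of a tied full-covariance Gaussian classifier in NumPy. It is not the repository's implementation: the function names `fit_tied_mvg` and `llr_tied_mvg` are hypothetical, the data is synthetic, samples are assumed to be rows, and label 1 is assumed to mean "pulsar". With a shared covariance, the log-likelihood ratio reduces to a linear function of the input.

```python
import numpy as np

def fit_tied_mvg(X, y):
    """Estimate per-class means and one covariance matrix shared by all classes."""
    means = {int(c): X[y == c].mean(axis=0) for c in np.unique(y)}
    d = X.shape[1]
    cov = np.zeros((d, d))
    for c, mu in means.items():
        centered = X[y == c] - mu
        cov += centered.T @ centered   # accumulate within-class scatter
    cov /= X.shape[0]                  # maximum-likelihood tied covariance
    return means, cov

def llr_tied_mvg(X, means, cov):
    """Log-likelihood ratio log p(x|1) - log p(x|0) under the tied model."""
    precision = np.linalg.inv(cov)
    w = precision @ (means[1] - means[0])
    b = -0.5 * (means[1] @ precision @ means[1]
                - means[0] @ precision @ means[0])
    return X @ w + b

# Synthetic, imbalanced two-class data with eight features (not HTRU2).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 8)),
               rng.normal(1.5, 1.0, size=(100, 8))])
y = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])

means, cov = fit_tied_mvg(X, y)
llr = llr_tied_mvg(X, means, cov)
```

Thresholding `llr` at a value chosen for the target prior and costs yields the decision; higher scores favour the pulsar class.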

📈 Key Results and Findings

The table below reports minDCF for one configuration of each model family at three effective priors; lower values are better.

| Model | minDCF (π=0.5) | minDCF (π=0.1) | minDCF (π=0.9) |
| --- | --- | --- | --- |
| MVG Tied Full Cov | 0.109 | 0.207 | 0.590 |
| Linear LR (λ=0) | 0.107 | 0.198 | 0.542 |
| **Linear SVM** (C=0 bal.) | 0.104 | 0.197 | 0.530 |
| GMM Full Cov 8 | 0.105 | 0.197 | 0.535 |

  • Overall Best Performance: Linear SVM with C=0 (balanced) achieved the lowest minDCF at every tested prior, tied at π=0.1 with GMM Full Covariance (8 components), which trailed it by at most 0.005 at the other priors. All four models are within 0.005 of each other at π=0.5, while the spread grows to 0.060 at π=0.9.
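
For readers unfamiliar with the metric, the following is a minimal sketch of how a normalized minDCF can be computed from classifier scores. It is an assumption-laden illustration rather than the project's code: the function name `min_dcf` is hypothetical, label 1 is assumed to mean "pulsar", higher scores are assumed to favour the pulsar class, and both error costs default to 1 so that `prior` plays the role of the effective prior π.

```python
import numpy as np

def min_dcf(scores, labels, prior, c_fn=1.0, c_fp=1.0):
    """Normalized detection cost, minimized over all score thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    # Candidate thresholds: one below every score, plus each distinct score.
    thresholds = np.concatenate(([-np.inf], np.unique(scores)))
    best = np.inf
    for t in thresholds:
        pred = scores > t                                  # predicted pulsars
        p_fn = np.sum(~pred & (labels == 1)) / n_pos       # missed pulsars
        p_fp = np.sum(pred & (labels == 0)) / n_neg        # false alarms
        dcf = prior * c_fn * p_fn + (1 - prior) * c_fp * p_fp
        best = min(best, dcf)
    # Normalize by the cost of the best system that ignores the data.
    return best / min(prior * c_fn, (1 - prior) * c_fp)

# Toy example: the two classes are perfectly separable by a threshold.
toy_scores = np.array([-2.0, -1.0, 1.0, 2.0])
toy_labels = np.array([0, 0, 1, 1])
separable_cost = min_dcf(toy_scores, toy_labels, prior=0.1)  # 0.0
```

A value of 0 means some threshold separates the classes perfectly, while a value of 1 means the scores are no better than always predicting one class. The loop is quadratic in the number of samples, which keeps the sketch readable; a sorted cumulative-sum version would be faster on the full dataset.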

πŸ› οΈ Technology Stack & Requirements

  • Language: Python 3.x
  • Core Libraries: numpy, pandas, matplotlib

📄 Full Analysis

A complete explanation of the mathematical models, parameter choices, experimental procedures, and conclusions is available in the project's presentation:
pulsar.pdf

🤝 Contributing

This is a completed academic project; however, feedback on the methodology or analysis is welcome.
Project by Gabriele Cassetta

About

Pulsar stars detection through machine learning algorithms developed from scratch
