A collection of data science projects covering machine learning, web scraping, and graph network analysis, built with Python, scikit-learn, pandas, NumPy, seaborn, Matplotlib, pyvis, and NetworkX.
| Project | Dataset | Techniques | Key Results |
|---|---|---|---|
| Student Grade Prediction | Moodle Learning Analytics | Classification, Feature Engineering, Ensemble Models | Achieved 73% accuracy with Random Forest |
| Graph Networks | WikiArt Artist-Movement Dataset | Network Analysis, PageRank, Eigenvector Centrality | Identified 2,192 artists & movements |
| Web Scraping & Analytics | Booking.com + Trivago (NYC Hotels) | Selenium, BeautifulSoup, EDA | Scraped 155 hotels with pricing analytics |
Problem Statement:
Built classification models to predict student final grades (0-5 scale) using 47 features from a Moodle learning management system.
Workflow:
- Data Preprocessing:
  - Worked with 107 samples, none with missing values
  - Distribution analysis revealed a bimodal grade distribution (peaks at 0 and 4)
  - Applied a 3-stage feature selection pipeline:
    - Variance Threshold (1% threshold) → Removed 1 low-variance feature
    - Sparsity Filter (70% threshold) → Removed 10 sparse features
    - Correlation Filter (0.95 threshold) → Removed 2 highly correlated features
  - Final feature set: 32 features (13 removed)
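
The three filtering stages above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `filter_features`, the reading of "1% threshold" as a variance of 0.01, the definition of sparsity as the fraction of zero values, and the demo columns are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def filter_features(X, var_threshold=0.01, sparsity_threshold=0.70, corr_threshold=0.95):
    """Apply variance, sparsity, and correlation filters in sequence (hypothetical helper)."""
    # Stage 1: drop columns whose variance does not exceed the threshold.
    vt = VarianceThreshold(threshold=var_threshold)
    vt.fit(X)
    X = X.loc[:, vt.get_support()]

    # Stage 2: drop columns where more than 70% of the values are zero
    # (assumed definition of "sparse").
    zero_fraction = (X == 0).mean()
    X = X.loc[:, zero_fraction <= sparsity_threshold]

    # Stage 3: for each pair with |correlation| above the threshold,
    # drop the column that appears later.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic demo data (not the Moodle dataset).
rng = np.random.default_rng(0)
n = 107
demo = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "constant": np.ones(n),                          # zero variance
    "sparse": np.r_[np.ones(10), np.zeros(n - 10)],  # about 91% zeros
})
demo["a_copy"] = demo["a"] * 2.0                     # perfectly correlated with "a"

kept = filter_features(demo)
print(list(kept.columns))
```

Running the filters in this order means the cheaper checks remove obviously uninformative columns before the correlation matrix is computed.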
- Feature Engineering:
  - SelectKBest (f_classif for Logistic Regression, mutual_info_classif for Random Forest)
  - Top features: `Week5MP2`, `Week7MP3`, `Week3MP1` (assignment scores)
  - Activity statistics (`Stat0-Stat3`) captured engagement patterns
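
A hedged sketch of the per-model selector choice described above, using scikit-learn's `SelectKBest`. The value `k = 10` and the synthetic data are assumptions for illustration; the source does not state how many features were kept for each model.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic stand-in for the filtered feature matrix (not the Moodle data).
X, y = make_classification(
    n_samples=107, n_features=32, n_informative=6, n_classes=3, random_state=0
)

k = 10  # hypothetical number of features to keep

# ANOVA F-test scores suit the linear model.
anova_selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)

# Mutual information can capture non-linear dependence, which fits tree models.
# random_state is fixed so the estimate is reproducible.
mi_score = partial(mutual_info_classif, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k=k).fit(X, y)

X_lr = anova_selector.transform(X)  # features for Logistic Regression
X_rf = mi_selector.transform(X)     # features for Random Forest

print(X_lr.shape, X_rf.shape)
print("ANOVA picks:", anova_selector.get_support(indices=True))
print("MI picks:   ", mi_selector.get_support(indices=True))
```

The two selectors need not agree, so each model may end up with a different feature subset.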
- Modeling:
  - Logistic Regression (L2 regularization, max_iter=1000)
  - Random Forest (100 trees, max_depth=10)
  - 5-fold cross-validation for hyperparameter tuning
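
One way the modeling setup above could look in scikit-learn is sketched below. The model settings come from the list; the parameter grids, the stratified shuffling, and the synthetic three-class data are assumptions, since the source does not say which hyperparameters were tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the selected features (not the Moodle data).
X, y = make_classification(
    n_samples=107, n_features=32, n_informative=6, n_classes=3, random_state=0
)

# L2 is scikit-learn's default penalty for LogisticRegression.
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]},        # hypothetical grid
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0),
        {"min_samples_leaf": [1, 3]},   # hypothetical grid
    ),
}

# 5-fold cross-validation; stratification keeps class proportions per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)

for name, (params, score) in results.items():
    print(f"{name}: best params {params}, mean CV accuracy {score:.3f}")
```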
Results:
| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) |
|---|---|---|---|---|
| Logistic Regression | 72.7% | 73.0% | 73.3% | 67.1% |
| Random Forest | 72.7% | 68.2% | 66.7% | 65.7% |
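
In the table, macro F1 is lower than both macro precision and macro recall. That is possible because macro F1 averages the per-class F1 scores rather than taking the harmonic mean of the two macro averages. The toy labels below are invented purely to illustrate the effect and are unrelated to the project's predictions.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

# Per-class scores: class 0 has high precision but low recall,
# class 1 has lower precision but perfect recall.
per_class = precision_recall_fscore_support(y_true, y_pred, average=None)
print("per-class precision:", per_class[0])
print("per-class recall:   ", per_class[1])
print("per-class F1:       ", per_class[2])

# Macro averaging takes the unweighted mean of each per-class metric.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```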
Problem Statement:
Analyzed artist influence networks using the WikiArt dataset to identify the most influential art movements and artists through graph centrality metrics.
Workflow:
- Data:
  - 2,192 artists connected through influence relationships
  - 75+ art movements (Romanticism, Impressionism, Expressionism, etc.)
  - 2,996 edges representing artist-to-artist and artist-to-movement connections
- Graph Construction:
  - Built bipartite graph (Artists ↔ Movements)
  - Created movement-to-movement network via weighted projection
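
The construction above can be sketched with NetworkX's bipartite utilities. The artist names are placeholders and the memberships are invented; only the movement names come from this README. In `weighted_projected_graph`, the edge weight between two movements is the number of artists they share.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Placeholder artists and invented memberships (not WikiArt data).
artists = ["artist_a", "artist_b", "artist_c", "artist_d"]
movements = ["Romanticism", "Impressionism", "Expressionism"]
memberships = [
    ("artist_a", "Romanticism"),
    ("artist_a", "Impressionism"),
    ("artist_b", "Impressionism"),
    ("artist_b", "Expressionism"),
    ("artist_c", "Impressionism"),
    ("artist_c", "Expressionism"),
    ("artist_d", "Romanticism"),
]

# Bipartite graph: one node set for artists, one for movements.
B = nx.Graph()
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(movements, bipartite=1)
B.add_edges_from(memberships)

# Project onto movements; weight = number of shared artists.
M = bipartite.weighted_projected_graph(B, movements)

# Bipartite density divides by |artists| * |movements|, not by all node pairs.
bipartite_density = bipartite.density(B, movements)
movement_density = nx.density(M)

for u, v, data in M.edges(data=True):
    print(f"{u} - {v}: {data['weight']} shared artist(s)")
print(f"bipartite density={bipartite_density:.4f}, projection density={movement_density:.4f}")
```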
- Centrality Analysis:
  - Degree Centrality: Romanticism (254), Impressionism (209), Expressionism (201)
  - Eigenvector Centrality: Expressionism (0.276), Abstract Art (0.269), Surrealism (0.249)
  - Nationality Distribution: American (524 artists), French (402), Italian (270)
Results:
- Bipartite density: 0.0081 (sparse network indicating specialized movements)
- Movement-movement density: 0.0437
- Top influential artists: Vincent van Gogh, Pablo Picasso, Rembrandt (by PageRank)
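
The metrics reported above map onto standard NetworkX calls, as in the sketch below. Both graphs are small invented examples. The edge direction used for PageRank (an edge points from an artist to the artist who influenced them) is an assumption about how the influence network was encoded.

```python
import networkx as nx

# Invented movement-to-movement graph with shared-artist weights.
M = nx.Graph()
M.add_weighted_edges_from([
    ("Romanticism", "Impressionism", 1),
    ("Impressionism", "Expressionism", 2),
    ("Expressionism", "Surrealism", 1),
    ("Impressionism", "Surrealism", 1),
])

degree = dict(M.degree())  # raw neighbor counts
eigenvector = nx.eigenvector_centrality(M, weight="weight", max_iter=1000)
density = nx.density(M)

# Invented influence graph with placeholder artists.
# Assumed convention: edge u -> v means "u was influenced by v",
# so heavily cited influencers collect PageRank.
influence = nx.DiGraph()
influence.add_edges_from([
    ("artist_a", "artist_b"),
    ("artist_a", "artist_c"),
    ("artist_b", "artist_c"),
    ("artist_d", "artist_c"),
])
pagerank = nx.pagerank(influence, alpha=0.85)

top_movement = max(eigenvector, key=eigenvector.get)
top_artist = max(pagerank, key=pagerank.get)
print("degree:", degree)
print("top movement by eigenvector centrality:", top_movement)
print("top artist by PageRank:", top_artist)
print(f"density: {density:.4f}")
```

Degree counts direct connections only, while eigenvector centrality and PageRank also reward being connected to well-connected nodes, which is why the rankings in the lists above can differ.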
Problem Statement:
Automated hotel data collection from Booking.com and Trivago to compare pricing, ratings, and availability for NYC hotels (check-in: Nov 15, 2025 | check-out: Nov 17, 2025).
Workflow:
- Data Collection:
  - Selenium WebDriver (headless Chrome with stealth mode)
  - BeautifulSoup for HTML parsing
  - Handled dynamic content, popups, and pagination
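
The parsing half of this step might look like the sketch below. The browser automation is omitted: in practice the HTML string would come from Selenium's `driver.page_source` after the page has rendered. The markup, the class names (`hotel-card`, `hotel-name`, and so on), and the helper `parse_hotel_cards` are hypothetical and do not reflect the real structure of Booking.com or Trivago.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a rendered results page.
SAMPLE_HTML = """
<div class="hotel-card">
  <h3 class="hotel-name">Example Hotel A</h3>
  <span class="hotel-price">$210</span>
  <span class="hotel-rating">Very good</span>
</div>
<div class="hotel-card">
  <h3 class="hotel-name">Example Hotel B</h3>
  <span class="hotel-price">$87.50</span>
</div>
"""


def text_or_none(node):
    """Return stripped text for a tag, or None when the tag is missing."""
    return node.get_text(strip=True) if node is not None else None


def parse_hotel_cards(html):
    """Extract one record per hotel card (hypothetical selectors)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.select("div.hotel-card"):
        records.append({
            "name": text_or_none(card.select_one(".hotel-name")),
            "price": text_or_none(card.select_one(".hotel-price")),
            "rating": text_or_none(card.select_one(".hotel-rating")),
        })
    return records


records = parse_hotel_cards(SAMPLE_HTML)
for record in records:
    print(record)
```

Returning `None` for missing fields keeps one record per card, so listings with incomplete data can be handled during preprocessing rather than silently dropped.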
- Preprocessing:
  - Standardized distances (miles → km)
  - Cleaned price strings (`$XXX` → float)
  - Normalized rating descriptions (`Very good` → `very good`)
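
The three cleaning steps above are simple enough to sketch with the standard library. The function names and the handling of thousands separators are assumptions; the source only states the input and output forms.

```python
import re

MILES_TO_KM = 1.609344  # exact definition of the international mile


def parse_price(text):
    """Convert a price string such as '$1,234' to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def miles_to_km(miles):
    """Convert miles to kilometers, rounded to two decimals."""
    return round(miles * MILES_TO_KM, 2)


def normalize_rating(label):
    """Lowercase a rating description and collapse extra whitespace."""
    return " ".join(label.split()).lower()


print(parse_price("$87.50"))
print(parse_price("$1,234"))
print(miles_to_km(4.5))
print(normalize_rating("  Very good "))
```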
- Exploratory Analysis:
  - Price distribution (min: $87.50, max: $488)
  - Distance from downtown (0.3-7.24 km)
  - Rating correlation with reviews
Results:
- Records Scraped: 155 hotels
- Average Price: $210 per stay
- Average Rating: 8.1/10
- Key Finding: Hotels within 1 km of Times Square cost 25% more on average
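
A price premium like the one in the key finding can be computed by splitting hotels at a distance cutoff and comparing group means. The sketch below uses invented rows and hypothetical column names (`price_usd`, `distance_km`); it shows the arithmetic only and does not reproduce the scraped data.

```python
import pandas as pd

# Invented rows for illustration (not the scraped dataset).
hotels = pd.DataFrame({
    "price_usd": [260.0, 240.0, 190.0, 210.0, 200.0],
    "distance_km": [0.3, 0.8, 1.5, 3.2, 7.24],
})

# Split at the 1 km cutoff and compare mean prices.
near = hotels["distance_km"] <= 1.0
mean_near = hotels.loc[near, "price_usd"].mean()
mean_far = hotels.loc[~near, "price_usd"].mean()

# Relative premium of nearby hotels over the rest, in percent.
premium_pct = (mean_near / mean_far - 1.0) * 100
print(f"near: ${mean_near:.2f}, far: ${mean_far:.2f}, premium: {premium_pct:.1f}%")
```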
- Languages: Python 3.12
- Libraries:
  - ML: Scikit-learn, Pandas, NumPy
  - Graph Analysis: NetworkX, Vis.js
  - Web Scraping: Selenium, BeautifulSoup, Requests
  - Visualization: Matplotlib, Seaborn
- Tools: Jupyter Notebook, Git, Google Colab