Data Science Experiments

A collection of data science projects covering machine learning, web scraping, and graph network analysis, built with Python, scikit-learn, pandas, NumPy, seaborn, Matplotlib, PyVis, and NetworkX.

🎯 Project Overview

| Project | Dataset | Techniques | Key Results |
|---|---|---|---|
| Student Grade Prediction | Moodle Learning Analytics | Classification, Feature Engineering, Ensemble Models | Achieved 73% accuracy with Random Forest |
| Graph Networks | WikiArt Artist-Movement Dataset | Network Analysis, PageRank, Eigenvector Centrality | Identified 2,192 artists & movements |
| Web Scraping & Analytics | Booking.com + Trivago (NYC Hotels) | Selenium, BeautifulSoup, EDA | Scraped 155 hotels with pricing analytics |

📊 1. Student Grade Prediction

Problem Statement:
Built classification models to predict student final grades (0-5 scale) using 47 features from a Moodle learning management system.

Workflow:

Data Preprocessing:
- Dataset: 107 samples, no missing values
- Distribution analysis showed a bimodal grade distribution (peaks at grades 0 and 4)
- Applied 3-stage feature selection pipeline:
  1. Variance Threshold (1% threshold) → Removed 1 low-variance feature
  2. Sparsity Filter (70% threshold) → Removed 10 sparse features
  3. Correlation Filter (0.95 threshold) → Removed 2 highly correlated features
- Final feature set: 32 features (13 removed)
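
The three filters above can be sketched as follows. This is a minimal illustration, not the notebook's code: the function name `filter_features` is hypothetical, and it assumes that "1% threshold" means a variance of 0.01, that "sparse" means more than 70% zero values, and that the correlation filter uses absolute Pearson correlation and drops the later column of each correlated pair.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def filter_features(X, var_threshold=0.01, sparsity_threshold=0.70, corr_threshold=0.95):
    """Apply variance, sparsity, and correlation filters in sequence (assumed definitions)."""
    # Stage 1: keep only features whose variance is above the threshold.
    selector = VarianceThreshold(threshold=var_threshold)
    selector.fit(X)
    X = X.loc[:, selector.get_support()]

    # Stage 2: drop features where more than 70% of the values are zero.
    zero_fraction = (X == 0).mean()
    X = X.loc[:, zero_fraction <= sparsity_threshold]

    # Stage 3: look at each pair once (upper triangle) and drop the later
    # column of any pair whose absolute correlation exceeds the threshold.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=to_drop)
```

Running the stages in this order means the correlation matrix is only computed on features that survived the cheaper filters.
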
Feature Engineering:
- SelectKBest (f_classif for Logistic Regression, mutual_info_classif for Random Forest)
- Top features: Week5MP2, Week7MP3, Week3MP1 (assignment scores)
- Activity statistics (Stat0-Stat3) captured engagement patterns
Modeling:
- Logistic Regression (L2 regularization, max_iter=1000)
- Random Forest (100 trees, max_depth=10)
- 5-fold cross-validation for hyperparameter tuning
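
One way to combine the feature scoring, the two models, and 5-fold cross-validation is a scikit-learn `Pipeline` tuned with `GridSearchCV`. The sketch below runs on synthetic data from `make_classification` as a stand-in for the Moodle features; the candidate values of `k` are assumptions, since the notebook's search grid is not listed here. L2 regularization is scikit-learn's default for `LogisticRegression`, so it is not set explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real data: 107 samples, 32 features, 3 classes.
X, y = make_classification(
    n_samples=107, n_features=32, n_informative=8, n_classes=3, random_state=0
)

pipelines = {
    "logistic_regression": Pipeline([
        ("select", SelectKBest(score_func=f_classif)),
        ("model", LogisticRegression(max_iter=1000)),
    ]),
    "random_forest": Pipeline([
        ("select", SelectKBest(score_func=mutual_info_classif)),
        ("model", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)),
    ]),
}

# Hypothetical grid: only the number of selected features is tuned here.
param_grid = {"select__k": [10, 20]}

results = {}
for name, pipeline in pipelines.items():
    search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_["select__k"], search.best_score_)
    print(f"{name}: best k={results[name][0]}, CV accuracy={results[name][1]:.3f}")
```

Keeping `SelectKBest` inside the pipeline means feature scores are recomputed on each training fold, so the held-out fold does not leak into feature selection.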

Results:

| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) |
|---|---|---|---|---|
| Logistic Regression | 72.7% | 73.0% | 73.3% | 67.1% |
| Random Forest | 72.7% | 68.2% | 66.7% | 65.7% |
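
Macro averaging computes each metric per class and then takes the unweighted mean, so every grade counts equally regardless of how many students received it. Macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, which is why it can fall below both (as in the Logistic Regression row). A toy example with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels for three classes; not taken from the project data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Per-class precision is (1, 1, 2/3), recall is (1, 1/2, 1), F1 is (1, 2/3, 4/5).
print(f"accuracy={accuracy:.3f}")    # accuracy=0.833
print(f"precision={precision:.3f}")  # precision=0.889
print(f"recall={recall:.3f}")        # recall=0.833
print(f"f1={f1:.3f}")                # f1=0.822
```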

View Notebook →

🕸️ 2. Graph Networks

Problem Statement:
Analyzed artist influence networks using the WikiArt dataset to identify the most influential art movements and artists through graph centrality metrics.

Workflow:

Data:
- 2,192 artists connected through influence relationships
- 75+ art movements (Romanticism, Impressionism, Expressionism, etc.)
- 2,996 edges representing artist-to-artist and artist-to-movement connections
Graph Construction:
- Built bipartite graph (Artists ↔ Movements)
- Created movement-to-movement network via weighted projection
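
A minimal NetworkX sketch of these two steps, using placeholder artist names rather than WikiArt records. In the weighted projection, the weight of an edge between two movements is the number of artists they share; whether the notebook uses this exact weighting is an assumption.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Placeholder artist-movement memberships (not real WikiArt data).
memberships = [
    ("artist_a", "Romanticism"),
    ("artist_a", "Impressionism"),
    ("artist_b", "Impressionism"),
    ("artist_b", "Expressionism"),
    ("artist_c", "Impressionism"),
    ("artist_c", "Expressionism"),
    ("artist_d", "Romanticism"),
]
artists = sorted({artist for artist, _ in memberships})
movements = sorted({movement for _, movement in memberships})

# Bipartite graph: artists on one side, movements on the other.
B = nx.Graph()
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(movements, bipartite=1)
B.add_edges_from(memberships)

# Project onto movements; edge weight = number of shared artists.
movement_graph = bipartite.weighted_projected_graph(B, movements)
for u, v, data in movement_graph.edges(data=True):
    print(u, v, data["weight"])
```
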
Centrality Analysis:
- Degree Centrality: Romanticism (254), Impressionism (209), Expressionism (201)
- Eigenvector Centrality: Expressionism (0.276), Abstract Art (0.269), Surrealism (0.249)
- Nationality Distribution: American (524 artists), French (402), Italian (270)
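
Degree counts a node's direct connections, while eigenvector centrality also rewards being connected to well-connected nodes, which is why the two rankings above differ. A small sketch on a hand-made movement graph (the edges and weights are illustrative, not WikiArt values):

```python
import networkx as nx

# Illustrative movement-to-movement graph with shared-artist weights.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Impressionism", "Romanticism", 2),
    ("Impressionism", "Expressionism", 2),
    ("Impressionism", "Surrealism", 1),
    ("Expressionism", "Surrealism", 1),
])

# Raw degree: number of neighbours per node.
degree = dict(G.degree())

# Eigenvector centrality via power iteration, using the edge weights.
eigenvector = nx.eigenvector_centrality(G, max_iter=1000, weight="weight")

top_by_degree = max(degree, key=degree.get)
top_by_eigenvector = max(eigenvector, key=eigenvector.get)
print(top_by_degree, top_by_eigenvector)
```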

Results:

Bipartite density: 0.0081 (sparse network indicating specialized movements)
Movement-movement density: 0.0437
Top influential artists: Vincent van Gogh, Pablo Picasso, Rembrandt (by PageRank)
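
For reference, the density of a bipartite graph is its edge count divided by the number of possible artist-movement pairs, and the density of an ordinary undirected graph is its edge count divided by the number of possible node pairs. The sketch below computes both on a toy graph and then runs PageRank on a hypothetical directed influence graph. The edge direction (influenced artist points to influencer) is an assumption chosen so that influential artists accumulate rank; all names are placeholders.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite graph: 4 placeholder artists, 3 movements, 7 edges.
artists = ["artist_a", "artist_b", "artist_c", "artist_d"]
movements = ["Romanticism", "Impressionism", "Expressionism"]
B = nx.Graph()
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(movements, bipartite=1)
B.add_edges_from([
    ("artist_a", "Romanticism"), ("artist_a", "Impressionism"),
    ("artist_b", "Impressionism"), ("artist_b", "Expressionism"),
    ("artist_c", "Impressionism"), ("artist_c", "Expressionism"),
    ("artist_d", "Romanticism"),
])

# Bipartite density: 7 edges / (4 artists * 3 movements).
bipartite_density = bipartite.density(B, movements)

# Movement-to-movement density: 2 edges / 3 possible movement pairs.
projection = bipartite.weighted_projected_graph(B, movements)
projection_density = nx.density(projection)

# Hypothetical influence graph: an edge u -> v means "u was influenced by v".
influence = nx.DiGraph([
    ("artist_b", "artist_a"),
    ("artist_c", "artist_a"),
    ("artist_d", "artist_a"),
    ("artist_d", "artist_b"),
])
ranks = nx.pagerank(influence, alpha=0.85)
top_artist = max(ranks, key=ranks.get)
print(bipartite_density, projection_density, top_artist)
```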

View Notebook →

🌐 3. Web Scraping & Analytics

Problem Statement:
Automated hotel data collection from Booking.com and Trivago to compare pricing, ratings, and availability for NYC hotels (check-in: Nov 15, 2025 | check-out: Nov 17, 2025).

Workflow:

Data Collection:
- Selenium WebDriver (headless Chrome with stealth mode)
- BeautifulSoup for HTML parsing
- Handled dynamic content, popups, and pagination
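
Once Selenium has rendered a page, its HTML (for example `driver.page_source`) can be handed to BeautifulSoup for extraction. The sketch below parses a static snippet instead of a live page so that it runs without a browser. The markup, class names, and the helper names `text_or_none` and `parse_hotels` are hypothetical; the real Booking.com and Trivago selectors differ and change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a rendered search-results page.
SAMPLE_HTML = """
<div class="hotel-card">
  <h3 class="hotel-name">Example Hotel A</h3>
  <span class="price">$210</span>
  <span class="rating">Very good</span>
</div>
<div class="hotel-card">
  <h3 class="hotel-name">Example Hotel B</h3>
  <span class="price">$87.50</span>
</div>
"""


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if the element is missing."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_hotels(html):
    """Extract one dict per hotel card, tolerating missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "name": text_or_none(card, ".hotel-name"),
            "price": text_or_none(card, ".price"),
            "rating": text_or_none(card, ".rating"),
        }
        for card in soup.select("div.hotel-card")
    ]


hotels = parse_hotels(SAMPLE_HTML)
print(hotels)
```
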
Preprocessing:
- Standardized distances (miles → km)
- Cleaned price strings ($XXX → float)
- Normalized rating descriptions (Very good → very good)
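
The three cleaning steps might look like the following. The function names and the accepted input formats (such as "4.5 miles from downtown") are assumptions for illustration; the scraped strings may be shaped differently.

```python
import re

MILES_TO_KM = 1.609344


def parse_price(text):
    """Convert a price string such as '$1,234.50' to a float, or None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def distance_to_km(text):
    """Parse a distance given in miles or kilometres and return kilometres."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(miles?|mi|km)\b", text.lower())
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value if unit == "km" else value * MILES_TO_KM


def normalize_rating(text):
    """Lowercase a rating description and collapse repeated whitespace."""
    return " ".join(text.split()).lower()


print(parse_price("$87.50"))                                  # 87.5
print(round(distance_to_km("4.5 miles from downtown"), 2))    # 7.24
print(normalize_rating("Very  good"))                         # very good
```
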
Exploratory Analysis:
- Price distribution (min: $87.50, max: $488)
- Distance from downtown (0.3-7.24 km)
- Rating correlation with reviews

Results:

Records Scraped: 155 hotels
Average Price: $210 per stay
Average Rating: 8.1/10
Key Finding: Hotels within 1 km of Times Square cost 25% more on average
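
A price premium like the one in the key finding can be computed by splitting hotels at the 1 km mark and comparing mean prices. The rows and column names below are made up to show the arithmetic; they are not the scraped data.

```python
import pandas as pd

# Made-up rows; `price_usd` and `distance_km` are assumed column names.
hotels = pd.DataFrame({
    "price_usd": [250.0, 250.0, 200.0, 200.0],
    "distance_km": [0.3, 0.9, 2.5, 7.24],
})

near_mean = hotels.loc[hotels["distance_km"] <= 1.0, "price_usd"].mean()
far_mean = hotels.loc[hotels["distance_km"] > 1.0, "price_usd"].mean()

# Relative premium of nearby hotels over the rest, in percent.
premium_pct = (near_mean - far_mean) / far_mean * 100
print(f"Hotels within 1 km cost {premium_pct:.0f}% more on average")
```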

View Notebook →

🛠️ Tech Stack

Languages: Python 3.12
Libraries:
- ML: Scikit-learn, Pandas, NumPy
- Graph Analysis: NetworkX, PyVis (built on vis.js)
- Web Scraping: Selenium, BeautifulSoup, Requests
- Visualization: Matplotlib, Seaborn
Tools: Jupyter Notebook, Git, Google Colab
