rajshreeee/data-science-experiments

Data Science Experiments

A collection of data science projects covering machine learning, web scraping, and graph network analysis, built with Python, scikit-learn, pandas, NumPy, seaborn, Matplotlib, PyVis, and NetworkX.


🎯 Project Overview

| Project | Dataset | Techniques | Key Results |
| --- | --- | --- | --- |
| Student Grade Prediction | Moodle Learning Analytics | Classification, Feature Engineering, Ensemble Models | Achieved 73% accuracy with Random Forest |
| Graph Networks | WikiArt Artist-Movement Dataset | Network Analysis, PageRank, Eigenvector Centrality | Identified 2,192 artists & movements |
| Web Scraping & Analytics | Booking.com + Trivago (NYC Hotels) | Selenium, BeautifulSoup, EDA | Scraped 155 hotels with pricing analytics |

📊 1. Student Grade Prediction

Problem Statement:
Built classification models to predict student final grades (0-5 scale) using 47 features from a Moodle learning management system.

Workflow:

  • Data Preprocessing:

    • Dataset of 107 samples with no missing values
    • Distribution analysis revealed a bimodal grade distribution (peaks at 0 and 4)
    • Applied 3-stage feature selection pipeline:
      1. Variance Threshold (1% threshold) → Removed 1 low-variance feature
      2. Sparsity Filter (70% threshold) → Removed 10 sparse features
      3. Correlation Filter (0.95 threshold) → Removed 2 highly correlated features
    • Final feature set: 32 features (13 removed)
  • Feature Engineering:

    • SelectKBest (f_classif for Logistic Regression, mutual_info_classif for Random Forest)
    • Top features: Week5MP2, Week7MP3, Week3MP1 (assignment scores)
    • Activity statistics (Stat0-Stat3) captured engagement patterns
  • Modeling:

    • Logistic Regression (L2 regularization, max_iter=1000)
    • Random Forest (100 trees, max_depth=10)
    • 5-fold cross-validation for hyperparameter tuning
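The filtering stages and the two models above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `filter_features` helper, the reading of "sparsity" as the fraction of zero values, the use of the 1% threshold as a raw variance cutoff, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def filter_features(X, var_threshold=0.01, sparsity_threshold=0.70, corr_threshold=0.95):
    """Hypothetical helper: variance, sparsity, and correlation filters in sequence."""
    # Stage 1: drop near-constant columns (variance at or below the threshold).
    selector = VarianceThreshold(threshold=var_threshold).fit(X)
    X = X.loc[:, selector.get_support()]

    # Stage 2 (assumed definition of sparsity): drop columns that are mostly zeros.
    zero_fraction = (X == 0).mean()
    X = X.loc[:, zero_fraction <= sparsity_threshold]

    # Stage 3: for each pair with |correlation| above the cutoff, drop the later column.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return X.drop(columns=redundant)


# Synthetic stand-in data: 107 rows as in the dataset, but made-up columns and labels.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(107, 4)), columns=["a", "b", "c", "d"])
demo["constant"] = 1.0                 # removed by the variance filter
demo["mostly_zero"] = 0.0
demo.loc[:9, "mostly_zero"] = 5.0      # about 91% zeros, removed by the sparsity filter
demo["a_copy"] = demo["a"] * 2.0       # perfectly correlated with "a"
labels = rng.integers(0, 2, size=107)

kept = filter_features(demo)
print(list(kept.columns))

# Model settings taken from the list above; L2 is scikit-learn's default penalty.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0),
}
cv_scores = {name: cross_val_score(model, kept, labels, cv=5) for name, model in models.items()}
for name, scores in cv_scores.items():
    print(name, round(scores.mean(), 3))
```

On the synthetic frame only the four informative columns should survive the filters. The labels here are random, so the cross-validation scores say nothing about the real dataset.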

Results:

| Model | Accuracy | Precision (Macro) | Recall (Macro) | F1 (Macro) |
| --- | --- | --- | --- | --- |
| Logistic Regression | 72.7% | 73.0% | 73.3% | 67.1% |
| Random Forest | 72.7% | 68.2% | 66.7% | 65.7% |
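Macro-averaged metrics are computed per class and then averaged with equal weight, which is why a macro F1 can fall below both macro precision and macro recall, as in the Logistic Regression row. A small sketch with made-up labels (not the project's predictions) shows the effect using scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up grades for illustration only.
y_true = [0, 0, 4, 4, 4, 5]
y_pred = [0, 4, 4, 4, 5, 5]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# Per-class F1 is 2/3 for every class here, so macro F1 is 2/3,
# while macro precision and macro recall both come out at 13/18.
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```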

View Notebook →


πŸ•ΈοΈ 2. Graph Networks

Problem Statement:
Analyzed artist influence networks using WikiArt dataset to identify the most influential art movements and artists through graph centrality metrics.

Workflow:

  • Data:

    • 2,192 artists connected through influence relationships
    • 75+ art movements (Romanticism, Impressionism, Expressionism, etc.)
    • 2,996 edges representing artist-to-artist and artist-to-movement connections
  • Graph Construction:

    • Built bipartite graph (Artists ↔ Movements)
    • Created movement-to-movement network via weighted projection
  • Centrality Analysis:

    • Degree Centrality: Romanticism (254), Impressionism (209), Expressionism (201)
    • Eigenvector Centrality: Expressionism (0.276), Abstract Art (0.269), Surrealism (0.249)
    • Nationality Distribution: American (524 artists), French (402), Italian (270)
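The graph construction and centrality steps above might look like this in NetworkX. The edge list is a toy example with hypothetical artist labels, not the WikiArt data, and the variable names are illustrative.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy artist-movement edges; the artist labels are placeholders.
edges = [
    ("artist_1", "Romanticism"),
    ("artist_2", "Romanticism"),
    ("artist_2", "Impressionism"),
    ("artist_3", "Impressionism"),
    ("artist_3", "Expressionism"),
    ("artist_4", "Impressionism"),
    ("artist_4", "Expressionism"),
]
artists = {a for a, _ in edges}
movements = {m for _, m in edges}

# Bipartite graph: artists on one side, movements on the other.
B = nx.Graph()
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(movements, bipartite=1)
B.add_edges_from(edges)

# Movement-to-movement network: edge weight = number of shared artists.
M = bipartite.weighted_projected_graph(B, movements)

# Degree of each movement in the bipartite graph (number of linked artists).
movement_degree = dict(B.degree(movements))

# Eigenvector centrality on the weighted projection.
eigen = nx.eigenvector_centrality(M, weight="weight", max_iter=1000)

for name in sorted(movements, key=eigen.get, reverse=True):
    print(name, movement_degree[name], round(eigen[name], 3))
```

In this toy graph Impressionism shares artists with both other movements, so it ranks first on both measures.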

Results:

  • Bipartite density: 0.0081 (sparse network indicating specialized movements)
  • Movement-movement density: 0.0437
  • Top influential artists: Vincent van Gogh, Pablo Picasso, Rembrandt (by PageRank)
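For reference, density and PageRank can be computed along these lines. This is a sketch on tiny hypothetical graphs; the edge direction (influenced artist pointing to the artist who influenced them) is an assumption chosen so that PageRank rewards influential artists, and may differ from the notebook.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Bipartite density: edges divided by (artists x movements).
B = nx.Graph()
artists = ["artist_1", "artist_2"]
movements = ["movement_x", "movement_y"]
B.add_nodes_from(artists, bipartite=0)
B.add_nodes_from(movements, bipartite=1)
B.add_edges_from([
    ("artist_1", "movement_x"),
    ("artist_1", "movement_y"),
    ("artist_2", "movement_y"),
])
bipartite_density = bipartite.density(B, artists)  # 3 edges / (2 * 2) possible

# Influence graph (assumed direction): influenced -> influencer.
influence = nx.DiGraph()
influence.add_edges_from([
    ("artist_b", "artist_a"),
    ("artist_c", "artist_a"),
    ("artist_d", "artist_b"),
])
pagerank_scores = nx.pagerank(influence, alpha=0.85)
ranking = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)

print(bipartite_density)
print(ranking)
```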

View Notebook →


🌐 3. Web Scraping & Analytics

Problem Statement:
Automated hotel data collection from Booking.com and Trivago to compare pricing, ratings, and availability for NYC hotels (check-in: Nov 15, 2025 | check-out: Nov 17, 2025).

Workflow:

  • Data Collection:

    • Selenium WebDriver (headless Chrome with stealth mode)
    • BeautifulSoup for HTML parsing
    • Handled dynamic content, popups, and pagination
  • Preprocessing:

    • Standardized distances (miles → km)
    • Cleaned price strings ($XXX → float)
    • Normalized rating descriptions (Very good → very good)
  • Exploratory Analysis:

    • Price distribution (min: $87.50, max: $488)
    • Distance from downtown (0.3-7.24 km)
    • Rating correlation with reviews
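The preprocessing steps above can be sketched with a few small helpers. The function names and the exact format of the scraped strings (for example "4.5 miles from downtown") are assumptions for illustration; the real page text may differ.

```python
import re

MILES_TO_KM = 1.609344


def parse_price(text):
    """Turn a price string such as '$1,234.50' into a float; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def distance_to_km(text):
    """Read a distance in km or miles and return kilometres rounded to 2 decimals."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(km|miles?|mi)\b", text.lower())
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return round(value if unit == "km" else value * MILES_TO_KM, 2)


def normalize_rating(text):
    """Collapse whitespace and lowercase a rating label."""
    return " ".join(text.split()).lower()


print(parse_price("$87.50"))
print(distance_to_km("4.5 miles from downtown"))
print(normalize_rating("Very  good"))
```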

Results:

  • Records Scraped: 155 hotels
  • Average Price: $210 per stay
  • Average Rating: 8.1/10
  • Key Finding: Hotels within 1 km of Times Square cost 25% more on average
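A location premium like the one in the key finding is a relative difference between two group means. The sketch below shows the arithmetic on made-up records; the `price_premium` helper, the field names, and the prices are hypothetical and do not reproduce the scraped data.

```python
from statistics import mean

# Made-up hotel records for illustration only.
hotels = [
    {"price": 230.0, "distance_km": 0.4},
    {"price": 250.0, "distance_km": 0.9},
    {"price": 180.0, "distance_km": 2.5},
    {"price": 220.0, "distance_km": 5.0},
]


def price_premium(records, radius_km=1.0):
    """Percent by which the mean price inside the radius exceeds the mean price outside it."""
    near = [r["price"] for r in records if r["distance_km"] <= radius_km]
    far = [r["price"] for r in records if r["distance_km"] > radius_km]
    return (mean(near) - mean(far)) / mean(far) * 100


# Mean near = 240, mean far = 200, so the premium here is 20%.
print(price_premium(hotels))
```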

View Notebook →


πŸ› οΈ Tech Stack

  • Languages: Python 3.12

  • Libraries:

    • ML: Scikit-learn, Pandas, NumPy
    • Graph Analysis: NetworkX, PyVis (built on vis.js)
    • Web Scraping: Selenium, BeautifulSoup, Requests
    • Visualization: Matplotlib, Seaborn
  • Tools: Jupyter Notebook, Git, Google Colab
