jbwashington/school-stats

School Stats Platform

A dedicated data collection and API service for comprehensive NCAA athletic program data. This platform specializes in scraping, processing, and serving athletic staff data via authenticated API endpoints.

Production URL: https://us-schools-data.vercel.app

Quick Start

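Prerequisites

  • Bun, which runs every script in this README
  • Supabase CLI, plus a running Docker daemon for supabase start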

Initial Setup

# 1. Clone the repository
git clone <your-school-stats-repo>
cd school-stats

# 2. Install dependencies
bun install

# 3. Set up environment variables
cp .env.example .env.local
# Edit .env.local with your actual values

# 4. Start Supabase locally
supabase start

# 5. Run database migrations
bun db:migrate

# 6. Generate TypeScript types
bun db:gen-types

# 7. Start development server
bun dev

Environment Configuration

Edit .env.local with your Supabase and API keys:

# Get these from: supabase status
NEXT_PUBLIC_SUPABASE_URL="http://localhost:54321"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-anon-key-from-supabase-status"
SUPABASE_SERVICE_ROLE_KEY="your-service-role-key-from-supabase-status"

# Get API keys from respective services
FIRECRAWL_API_KEY="fc-your-firecrawl-api-key"
TOGETHER_API_KEY="your-together-ai-key"

# Generate secure random strings
API_SECRET_KEY="your-32-character-secret-key"
ADMIN_API_KEY="your-admin-api-key"
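
Before starting the dev server, it can help to confirm that the placeholders above were actually replaced. The sketch below is a hypothetical helper, not part of this repository, that reports required variables which are unset, blank, or still contain a your- placeholder. It treats TOGETHER_API_KEY as optional, following the note under API Keys Required.

```typescript
// Variable names are taken from the .env.local example above.
// TOGETHER_API_KEY is left out because it is described as optional.
const REQUIRED_ENV_VARS = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
  "FIRECRAWL_API_KEY",
  "API_SECRET_KEY",
  "ADMIN_API_KEY",
];

type Env = Record<string, string | undefined>;

// Hypothetical helper: returns the names of required variables that are
// unset, blank, or still hold a "your-..." placeholder value.
function findMissingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = (env[name] ?? "").trim();
    return value === "" || value.indexOf("your-") !== -1;
  });
}
```

One way to use it is to call findMissingEnvVars(process.env) at startup and log any names it returns.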

API Usage

Authentication

All API endpoints require an API key, sent as a Bearer token in the Authorization header:

# Local development
curl -H "Authorization: Bearer your-api-key" \
  http://localhost:3000/api/schools

# Production
curl -H "Authorization: Bearer your-api-key" \
  https://us-schools-data.vercel.app/api/schools
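
The same request can be made from TypeScript. This is a minimal sketch that assumes a runtime with a global fetch (Bun or a recent Node.js); authHeaders and apiGet are illustrative names, not functions exported by this project.

```typescript
// Builds the same Authorization header that the curl examples send.
function authHeaders(apiKey: string): Record<string, string> {
  return { Authorization: `Bearer ${apiKey}` };
}

// Hypothetical wrapper: GETs a path under baseUrl and parses the JSON body.
// Throws on any non-2xx status so callers do not parse error pages as data.
async function apiGet(baseUrl: string, path: string, apiKey: string): Promise<unknown> {
  const url = new URL(path, baseUrl).toString();
  const response = await fetch(url, { headers: authHeaders(apiKey) });
  if (!response.ok) {
    throw new Error(`Request to ${url} failed with status ${response.status}`);
  }
  return response.json();
}
```

For example, apiGet("http://localhost:3000", "/api/schools", key) mirrors the local development curl call.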

Test API Keys

The system includes pre-configured test keys:

  • Read-only: school_stats_test_key_12345678901234567890
  • Admin: school_stats_admin_key_98765432109876543210

Core Endpoints

Get Schools

# Get all schools
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools"

# Filter by conference and state
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools?conference=SEC&state=Alabama"

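When filters come from user input, building the query string by hand invites encoding mistakes. The sketch below uses the standard URL and URLSearchParams APIs. Only the conference and state parameters shown above are covered; SchoolFilters and buildSchoolsUrl are hypothetical names.

```typescript
// Filters documented in the curl example above; both are optional.
interface SchoolFilters {
  conference?: string;
  state?: string;
}

// Returns the /api/schools URL with any provided filters URL-encoded.
function buildSchoolsUrl(baseUrl: string, filters: SchoolFilters = {}): string {
  const url = new URL("/api/schools", baseUrl);
  if (filters.conference) {
    url.searchParams.set("conference", filters.conference);
  }
  if (filters.state) {
    url.searchParams.set("state", filters.state);
  }
  return url.toString();
}
```

Omitted filters are left out of the URL entirely, so the unfiltered call stays identical to the first example.
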
Get Athletic Staff

# Get staff for a specific school
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools/8/staff"

# Filter by sport and title
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools/8/staff?sport=Football&title=head-coach"

Search Staff Across Schools

# Search for coaches by name and sport
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/staff/search?name=john&sport=basketball"

Trigger Scraping (Admin Only)

# Start hybrid scraping job
curl -X POST \
  -H "Authorization: Bearer school_stats_admin_key_98765432109876543210" \
  -H "Content-Type: application/json" \
  -d '{"method": "hybrid", "school_ids": [8, 15, 23]}' \
  http://localhost:3000/api/admin/scrape
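
A typed builder can catch malformed payloads before they reach the admin endpoint. In this sketch only "hybrid" is confirmed by the request above; the "firecrawl" and "puppeteer" values are assumptions inferred from the scrape:firecrawl and scrape:puppeteer scripts, and the validation rules are illustrative.

```typescript
// "hybrid" appears in the documented request; the other two values are
// assumed from the script names and may not be accepted by the API.
type ScrapeMethod = "hybrid" | "firecrawl" | "puppeteer";

interface ScrapeRequest {
  method: ScrapeMethod;
  school_ids: number[];
}

// Hypothetical helper: validates the IDs and returns the request body.
function buildScrapeRequest(method: ScrapeMethod, schoolIds: number[]): ScrapeRequest {
  const invalid = schoolIds.filter((id) => !Number.isInteger(id) || id <= 0);
  if (schoolIds.length === 0 || invalid.length > 0) {
    throw new Error("school_ids must be a non-empty list of positive integers");
  }
  return { method, school_ids: schoolIds.slice() };
}
```

Passing the result through JSON.stringify yields the same body as the -d argument in the curl command.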

Data Sources

The platform aggregates school and athletic data from five sources, listed here in the order they are integrated:

1. College Scorecard API

Official data from the U.S. Department of Education's College Scorecard API:

  • Institution Information: School names, locations, website URLs
  • Carnegie Classifications: Research activity levels, institution types
  • Enrollment Data: Student population, demographics, admission rates
  • Academic Programs: Degree types offered, program availability
  • Financial Data: Tuition costs, financial aid statistics
  • Outcomes: Graduation rates, median earnings, debt levels

API Endpoint: https://api.data.gov/ed/collegescorecard/v1/schools

# Import College Scorecard data
COLLEGE_SCORECARD_API_KEY="your-api-key" bun scripts/import-college-scorecard.ts

# Bulk import with enhanced matching
bun scripts/import-scorecard-bulk.ts

Coverage: 6,000+ institutions including all Title IV eligible schools

2. IPEDS (Integrated Postsecondary Education Data System)

Official statistics from the National Center for Education Statistics:

  • UNITID: Unique identifier for institution matching
  • Institutional Characteristics: Control (public/private), size, setting
  • Geographic Data: FIPS codes, locale codes, census regions
  • Institutional Categories: Sector classifications, degree-granting status
  • Historical Data: Multi-year trends and institutional changes

Source: IPEDS CSV data files from NCES

# Import IPEDS data
bun import:ipeds

# Alternatively, use the script directly
bun scripts/import-ipeds-data.ts

Coverage: All postsecondary institutions that participate in federal financial aid programs

3. NCAA Official Directory

Official NCAA member institution data:

  • Complete NCAA Schools: Full directory of NCAA member institutions
  • Conference Affiliations: Official conference memberships and divisions
  • Athletic Websites: Verified URLs for athletic department websites
  • Division Classifications: Division I, II, III designations
  • Sport Sponsorships: Sports offered by each institution

Source: NCAA member directory and official publications

# Import NCAA directory data
bun scripts/import-ncaa-directory.ts

# Process NCAA datasets
bun process:ncaa-data

Coverage: 1,100+ NCAA member institutions across all divisions

4. Athletic Website Scraping

Real-time data extraction from school athletic websites:

  • Athletic Staff: Coaches, administrators, support staff
  • Contact Information: Email addresses, phone numbers
  • Staff Biographies: Background, experience, achievements
  • Sport Assignments: Sport-specific coaching staffs
  • Position Titles: Standardized role classifications

Methods:

  • Firecrawl API: AI-powered scraping for accessible sites (90%+ success)
  • Puppeteer + Stealth: Advanced anti-bot evasion for protected sites (100% access)
  • Hybrid Mode: Automatic method selection per school

# Run hybrid scraping
bun scrape:hybrid

# Run specific methods
bun scrape:firecrawl
bun scrape:puppeteer

Coverage: 1,000+ athletic department websites with continuous updates

5. Sports Reference Data

Curated sports classification and metadata:

  • Sports Classifications: Standardized sport names and categories
  • Performance Metrics: Available statistics per sport
  • MaxPreps Integration: Sport coverage and recruitment priority
  • Gender Categories: Men's, women's, co-ed sport classifications

Source: Internal curation and sports data standards

# Validate datasets
bun validate:datasets

# View dataset documentation
cat datasets/README.md

Data Integration Flow

College Scorecard API
        ↓
    [Match by Name/Location]
        ↓
IPEDS Data (UNITID matching)
        ↓
    [Enhance with]
        ↓
NCAA Directory (Athletic focus)
        ↓
    [Real-time updates from]
        ↓
Athletic Website Scraping
        ↓
    [Enrich with]
        ↓
Sports Reference Metadata
        ↓
  [Final Database]
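
The first step of the flow, matching by name and location and then attaching the IPEDS UNITID, could be sketched as follows. Every field name and the matching rule (case-insensitive name plus state) are assumptions for illustration; the actual import scripts may match differently.

```typescript
// Illustrative record shapes; these are not the project's real schema.
interface ScorecardRecord {
  name: string;
  state: string;
}

interface IpedsRecord {
  unitid: number;
  name: string;
  state: string;
  control: string; // e.g. "public" or "private"
}

interface MergedSchool extends ScorecardRecord {
  unitid?: number;
  control?: string;
}

// Normalizes name and state into a single lookup key.
function matchKey(name: string, state: string): string {
  return `${name.trim().toLowerCase()}|${state.trim().toLowerCase()}`;
}

// Attaches UNITID and control to each Scorecard record that has an IPEDS
// row with the same normalized name and state; others pass through as-is.
function mergeWithIpeds(scorecard: ScorecardRecord[], ipeds: IpedsRecord[]): MergedSchool[] {
  const byKey = new Map<string, IpedsRecord>();
  for (const row of ipeds) {
    byKey.set(matchKey(row.name, row.state), row);
  }
  return scorecard.map((school): MergedSchool => {
    const match = byKey.get(matchKey(school.name, school.state));
    return match
      ? { ...school, unitid: match.unitid, control: match.control }
      : { ...school };
  });
}
```

Once a UNITID is attached, later stages can join on it directly instead of repeating the fuzzy name match.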

Data Refresh Schedule

  • College Scorecard: Annual updates (typically September)
  • IPEDS: Annual updates (released in phases throughout year)
  • NCAA Directory: Semi-annual updates (fall/spring)
  • Athletic Website Scraping: Weekly automated runs
  • Sports Reference: Quarterly manual reviews

API Keys Required

# College Scorecard
COLLEGE_SCORECARD_API_KEY="your-api-key-here"
# Get yours at: https://api.data.gov/signup/

# Firecrawl (for web scraping)
FIRECRAWL_API_KEY="fc-your-api-key"
# Get yours at: https://firecrawl.dev

# Together AI (optional, for enhanced extraction)
TOGETHER_API_KEY="your-together-api-key"

Data Collection

Manual Data Import

# Import all data sources in sequence
bun scripts/import-college-scorecard.ts
bun scripts/import-ipeds-data.ts
bun scripts/import-ncaa-directory.ts

# Process and validate
bun process:ncaa-data
bun validate:datasets

Scraping Methods

The platform supports three scraping approaches:

  1. Firecrawl - Fast, works well for accessible sites
  2. Puppeteer - Advanced anti-bot evasion for blocked sites
  3. Hybrid - Automatically chooses best method per school
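
One plausible way to implement hybrid selection is a simple fallback: try the fast method first and switch to the stealth method when it fails or finds nothing. The sketch below shows that strategy with the two scrapers injected as functions. It is an assumption about how hybrid mode could work, not a description of the code in lib/scraping, and all type names are hypothetical.

```typescript
interface StaffMember {
  name: string;
  title: string;
}

// Any function that takes a staff-directory URL and resolves to staff rows.
type Scraper = (url: string) => Promise<StaffMember[]>;

interface HybridResult {
  method: "firecrawl" | "puppeteer";
  staff: StaffMember[];
}

// Tries the fast scraper first; falls back to the stealth scraper when the
// first attempt throws or returns an empty list.
async function scrapeHybrid(url: string, firecrawl: Scraper, puppeteer: Scraper): Promise<HybridResult> {
  try {
    const staff = await firecrawl(url);
    if (staff.length > 0) {
      return { method: "firecrawl", staff };
    }
  } catch {
    // Ignore the error and fall through to the second method.
  }
  return { method: "puppeteer", staff: await puppeteer(url) };
}
```

Injecting the scrapers keeps the selection logic testable without network access.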

Manual Scraping

# Run hybrid scraping on all schools
bun scrape:hybrid

# Run specific method
bun scrape:firecrawl
bun scrape:puppeteer

# Add new schools to database
bun migrate:schools

# Monitor data quality
bun monitor:data-quality

Scraping Performance

  • Small/Mid Schools: 90%+ success with Firecrawl
  • Major Programs: 75%+ success with Puppeteer (Alabama, UCLA, etc.)
  • Overall Hybrid: ~87% expected success rate
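
The overall figure is consistent with a weighted average of the two per-method rates. As an illustration only: if roughly 80% of schools were handled by Firecrawl at 90% success and the remaining 20% by Puppeteer at 75%, the blend comes to about 87%. The 80/20 split is an assumption made for this example; the project does not state how schools are divided between methods.

```typescript
// Expected overall success rate when a given share of schools goes to
// Firecrawl and the rest to Puppeteer. All inputs are fractions in [0, 1].
function blendedSuccessRate(firecrawlShare: number, firecrawlRate: number, puppeteerRate: number): number {
  if (firecrawlShare < 0 || firecrawlShare > 1) {
    throw new RangeError("firecrawlShare must be between 0 and 1");
  }
  return firecrawlShare * firecrawlRate + (1 - firecrawlShare) * puppeteerRate;
}

// Assumed 80/20 split with the per-method rates listed above.
const estimate = blendedSuccessRate(0.8, 0.9, 0.75);
console.log(`${(estimate * 100).toFixed(0)}%`);
```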

Development

Available Scripts

# Development
bun dev              # Start dev server
bun build            # Build for production
bun start            # Start production server

# Database
bun db:migrate       # Run migrations
bun db:seed          # Seed test data
bun db:reset         # Reset database
bun db:gen-types     # Generate TypeScript types

# Data Collection
bun scrape:hybrid    # Hybrid scraping
bun scrape:firecrawl # Firecrawl only
bun scrape:puppeteer # Puppeteer only

# Dataset Management
bun process:ncaa-data        # Process NCAA CSV files
bun validate:datasets        # Validate dataset quality
bun migrate:schools          # Add major NCAA schools

# Monitoring
bun monitor:data-quality        # Data quality check
bun analyze:blocked-schools     # Identify blocked schools

# Testing
bun test             # Run unit tests
bun test:e2e         # Run E2E tests
bun lint             # Lint code
bun typecheck        # TypeScript check

Project Structure

school-stats/
├── app/
│   ├── api/                 # API endpoints
│   │   ├── schools/         # School data APIs
│   │   ├── staff/           # Staff search APIs
│   │   ├── scrape/          # Scraping triggers
│   │   └── admin/           # Admin operations
│   └── dashboard/           # Optional admin dashboard
├── lib/
│   ├── scraping/            # Scraping orchestration
│   ├── firecrawl/           # Firecrawl integration
│   ├── puppeteer/           # Stealth scraping
│   ├── validation/          # Data quality validation
│   ├── api/                 # API utilities
│   └── supabase/            # Database client
├── scripts/
│   ├── scraping/            # Manual scraping scripts
│   ├── data-migration/      # Data import scripts
│   └── monitoring/          # Quality monitoring
├── datasets/                # CSV datasets and documentation
│   ├── raw/                 # Original NCAA and sports data
│   ├── processed/           # Cleaned datasets and reports
│   └── README.md            # Dataset documentation
└── supabase/
    ├── migrations/          # Database schema
    └── tests/               # Database tests

Production Deployment

Live Production URL: https://us-schools-data.vercel.app

Supabase Setup

  1. Create new Supabase project
  2. Copy connection details to .env.production
  3. Run migrations: supabase db push --db-url="your-prod-db-url"

Vercel Deployment

  1. Connect repository to Vercel
  2. Set environment variables in Vercel dashboard
  3. Deploy automatically on push to main

Environment Variables

# Production Database
DATABASE_URL="postgresql://user:pass@host:port/school_stats"
NEXT_PUBLIC_SUPABASE_URL="https://your-project.supabase.co"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-prod-anon-key"
SUPABASE_SERVICE_ROLE_KEY="your-prod-service-role-key"

# External APIs
FIRECRAWL_API_KEY="fc-your-production-key"
TOGETHER_API_KEY="your-production-key"

# Security
API_SECRET_KEY="your-secure-32-char-secret"
ADMIN_API_KEY="your-secure-admin-key"

API Documentation

Response Format

All endpoints return consistent JSON responses:

{
  "success": true,
  "data": [...],
  "message": "Optional message",
  "metadata": {
    "timestamp": "2025-08-27T00:00:00.000Z",
    "pagination": {...},
    "statistics": {...}
  }
}

Error Handling

Error responses include helpful details:

{
  "success": false,
  "error": "Descriptive error message",
  "metadata": {
    "timestamp": "2025-08-27T00:00:00.000Z"
  }
}
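
Clients can model both envelopes as a discriminated union on the success field. The types below are a sketch derived from the two JSON examples above. The shapes of pagination and statistics are elided in those examples, so they are typed as unknown here; the names ApiResponse and unwrap are hypothetical.

```typescript
interface ResponseMetadata {
  timestamp: string;
  pagination?: unknown; // shape not documented above
  statistics?: unknown; // shape not documented above
}

interface ApiSuccess<T> {
  success: true;
  data: T;
  message?: string;
  metadata: ResponseMetadata;
}

interface ApiFailure {
  success: false;
  error: string;
  metadata: ResponseMetadata;
}

type ApiResponse<T> = ApiSuccess<T> | ApiFailure;

// Returns the payload of a successful response, or throws with the
// server-provided error message.
function unwrap<T>(response: ApiResponse<T>): T {
  if (!response.success) {
    throw new Error(response.error);
  }
  return response.data;
}
```

Checking success first lets the compiler narrow the type, so data is only reachable on the success branch.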

Rate Limiting

  • Default: 1000 requests/hour per API key
  • Admin keys: 5000 requests/hour
  • Rate limit headers included in responses
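
A client can stay under these limits by tracking its own request count. The sketch below is a client-side fixed-window counter. It assumes the server resets its count once per hour, which is not confirmed above, and it does not read the rate limit headers because their names are not documented here. HourlyBudget is a hypothetical class.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Client-side counter that allows at most `limit` requests per one-hour window.
class HourlyBudget {
  private readonly limit: number;
  private readonly now: () => number;
  private windowStart = Number.NEGATIVE_INFINITY;
  private used = 0;

  // `now` is injectable so the window can be tested without waiting.
  constructor(limit: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.now = now;
  }

  // Returns true and records the request if budget remains, otherwise false.
  tryAcquire(): boolean {
    const current = this.now();
    if (current - this.windowStart >= ONE_HOUR_MS) {
      this.windowStart = current;
      this.used = 0;
    }
    if (this.used >= this.limit) {
      return false;
    }
    this.used += 1;
    return true;
  }
}
```

A read-only client would construct new HourlyBudget(1000) and call tryAcquire() before each request.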

Data Quality

Coach Record Standards

  • Names: Properly capitalized ("John Smith", not "john smith")
  • Titles: Standardized (Head Coach, Assistant Coach, etc.)
  • Contact Info: Valid email/phone formats when available
  • Confidence Scores: 0.6-1.0 based on extraction quality
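
Two of these standards are easy to express as small checks. The helpers below are illustrative and not taken from lib/validation. normalizeName is deliberately naive, and treating 0.6 as the minimum acceptable score is an assumption based on the range listed above.

```typescript
// Collapses whitespace and capitalizes the first letter of each name part.
// Naive by design: it would also rewrite "McDonald" as "Mcdonald".
function normalizeName(raw: string): string {
  return raw
    .trim()
    .split(/\s+/)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
    .join(" ");
}

// Returns true when a score falls inside the documented 0.6 to 1.0 range.
function isAcceptableConfidence(score: number): boolean {
  return score >= 0.6 && score <= 1.0;
}
```

Names with internal capitals or particles would need a lookup table or a dedicated library rather than this rule.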

Monitoring

The platform tracks:

  • Scraping success rates by method and school
  • Data quality metrics and confidence scores
  • API usage patterns and performance
  • Contact information coverage rates

Support

Common Issues

  1. API Key Invalid: Check key format and permissions
  2. Rate Limited: Wait for rate limit window to reset
  3. School Not Found: Verify school ID exists in database
  4. Scraping Failed: Check school website accessibility

Debugging

# Check API key status
curl -H "Authorization: Bearer your-key" \
  http://localhost:3000/api/admin/scrape

# View recent scraping runs
curl -H "Authorization: Bearer your-admin-key" \
  "http://localhost:3000/api/admin/scrape?limit=5"

# Test individual school scraping
bun lib/scraping/hybrid-scraper-system.ts

Contributing

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/new-feature
  3. Commit changes: git commit -am 'Add new feature'
  4. Push to branch: git push origin feature/new-feature
  5. Create Pull Request

License

Private - NCRA Platform Internal Use Only
