School Stats Platform

A data collection and API service for NCAA athletic program data. The platform scrapes, processes, and serves athletic staff data through authenticated API endpoints.

Production URL: https://us-schools-data.vercel.app

Quick Start

Prerequisites

Bun (latest version)
Supabase CLI
Git

Initial Setup

# 1. Clone the repository
git clone <your-school-stats-repo>
cd school-stats

# 2. Install dependencies
bun install

# 3. Set up environment variables
cp .env.example .env.local
# Edit .env.local with your actual values

# 4. Start Supabase locally
supabase start

# 5. Run database migrations
bun db:migrate

# 6. Generate TypeScript types
bun db:gen-types

# 7. Start development server
bun dev

Environment Configuration

Edit .env.local with your Supabase and API keys:

# Get these from: supabase status
NEXT_PUBLIC_SUPABASE_URL="http://localhost:54321"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-anon-key-from-supabase-status"
SUPABASE_SERVICE_ROLE_KEY="your-service-role-key-from-supabase-status"

# Get API keys from respective services
FIRECRAWL_API_KEY="fc-your-firecrawl-api-key"
TOGETHER_API_KEY="your-together-ai-key"

# Generate secure random strings
API_SECRET_KEY="your-32-character-secret-key"
ADMIN_API_KEY="your-admin-api-key"
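
A startup check along the following lines can catch a missing variable before the server boots. This is a minimal sketch, not code from the repository: `REQUIRED_ENV_VARS` and `missingEnvVars` are hypothetical names, and the list simply mirrors the variables shown above. `TOGETHER_API_KEY` is left out because the API Keys Required section describes it as optional.

```typescript
// Hypothetical helper: the names mirror the variables listed for .env.local.
const REQUIRED_ENV_VARS = [
  "NEXT_PUBLIC_SUPABASE_URL",
  "NEXT_PUBLIC_SUPABASE_ANON_KEY",
  "SUPABASE_SERVICE_ROLE_KEY",
  "FIRECRAWL_API_KEY",
  "API_SECRET_KEY",
  "ADMIN_API_KEY",
] as const;

type Env = Record<string, string | undefined>;

// Returns the names of required variables that are unset or blank.
function missingEnvVars(env: Env): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling `missingEnvVars(process.env)` at startup and refusing to boot when the result is non-empty gives a clearer failure than a rejected Supabase or Firecrawl request later on.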

API Usage

Authentication

Every API endpoint requires an API key, sent as a Bearer token in the Authorization header:

# Local development
curl -H "Authorization: Bearer your-api-key" \
  http://localhost:3000/api/schools

# Production
curl -H "Authorization: Bearer your-api-key" \
  https://us-schools-data.vercel.app/api/schools
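
For callers working in TypeScript rather than curl, the same header can be assembled in code. The sketch below is illustrative only: `ApiRequest` and `buildApiRequest` are hypothetical names, and the only details taken from this README are the base URLs and the Bearer scheme.

```typescript
// Hypothetical request descriptor; the field names are illustrative.
interface ApiRequest {
  url: string;
  headers: Record<string, string>;
}

// Builds the URL and Authorization header for an authenticated GET request.
function buildApiRequest(baseUrl: string, path: string, apiKey: string): ApiRequest {
  if (apiKey.trim() === "") {
    throw new Error("API key is required");
  }
  // Avoid a double slash when joining the base URL and the path.
  const base = baseUrl.replace(/\/+$/, "");
  const suffix = path.charAt(0) === "/" ? path : `/${path}`;
  return {
    url: `${base}${suffix}`,
    headers: { Authorization: `Bearer ${apiKey}` },
  };
}
```

With the global `fetch`, the descriptor can be passed on as `fetch(request.url, { headers: request.headers })`.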

Test API Keys

The system includes pre-configured test keys:

Read-only: school_stats_test_key_12345678901234567890
Admin: school_stats_admin_key_98765432109876543210

Core Endpoints

Get Schools

# Get all schools
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools"

# Filter by conference and state
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools?conference=SEC&state=Alabama"

Get Athletic Staff

# Get staff for a specific school
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools/8/staff"

# Filter by sport and title
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/schools/8/staff?sport=Football&title=head-coach"

Search Staff Across Schools

# Search for coaches by name and sport
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
  "http://localhost:3000/api/staff/search?name=john&sport=basketball"

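The filter parameters in the curl examples above can be built programmatically so that values containing spaces or other reserved characters are encoded correctly. This is a sketch under the assumption that the API accepts standard URL-encoded query strings; `buildQuery` is a hypothetical helper, while the paths and parameter names come from the examples above.

```typescript
type QueryValue = string | number | undefined;

// Serializes filters into a query string, skipping unset or empty values.
function buildQuery(params: Record<string, QueryValue>): string {
  const pairs: string[] = [];
  for (const key of Object.keys(params)) {
    const value = params[key];
    if (value === undefined || value === "") continue;
    pairs.push(`${encodeURIComponent(key)}=${encodeURIComponent(String(value))}`);
  }
  return pairs.length > 0 ? `?${pairs.join("&")}` : "";
}

// Paths and parameter names are taken from the curl examples.
const schoolsPath = `/api/schools${buildQuery({ conference: "SEC", state: "Alabama" })}`;
const staffPath = `/api/schools/8/staff${buildQuery({ sport: "Football", title: "head-coach" })}`;
const searchPath = `/api/staff/search${buildQuery({ name: "john", sport: "basketball" })}`;
```

Skipping undefined values lets one call site serve both the filtered and unfiltered forms of an endpoint.
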
Trigger Scraping (Admin Only)

# Start hybrid scraping job
curl -X POST \
  -H "Authorization: Bearer school_stats_admin_key_98765432109876543210" \
  -H "Content-Type: application/json" \
  -d '{"method": "hybrid", "school_ids": [8, 15, 23]}' \
  http://localhost:3000/api/admin/scrape
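
The request body can be validated before it is sent. The sketch below makes several assumptions: only `"hybrid"` appears in the example payload, so treating `"firecrawl"` and `"puppeteer"` as valid `method` values is a guess based on the scraping methods named elsewhere in this README, and the non-empty and positive-integer checks reflect what the endpoint plausibly expects rather than documented behavior. `buildScrapeBody` is a hypothetical helper.

```typescript
// "hybrid" comes from the example payload; the other two values are assumed.
type ScrapeMethod = "hybrid" | "firecrawl" | "puppeteer";

interface ScrapeRequest {
  method: ScrapeMethod;
  school_ids: number[];
}

// Validates the inputs and returns the JSON body for POST /api/admin/scrape.
function buildScrapeBody(method: ScrapeMethod, schoolIds: number[]): string {
  if (schoolIds.length === 0) {
    throw new Error("At least one school ID is required");
  }
  for (const id of schoolIds) {
    if (!Number.isInteger(id) || id <= 0) {
      throw new Error(`Invalid school ID: ${id}`);
    }
  }
  // Drop duplicate IDs while keeping the caller's order.
  const unique = schoolIds.filter((id, index) => schoolIds.indexOf(id) === index);
  const body: ScrapeRequest = { method, school_ids: unique };
  return JSON.stringify(body);
}
```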

Data Sources

The platform combines data from five sources into one school and athletics database:

1. College Scorecard API

Official data from the U.S. Department of Education's College Scorecard API:

Institution Information: School names, locations, website URLs
Carnegie Classifications: Research activity levels, institution types
Enrollment Data: Student population, demographics, admission rates
Academic Programs: Degree types offered, program availability
Financial Data: Tuition costs, financial aid statistics
Outcomes: Graduation rates, median earnings, debt levels

API Endpoint: https://api.data.gov/ed/collegescorecard/v1/schools

# Import College Scorecard data
COLLEGE_SCORECARD_API_KEY="your-api-key" bun scripts/import-college-scorecard.ts

# Bulk import with enhanced matching
bun scripts/import-scorecard-bulk.ts

Coverage: 6,000+ institutions including all Title IV eligible schools

2. IPEDS (Integrated Postsecondary Education Data System)

Official statistics from the National Center for Education Statistics:

UNITID: Unique identifier for institution matching
Institutional Characteristics: Control (public/private), size, setting
Geographic Data: FIPS codes, locale codes, census regions
Institutional Categories: Sector classifications, degree-granting status
Historical Data: Multi-year trends and institutional changes

Source: IPEDS CSV data files from NCES

# Import IPEDS data
bun import:ipeds

# Alternatively, use the script directly
bun scripts/import-ipeds-data.ts

Coverage: All postsecondary institutions that participate in federal financial aid programs

3. NCAA Official Directory

Official NCAA member institution data:

Complete NCAA Schools: Full directory of NCAA member institutions
Conference Affiliations: Official conference memberships and divisions
Athletic Websites: Verified URLs for athletic department websites
Division Classifications: Division I, II, III designations
Sport Sponsorships: Sports offered by each institution

Source: NCAA member directory and official publications

# Import NCAA directory data
bun scripts/import-ncaa-directory.ts

# Process NCAA datasets
bun process:ncaa-data

Coverage: 1,100+ NCAA member institutions across all divisions

4. Athletic Website Scraping

Staff data extracted directly from school athletic websites on a weekly schedule:

Athletic Staff: Coaches, administrators, support staff
Contact Information: Email addresses, phone numbers
Staff Biographies: Background, experience, achievements
Sport Assignments: Sport-specific coaching staffs
Position Titles: Standardized role classifications

Methods:

Firecrawl API: AI-powered scraping for accessible sites (90%+ success)
Puppeteer + Stealth: Advanced anti-bot evasion for protected sites (reported 100% page access; 75%+ extraction success on major programs, per Scraping Performance)
Hybrid Mode: Automatic method selection per school

# Run hybrid scraping
bun scrape:hybrid

# Run specific methods
bun scrape:firecrawl
bun scrape:puppeteer

Coverage: 1,000+ athletic department websites, refreshed by weekly automated runs

5. Sports Reference Data

Curated sports classification and metadata:

Sports Classifications: Standardized sport names and categories
Performance Metrics: Available statistics per sport
MaxPreps Integration: Sport coverage and recruitment priority
Gender Categories: Men's, women's, co-ed sport classifications

Source: Internal curation and sports data standards

# Validate datasets
bun validate:datasets

# View dataset documentation
cat datasets/README.md

Data Integration Flow

College Scorecard API
        ↓
    [Match by Name/Location]
        ↓
IPEDS Data (UNITID matching)
        ↓
    [Enhance with]
        ↓
NCAA Directory (Athletic focus)
        ↓
    [Weekly updates from]
        ↓
Athletic Website Scraping
        ↓
    [Enrich with]
        ↓
Sports Reference Metadata
        ↓
  [Final Database]
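
The first three stages of this flow amount to joining records on two kinds of key. The sketch below shows one reading of the diagram, assuming College Scorecard and IPEDS rows share a UNITID and NCAA directory rows are matched on a normalized name and state. Every interface, field name, and function here is hypothetical; the actual import scripts may match differently.

```typescript
// All record shapes are hypothetical; real column names may differ.
interface ScorecardSchool { unitid: number; name: string; state: string; }
interface IpedsRow { unitid: number; control: string; }
interface NcaaMember { name: string; state: string; division: string; conference: string; }

interface MergedSchool extends ScorecardSchool {
  control?: string;
  division?: string;
  conference?: string;
}

// Normalizes a name and state into a comparison key for name/location matching.
function nameLocationKey(name: string, state: string): string {
  const cleaned = name
    .toLowerCase()
    .replace(/&/g, " and ")
    .replace(/[^a-z0-9 ]/g, "")
    .replace(/\s+/g, " ")
    .trim();
  return `${cleaned}|${state.trim().toLowerCase()}`;
}

// Joins IPEDS rows by UNITID and NCAA rows by normalized name and state.
function mergeSources(
  scorecard: ScorecardSchool[],
  ipeds: IpedsRow[],
  ncaa: NcaaMember[],
): MergedSchool[] {
  const ipedsById = new Map<number, IpedsRow>();
  for (const row of ipeds) ipedsById.set(row.unitid, row);

  const ncaaByKey = new Map<string, NcaaMember>();
  for (const member of ncaa) ncaaByKey.set(nameLocationKey(member.name, member.state), member);

  return scorecard.map((school) => {
    const ipedsRow = ipedsById.get(school.unitid);
    const member = ncaaByKey.get(nameLocationKey(school.name, school.state));
    return {
      ...school,
      control: ipedsRow?.control,
      division: member?.division,
      conference: member?.conference,
    };
  });
}
```

Exact-key matching like this misses schools whose names differ between sources (abbreviations, campus suffixes), so unmatched rows would still need a review step.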

Data Refresh Schedule

College Scorecard: Annual updates (typically September)
IPEDS: Annual updates (released in phases throughout year)
NCAA Directory: Semi-annual updates (fall/spring)
Athletic Website Scraping: Weekly automated runs
Sports Reference: Quarterly manual reviews

API Keys Required

# College Scorecard
COLLEGE_SCORECARD_API_KEY="your-api-key-here"
# Get yours at: https://api.data.gov/signup/

# Firecrawl (for web scraping)
FIRECRAWL_API_KEY="fc-your-api-key"
# Get yours at: https://firecrawl.dev

# Together AI (optional, for enhanced extraction)
TOGETHER_API_KEY="your-together-api-key"

Data Collection

Manual Data Import

# Import all data sources in sequence
bun scripts/import-college-scorecard.ts
bun scripts/import-ipeds-data.ts
bun scripts/import-ncaa-directory.ts

# Process and validate
bun process:ncaa-data
bun validate:datasets

Scraping Methods

The platform supports three scraping approaches:

Firecrawl - Fast, works well for accessible sites
Puppeteer - Advanced anti-bot evasion for blocked sites
Hybrid - Automatically chooses best method per school
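
This README does not describe how Hybrid mode picks a method. One plausible policy, sketched below purely as an illustration, is to start with Firecrawl and switch a school to Puppeteer once Firecrawl has failed repeatedly without ever succeeding. `Attempt`, `chooseMethod`, and the failure threshold are all assumptions, not the behavior of the actual hybrid scraper.

```typescript
type Method = "firecrawl" | "puppeteer";

// One recorded scraping attempt for a single school (hypothetical shape).
interface Attempt {
  method: Method;
  succeeded: boolean;
}

// Picks a method for one school from its past attempts:
// prefer Firecrawl, and fall back to Puppeteer once Firecrawl has failed
// at least `maxFirecrawlFailures` times and has never succeeded.
function chooseMethod(history: Attempt[], maxFirecrawlFailures = 2): Method {
  const firecrawlAttempts = history.filter((attempt) => attempt.method === "firecrawl");
  const everSucceeded = firecrawlAttempts.some((attempt) => attempt.succeeded);
  const failures = firecrawlAttempts.filter((attempt) => !attempt.succeeded).length;
  if (!everSucceeded && failures >= maxFirecrawlFailures) {
    return "puppeteer";
  }
  return "firecrawl";
}
```

Trying the faster method first keeps the slower browser-based path for the sites that need it.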

Manual Scraping

# Run hybrid scraping on all schools
bun scrape:hybrid

# Run specific method
bun scrape:firecrawl
bun scrape:puppeteer

# Add new schools to database
bun migrate:schools

# Monitor data quality
bun monitor:data-quality

Scraping Performance

Small/Mid Schools: 90%+ success with Firecrawl
Major Programs: 75%+ success with Puppeteer (Alabama, UCLA, etc.)
Overall Hybrid: ~87% expected success rate
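
The README does not say how the ~87% figure is derived. One way to arrive at it is a weighted average of the two per-group rates. The sketch below assumes an 80/20 split between small/mid schools and major programs and uses the quoted lower bounds (90% and 75%); both the split and the function name `blendedSuccessRate` are assumptions made for illustration.

```typescript
interface Group {
  share: number;       // fraction of all schools in this group
  successRate: number; // fraction of scrapes that succeed
}

// Weighted average of per-group success rates; shares must sum to 1.
function blendedSuccessRate(groups: Group[]): number {
  const totalShare = groups.reduce((sum, group) => sum + group.share, 0);
  if (Math.abs(totalShare - 1) > 1e-9) {
    throw new Error("Shares must sum to 1");
  }
  return groups.reduce((sum, group) => sum + group.share * group.successRate, 0);
}

// Assumed mix: 80% small/mid schools (Firecrawl), 20% major programs (Puppeteer).
const overall = blendedSuccessRate([
  { share: 0.8, successRate: 0.9 },
  { share: 0.2, successRate: 0.75 },
]);
// 0.8 * 0.90 + 0.2 * 0.75 = 0.72 + 0.15, which is approximately 0.87
```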

Development

Available Scripts

# Development
bun dev              # Start dev server
bun run build        # Build for production (plain bun build invokes Bun's bundler)
bun start            # Start production server

# Database
bun db:migrate       # Run migrations
bun db:seed          # Seed test data
bun db:reset         # Reset database
bun db:gen-types     # Generate TypeScript types

# Data Collection
bun scrape:hybrid    # Hybrid scraping
bun scrape:firecrawl # Firecrawl only
bun scrape:puppeteer # Puppeteer only

# Dataset Management
bun process:ncaa-data        # Process NCAA CSV files
bun validate:datasets        # Validate dataset quality
bun migrate:schools          # Add major NCAA schools

# Monitoring
bun monitor:data-quality        # Data quality check
bun analyze:blocked-schools     # Identify blocked schools

# Testing
bun test             # Run unit tests
bun test:e2e         # Run E2E tests
bun lint             # Lint code
bun typecheck        # TypeScript check

Project Structure

school-stats/
├── app/
│   ├── api/                 # API endpoints
│   │   ├── schools/         # School data APIs
│   │   ├── staff/           # Staff search APIs
│   │   ├── scrape/          # Scraping triggers
│   │   └── admin/           # Admin operations
│   └── dashboard/           # Optional admin dashboard
├── lib/
│   ├── scraping/            # Scraping orchestration
│   ├── firecrawl/           # Firecrawl integration
│   ├── puppeteer/           # Stealth scraping
│   ├── validation/          # Data quality validation
│   ├── api/                 # API utilities
│   └── supabase/            # Database client
├── scripts/
│   ├── scraping/            # Manual scraping scripts
│   ├── data-migration/      # Data import scripts
│   └── monitoring/          # Quality monitoring
├── datasets/                # CSV datasets and documentation
│   ├── raw/                 # Original NCAA and sports data
│   ├── processed/           # Cleaned datasets and reports
│   └── README.md            # Dataset documentation
└── supabase/
    ├── migrations/          # Database schema
    └── tests/               # Database tests

Production Deployment

Live Production URL: https://us-schools-data.vercel.app

Supabase Setup

1. Create new Supabase project
2. Copy connection details to .env.production
3. Run migrations: supabase db push --db-url="your-prod-db-url"

Vercel Deployment

1. Connect repository to Vercel
2. Set environment variables in Vercel dashboard
3. Deploy automatically on push to main


Environment Variables

# Production Database
DATABASE_URL="postgresql://user:pass@host:port/school_stats"
NEXT_PUBLIC_SUPABASE_URL="https://your-project.supabase.co"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-prod-anon-key"
SUPABASE_SERVICE_ROLE_KEY="your-prod-service-role-key"

# External APIs
FIRECRAWL_API_KEY="fc-your-production-key"
TOGETHER_API_KEY="your-production-key"

# Security
API_SECRET_KEY="your-secure-32-char-secret"
ADMIN_API_KEY="your-secure-admin-key"

API Documentation

Response Format

All endpoints return consistent JSON responses:

{
  "success": true,
  "data": [...],
  "message": "Optional message",
  "metadata": {
    "timestamp": "2025-08-27T00:00:00.000Z",
    "pagination": {...},
    "statistics": {...}
  }
}

Error Handling

Error responses set success to false and describe the problem in the error field:

{
  "success": false,
  "error": "Descriptive error message",
  "metadata": {
    "timestamp": "2025-08-27T00:00:00.000Z"
  }
}
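
The success and error envelopes can be modeled as a discriminated union on `success`, which lets the compiler enforce that `data` is only read after the success check. The field names below follow the two JSON examples; the generic parameter, the `unknown` types for `pagination` and `statistics`, and the `unwrap` helper are assumptions added for this sketch.

```typescript
// Field names follow the JSON examples; the contents of pagination and
// statistics are not documented, so they are typed as unknown here.
interface ResponseMetadata {
  timestamp: string;
  pagination?: unknown;
  statistics?: unknown;
}

interface SuccessResponse<T> {
  success: true;
  data: T;
  message?: string;
  metadata: ResponseMetadata;
}

interface ErrorResponse {
  success: false;
  error: string;
  metadata: ResponseMetadata;
}

type ApiResponse<T> = SuccessResponse<T> | ErrorResponse;

// Returns data on success and throws the server's error message otherwise.
function unwrap<T>(response: ApiResponse<T>): T {
  if (response.success) {
    return response.data;
  }
  throw new Error(response.error);
}
```

A caller would typically parse the response body as `ApiResponse<School[]>` (with `School` defined by the consumer) and pass it through `unwrap`.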

Rate Limiting

Default: 1000 requests/hour per API key
Admin keys: 5000 requests/hour
Rate limit headers included in responses
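
A client can stay under these quotas by spacing its requests evenly. The sketch below relies only on the two hourly limits stated above; it does not read the rate limit headers because their names are not documented here. `minIntervalMs` and `delayBeforeNextMs` are hypothetical helpers.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Minimum spacing between requests that keeps a client under an hourly quota.
function minIntervalMs(requestsPerHour: number): number {
  if (requestsPerHour <= 0) {
    throw new Error("requestsPerHour must be positive");
  }
  return Math.ceil(MS_PER_HOUR / requestsPerHour);
}

// Milliseconds to wait before the next request, given when the last one was sent.
function delayBeforeNextMs(lastSentAtMs: number, nowMs: number, requestsPerHour: number): number {
  const earliestNextMs = lastSentAtMs + minIntervalMs(requestsPerHour);
  return Math.max(0, earliestNextMs - nowMs);
}

const defaultKeySpacingMs = minIntervalMs(1000); // 3600 ms between requests
const adminKeySpacingMs = minIntervalMs(5000);   // 720 ms between requests
```

Even spacing is conservative: it stays within the limit whether the server counts requests in fixed or sliding one-hour windows.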

Data Quality

Coach Record Standards

Names: Properly formatted (John Smith, not "john smith")
Titles: Standardized (Head Coach, Assistant Coach, etc.)
Contact Info: Valid email/phone formats when available
Confidence Scores: 0.6-1.0 based on extraction quality
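
These standards could be enforced with checks like the ones sketched below. The record shape, the email pattern, and the function names are hypothetical; only the name example and the 0.6 to 1.0 confidence range come from the list above. The name normalizer handles all-lowercase and all-uppercase input and leaves mixed-case parts such as "McDonald" untouched, so it is a starting point rather than a complete solution.

```typescript
// Hypothetical record shape; the real schema may differ.
interface StaffRecord {
  name: string;
  title: string;
  email?: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.6; // lower bound quoted in the standards
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/; // loose format check only

// Capitalizes one name part; mixed-case parts keep their inner capitals.
function normalizeNamePart(part: string): string {
  const isSingleCase = part === part.toLowerCase() || part === part.toUpperCase();
  const rest = isSingleCase ? part.slice(1).toLowerCase() : part.slice(1);
  return part.charAt(0).toUpperCase() + rest;
}

// Collapses whitespace and capitalizes each part of the name.
function normalizeName(raw: string): string {
  return raw.trim().split(/\s+/).map(normalizeNamePart).join(" ");
}

// Checks a record against the name, contact, and confidence standards.
function meetsStandards(record: StaffRecord): boolean {
  const nameOk = record.name.trim() !== "";
  const emailOk = record.email === undefined || EMAIL_PATTERN.test(record.email);
  const confidenceOk = record.confidence >= MIN_CONFIDENCE && record.confidence <= 1;
  return nameOk && emailOk && confidenceOk;
}
```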

Monitoring

The platform tracks:

Scraping success rates by method and school
Data quality metrics and confidence scores
API usage patterns and performance
Contact information coverage rates

Support

Common Issues

API Key Invalid: Check key format and permissions
Rate Limited: Wait for rate limit window to reset
School Not Found: Verify school ID exists in database
Scraping Failed: Check school website accessibility

Debugging

# Check API key status
curl -H "Authorization: Bearer your-key" \
  http://localhost:3000/api/admin/scrape

# View recent scraping runs
curl -H "Authorization: Bearer your-admin-key" \
  "http://localhost:3000/api/admin/scrape?limit=5"

# Test individual school scraping
bun lib/scraping/hybrid-scraper-system.ts

Contributing

1. Fork the repository
2. Create feature branch: git checkout -b feature/new-feature
3. Commit changes: git commit -am 'Add new feature'
4. Push to branch: git push origin feature/new-feature
5. Create Pull Request

License

Private - NCRA Platform Internal Use Only

jbwashington/school-stats
