A dedicated data collection and API service for comprehensive NCAA athletic program data. This platform specializes in scraping, processing, and serving athletic staff data via authenticated API endpoints.
Production URL: https://us-schools-data.vercel.app
- Bun (latest version)
- Supabase CLI
- Git
# 1. Clone the repository
git clone <your-school-stats-repo>
cd school-stats
# 2. Install dependencies
bun install
# 3. Set up environment variables
cp .env.example .env.local
# Edit .env.local with your actual values
# 4. Start Supabase locally
supabase start
# 5. Run database migrations
bun db:migrate
# 6. Generate TypeScript types
bun db:gen-types
# 7. Start development server
bun dev
Edit .env.local with your Supabase and API keys:
# Get these from: supabase status
NEXT_PUBLIC_SUPABASE_URL="http://localhost:54321"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-anon-key-from-supabase-status"
SUPABASE_SERVICE_ROLE_KEY="your-service-role-key-from-supabase-status"
# Get API keys from respective services
FIRECRAWL_API_KEY="fc-your-firecrawl-api-key"
TOGETHER_API_KEY="your-together-ai-key"
# Generate secure random strings
API_SECRET_KEY="your-32-character-secret-key"
ADMIN_API_KEY="your-admin-api-key"
All API endpoints require authentication via an API key in the Authorization header:
# Local development
curl -H "Authorization: Bearer your-api-key" \
http://localhost:3000/api/schools
# Production
curl -H "Authorization: Bearer your-api-key" \
https://us-schools-data.vercel.app/api/schools
The system includes pre-configured test keys:
- Read-only: school_stats_test_key_12345678901234567890
- Admin: school_stats_admin_key_98765432109876543210
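For programmatic access, the same Bearer-token pattern can be wrapped in a small client. This is an illustrative TypeScript sketch, not project code: the helper names (`schoolsUrl`, `fetchSchools`) are assumptions, and it presumes the local dev server from the quick-start is running.

```typescript
// Minimal client sketch using the read-only test key (assumes a local
// dev server on port 3000; helper names are illustrative).
const BASE_URL = "http://localhost:3000";
const API_KEY = "school_stats_test_key_12345678901234567890";

// Build a /api/schools URL with optional filters (conference, state, ...).
function schoolsUrl(filters: Record<string, string> = {}): string {
  const url = new URL("/api/schools", BASE_URL);
  for (const [key, value] of Object.entries(filters)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

async function fetchSchools(filters: Record<string, string> = {}) {
  const res = await fetch(schoolsUrl(filters), {
    headers: { Authorization: `Bearer ${API_KEY}` },
  });
  if (!res.ok) throw new Error(`API error ${res.status}`);
  return res.json(); // { success, data, metadata, ... }
}
```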
# Get all schools
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
"http://localhost:3000/api/schools"
# Filter by conference and state
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
"http://localhost:3000/api/schools?conference=SEC&state=Alabama"
# Get staff for a specific school
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
"http://localhost:3000/api/schools/8/staff"
# Filter by sport and title
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
"http://localhost:3000/api/schools/8/staff?sport=Football&title=head-coach"
# Search for coaches by name and sport
curl -H "Authorization: Bearer school_stats_test_key_12345678901234567890" \
"http://localhost:3000/api/staff/search?name=john&sport=basketball"
# Start hybrid scraping job
curl -X POST \
-H "Authorization: Bearer school_stats_admin_key_98765432109876543210" \
-H "Content-Type: application/json" \
-d '{"method": "hybrid", "school_ids": [8, 15, 23]}' \
http://localhost:3000/api/admin/scrape
The platform aggregates data from multiple authoritative sources to provide comprehensive school and athletic information:
Official data from the U.S. Department of Education's College Scorecard API:
- Institution Information: School names, locations, website URLs
- Carnegie Classifications: Research activity levels, institution types
- Enrollment Data: Student population, demographics, admission rates
- Academic Programs: Degree types offered, program availability
- Financial Data: Tuition costs, financial aid statistics
- Outcomes: Graduation rates, median earnings, debt levels
API Endpoint: https://api.data.gov/ed/collegescorecard/v1/schools
# Import College Scorecard data
COLLEGE_SCORECARD_API_KEY="your-api-key" bun scripts/import-college-scorecard.ts
# Bulk import with enhanced matching
bun scripts/import-scorecard-bulk.ts
Coverage: 6,000+ institutions, including all Title IV eligible schools
Official statistics from the National Center for Education Statistics:
- UNITID: Unique identifier for institution matching
- Institutional Characteristics: Control (public/private), size, setting
- Geographic Data: FIPS codes, locale codes, census regions
- Institutional Categories: Sector classifications, degree-granting status
- Historical Data: Multi-year trends and institutional changes
Source: IPEDS CSV data files from NCES
# Import IPEDS data
bun import:ipeds
# Alternatively, use the script directly
bun scripts/import-ipeds-data.ts
Coverage: All postsecondary institutions that participate in federal financial aid programs
Official NCAA member institution data:
- Complete NCAA Schools: Full directory of NCAA member institutions
- Conference Affiliations: Official conference memberships and divisions
- Athletic Websites: Verified URLs for athletic department websites
- Division Classifications: Division I, II, III designations
- Sport Sponsorships: Sports offered by each institution
Source: NCAA member directory and official publications
# Import NCAA directory data
bun scripts/import-ncaa-directory.ts
# Process NCAA datasets
bun process:ncaa-data
Coverage: 1,100+ NCAA member institutions across all divisions
Real-time data extraction from school athletic websites:
- Athletic Staff: Coaches, administrators, support staff
- Contact Information: Email addresses, phone numbers
- Staff Biographies: Background, experience, achievements
- Sport Assignments: Sport-specific coaching staffs
- Position Titles: Standardized role classifications
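The fields above can be pictured as a single record per staff member. The interface below is a hypothetical sketch based on this list; actual database column names may differ.

```typescript
// Hypothetical shape of one scraped staff record, derived from the
// fields listed above (not the project's actual schema).
interface StaffRecord {
  name: string;            // e.g. "John Smith"
  title: string;           // standardized, e.g. "Head Coach"
  sport: string;           // e.g. "Football"
  email?: string;          // present when the site publishes it
  phone?: string;
  bio?: string;
  confidenceScore: number; // 0.6-1.0 per the data quality notes
}

// A record missing optional contact fields is still valid:
const example: StaffRecord = {
  name: "Jane Doe",
  title: "Assistant Coach",
  sport: "Basketball",
  confidenceScore: 0.85,
};
```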
Methods:
- Firecrawl API: AI-powered scraping for accessible sites (90%+ success)
- Puppeteer + Stealth: Advanced anti-bot evasion for protected sites (100% access)
- Hybrid Mode: Automatic method selection per school
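The per-school selection in hybrid mode can be sketched roughly as follows. This is an illustrative simplification; the real logic lives in lib/scraping/hybrid-scraper-system.ts, and the type and field names here are assumptions.

```typescript
// Illustrative sketch of hybrid method selection (names are hypothetical;
// the actual implementation is likely more involved).
type ScrapeMethod = "firecrawl" | "puppeteer";

interface SchoolTarget {
  name: string;
  // Set when a prior Firecrawl attempt was blocked by anti-bot measures.
  blockedFirecrawl?: boolean;
}

function chooseMethod(school: SchoolTarget): ScrapeMethod {
  // Prefer the fast Firecrawl path; fall back to Puppeteer + stealth
  // for sites known to block it.
  return school.blockedFirecrawl ? "puppeteer" : "firecrawl";
}
```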
# Run hybrid scraping
bun scrape:hybrid
# Run specific methods
bun scrape:firecrawl
bun scrape:puppeteer
Coverage: 1,000+ athletic department websites with continuous updates
Curated sports classification and metadata:
- Sports Classifications: Standardized sport names and categories
- Performance Metrics: Available statistics per sport
- MaxPreps Integration: Sport coverage and recruitment priority
- Gender Categories: Men's, women's, co-ed sport classifications
Source: Internal curation and sports data standards
# Validate datasets
bun validate:datasets
# View dataset documentation
cat datasets/README.md
College Scorecard API
↓
[Match by Name/Location]
↓
IPEDS Data (UNITID matching)
↓
[Enhance with]
↓
NCAA Directory (Athletic focus)
↓
[Real-time updates from]
↓
Athletic Website Scraping
↓
[Enrich with]
↓
Sports Reference Metadata
↓
[Final Database]
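The "Match by Name/Location" step in the flow above implies some form of name normalization so that differently formatted school names collapse to the same key. This is a hypothetical sketch of that idea; the actual import scripts may normalize differently.

```typescript
// Hypothetical name-normalization helper for cross-source matching
// (illustrative only; not the project's actual matching logic).
function normalizeSchoolName(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, " ")                  // strip punctuation
    .replace(/\b(university|college|the)\b/g, "") // drop generic words
    .replace(/\s+/g, " ")
    .trim();
}
```

With this scheme, "Auburn University" and "auburn" normalize to the same key, so records from the Scorecard and IPEDS feeds can be joined even when their name formatting differs.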
- College Scorecard: Annual updates (typically September)
- IPEDS: Annual updates (released in phases throughout year)
- NCAA Directory: Semi-annual updates (fall/spring)
- Athletic Website Scraping: Weekly automated runs
- Sports Reference: Quarterly manual reviews
# College Scorecard
COLLEGE_SCORECARD_API_KEY="your-api-key-here"
# Get yours at: https://api.data.gov/signup/
# Firecrawl (for web scraping)
FIRECRAWL_API_KEY="fc-your-api-key"
# Get yours at: https://firecrawl.dev
# Together AI (optional, for enhanced extraction)
TOGETHER_API_KEY="your-together-api-key"
# Import all data sources in sequence
bun scripts/import-college-scorecard.ts
bun scripts/import-ipeds-data.ts
bun scripts/import-ncaa-directory.ts
# Process and validate
bun process:ncaa-data
bun validate:datasets
The platform supports three scraping approaches:
- Firecrawl - Fast, works well for accessible sites
- Puppeteer - Advanced anti-bot evasion for blocked sites
- Hybrid - Automatically chooses best method per school
# Run hybrid scraping on all schools
bun scrape:hybrid
# Run specific method
bun scrape:firecrawl
bun scrape:puppeteer
# Add new schools to database
bun migrate:schools
# Monitor data quality
bun monitor:data-quality
- Small/Mid Schools: 90%+ success with Firecrawl
- Major Programs: 75%+ success with Puppeteer (Alabama, UCLA, etc.)
- Overall Hybrid: ~87% expected success rate
# Development
bun dev # Start dev server
bun build # Build for production
bun start # Start production server
# Database
bun db:migrate # Run migrations
bun db:seed # Seed test data
bun db:reset # Reset database
bun db:gen-types # Generate TypeScript types
# Data Collection
bun scrape:hybrid # Hybrid scraping
bun scrape:firecrawl # Firecrawl only
bun scrape:puppeteer # Puppeteer only
# Dataset Management
bun process:ncaa-data # Process NCAA CSV files
bun validate:datasets # Validate dataset quality
bun migrate:schools # Add major NCAA schools
# Monitoring
bun monitor:data-quality # Data quality check
bun analyze:blocked-schools # Identify blocked schools
# Testing
bun test # Run unit tests
bun test:e2e # Run E2E tests
bun lint # Lint code
bun typecheck # TypeScript check
school-stats/
├── app/
│ ├── api/ # API endpoints
│ │ ├── schools/ # School data APIs
│ │ ├── staff/ # Staff search APIs
│ │ ├── scrape/ # Scraping triggers
│ │ └── admin/ # Admin operations
│ └── dashboard/ # Optional admin dashboard
├── lib/
│ ├── scraping/ # Scraping orchestration
│ ├── firecrawl/ # Firecrawl integration
│ ├── puppeteer/ # Stealth scraping
│ ├── validation/ # Data quality validation
│ ├── api/ # API utilities
│ └── supabase/ # Database client
├── scripts/
│ ├── scraping/ # Manual scraping scripts
│ ├── data-migration/ # Data import scripts
│ └── monitoring/ # Quality monitoring
├── datasets/ # CSV datasets and documentation
│ ├── raw/ # Original NCAA and sports data
│ ├── processed/ # Cleaned datasets and reports
│ └── README.md # Dataset documentation
└── supabase/
├── migrations/ # Database schema
└── tests/ # Database tests
Live Production URL: https://us-schools-data.vercel.app
- Create new Supabase project
- Copy connection details to .env.production
- Run migrations:
supabase db push --db-url="your-prod-db-url"
- Connect repository to Vercel
- Set environment variables in Vercel dashboard
- Deploy automatically on push to main
# Production Database
DATABASE_URL="postgresql://user:pass@host:port/school_stats"
NEXT_PUBLIC_SUPABASE_URL="https://your-project.supabase.co"
NEXT_PUBLIC_SUPABASE_ANON_KEY="your-prod-anon-key"
SUPABASE_SERVICE_ROLE_KEY="your-prod-service-role-key"
# External APIs
FIRECRAWL_API_KEY="fc-your-production-key"
TOGETHER_API_KEY="your-production-key"
# Security
API_SECRET_KEY="your-secure-32-char-secret"
ADMIN_API_KEY="your-secure-admin-key"
All endpoints return consistent JSON responses:
{
"success": true,
"data": [...],
"message": "Optional message",
"metadata": {
"timestamp": "2025-08-27T00:00:00.000Z",
"pagination": {...},
"statistics": {...}
}
}
Error responses include helpful details:
{
"success": false,
"error": "Descriptive error message",
"metadata": {
"timestamp": "2025-08-27T00:00:00.000Z"
}
}
- Default: 1000 requests/hour per API key
- Admin keys: 5000 requests/hour
- Rate limit headers included in responses
- Names: Properly formatted (John Smith, not "john smith")
- Titles: Standardized (Head Coach, Assistant Coach, etc.)
- Contact Info: Valid email/phone formats when available
- Confidence Scores: 0.6-1.0 based on extraction quality
The platform tracks:
- Scraping success rates by method and school
- Data quality metrics and confidence scores
- API usage patterns and performance
- Contact information coverage rates
- API Key Invalid: Check key format and permissions
- Rate Limited: Wait for rate limit window to reset
- School Not Found: Verify school ID exists in database
- Scraping Failed: Check school website accessibility
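For the "Rate Limited" case, a client can wait out the window automatically with exponential backoff. This is a hypothetical sketch: the retry policy and the use of HTTP 429 are assumptions, not documented behavior of this API.

```typescript
// Client-side backoff sketch for rate-limited requests (hypothetical
// helper; retry policy and the 429 status code are assumptions).
async function fetchWithRetry(
  url: string,
  init: RequestInit,
  retries = 3,
  baseDelayMs = 1000,
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429 || attempt >= retries) return res;
    // Exponential backoff: baseDelayMs, 2x, 4x, ...
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
}
```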
# Check API key status
curl -H "Authorization: Bearer your-key" \
http://localhost:3000/api/admin/scrape
# View recent scraping runs
curl -H "Authorization: Bearer your-admin-key" \
http://localhost:3000/api/admin/scrape?limit=5
# Test individual school scraping
bun lib/scraping/hybrid-scraper-system.ts
- Fork the repository
- Create feature branch: git checkout -b feature/new-feature
- Commit changes: git commit -am 'Add new feature'
- Push to branch: git push origin feature/new-feature
- Create Pull Request
Private - NCRA Platform Internal Use Only