Rater evaluation portal for assessing LLM-generated clinical differential diagnoses.
This application supports a 14-day proof-of-concept study to evaluate whether LLM-generated differential diagnoses are clinically useful according to medical professionals.
- Vignettes: 15 clinical cases (6 common, 5 ambiguous, 4 emergent)
- Raters: 6-8 physicians
- Model: GPT-4o with temperature 0.1
- Evaluation: 4 questions per vignette (modeled in the sketch after this list)
  - Relevance (1-5 Likert)
  - Missing critical diagnosis (Yes/No + free text)
  - Safety concern (1-5 Likert)
  - Acceptable for clinical use (Yes/No)
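The four responses map naturally onto one typed record per rater and vignette. A minimal sketch of that shape using Zod (field names here are illustrative, not the app's actual schema):

```ts
// Illustrative Zod model of a single rater's response to one vignette.
// Field names are assumptions, not the app's actual schema.
import { z } from "zod";

const likert = z.number().int().min(1).max(5);

export const evaluationSchema = z.object({
  vignetteId: z.string(),
  raterId: z.string(),
  relevance: likert,                            // Q1: 1-5 Likert
  missingCriticalDx: z.boolean(),               // Q2: Yes/No
  missingCriticalDxText: z.string().optional(), // Q2: free text when Yes
  safetyConcern: likert,                        // Q3: 1-5 Likert
  acceptableForClinicalUse: z.boolean(),        // Q4: Yes/No
});

export type Evaluation = z.infer<typeof evaluationSchema>;
```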
```bash
# Install dependencies
bun install

# Create and seed local database
bunx wrangler d1 create ai-dx-research
bunx wrangler d1 execute ai-dx-research --local --file=./migrations/0001_init.sql
bunx wrangler d1 execute ai-dx-research --local --file=./migrations/0002_seed_vignettes.sql

# Start development server
bun dev
```

Visit http://localhost:5173.
See SETUP.md for detailed instructions.
- Admin Panel (`/admin`) - Generate LLM outputs for all vignettes
- Landing Page (`/`) - Study overview and entry point
- Consent (`/evaluate/consent`) - Rater consent form
- Calibration (`/evaluate/calibration`) - Training with 2 practice cases
- Survey (`/evaluate/survey`) - Evaluate the 15 vignettes (route declaration sketched below)
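Each page above corresponds to a file-based route. A minimal sketch of how the survey route might be declared with TanStack Start's routing (file path and component body are illustrative):

```tsx
// src/routes/evaluate/survey.tsx -- illustrative path; the real file may differ.
import { createFileRoute } from "@tanstack/react-router";

export const Route = createFileRoute("/evaluate/survey")({
  component: SurveyPage,
});

function SurveyPage() {
  // Renders the 15-vignette evaluation flow for the current rater.
  return <main>Survey</main>;
}
```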
- Framework: TanStack Start (React + Cloudflare Workers)
- Database: Cloudflare D1 (SQLite)
- LLM: Vercel AI SDK with OpenAI GPT-4o
- Schema: Zod for structured LLM output (see the generation sketch below)
- UI: Tailwind CSS + shadcn/ui
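The AI SDK, GPT-4o, and Zod entries work together: `generateObject` sends the prompt to the model and validates the response against a Zod schema before it reaches the app. A sketch under an assumed schema and prompt (the real ones live in the codebase):

```ts
// Sketch of structured differential-diagnosis generation.
// The schema and prompt wording are assumptions for illustration.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const differentialSchema = z.object({
  diagnoses: z
    .array(
      z.object({
        name: z.string(),      // diagnosis label
        rationale: z.string(), // brief clinical justification
      })
    )
    .min(1),
});

export async function generateDifferential(vignetteText: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    temperature: 0.1, // matches the study's fixed setting
    schema: differentialSchema,
    prompt: `Provide a differential diagnosis for this clinical vignette:\n${vignetteText}`,
  });
  return object; // already parsed and validated against the schema
}
```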
- Structured LLM output (schema-validated JSON rather than brittle free-text parsing)
- Progress tracking per rater
- Duplicate evaluation prevention (see the sketch after this list)
- Mobile-responsive design
- Calibration materials included
- 15 pre-written clinical vignettes
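Duplicate prevention is simplest to enforce at the database layer. A sketch assuming the evaluations table carries a `UNIQUE(rater_id, vignette_id)` constraint (table, column, and binding names are illustrative):

```ts
// D1Database comes from @cloudflare/workers-types.
interface Env {
  DB: D1Database;
}

export async function saveEvaluation(
  env: Env,
  raterId: string,
  vignetteId: string,
  answers: string // JSON-encoded responses
): Promise<boolean> {
  // ON CONFLICT DO NOTHING turns a repeat submission into a no-op
  // instead of a second row for the same rater/vignette pair.
  const result = await env.DB.prepare(
    `INSERT INTO evaluations (rater_id, vignette_id, answers)
     VALUES (?1, ?2, ?3)
     ON CONFLICT (rater_id, vignette_id) DO NOTHING`
  )
    .bind(raterId, vignetteId, answers)
    .run();

  return result.meta.changes > 0; // false when the pair was already recorded
}
```

A `false` return lets the UI show an "already submitted" state rather than silently double-counting.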
```bash
# Build and deploy to Cloudflare Pages
bun run build
bun run deploy
```

For production setup and data export, see SETUP.md.
Research use only.