This backend provides an AI chat API using FastAPI as the router and llama.cpp for inference. It includes OpenAI Harmony format support for improved GPT-OSS model responses.
## Architecture

```
[Client] → [Router:8000] → [Inference:8080]
               ↓                 ↓
          FastAPI API      llama.cpp server
          + Harmony        + GPT-OSS model
```
## Local Development (Apple Silicon)

Docker has significant performance limitations on Apple Silicon. Use the local development script instead:

```bash
cd backend
# Start both services with full GPU acceleration (~15x faster than Docker)
./start-local-dev.sh
# Test the setup
./test-local-dev.sh
# Stop services: press Ctrl+C in the terminal running the start script
```

## Docker Setup

```bash
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f
```

## Quick Test

```bash
# Health check
curl http://localhost:8000/health
# Chat endpoint
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Hello, how are you?"}'- Purpose: API gateway and request handler
- Features:
  - OpenAI Harmony format support
  - Request routing to the inference service
  - Response parsing and cleanup
- Endpoints:
  - `GET /health` - Health check
  - `POST /api/chat` - Chat completion
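
The proxy flow can be pictured with a short sketch. This is a hypothetical simplification, not the actual `main.py`: the request body follows the curl examples above, while `ChatRequest` and the response shape are illustrative names, and the real router adds Harmony rendering and response cleanup on top.

```python
# Minimal sketch of the router's proxy flow (illustrative; not the real main.py).
# Assumes the documented INFERENCE_URL default and the {"message": ...} request
# body from the curl examples; ChatRequest and the response shape are hypothetical.
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

INFERENCE_URL = os.getenv("INFERENCE_URL", "http://inference:8080")

app = FastAPI()


class ChatRequest(BaseModel):
    message: str


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.post("/api/chat")
async def chat(req: ChatRequest):
    # With Harmony disabled, forward to the standard chat completion endpoint.
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            f"{INFERENCE_URL}/v1/chat/completions",
            json={"messages": [{"role": "user", "content": req.message}]},
        )
    data = resp.json()
    return {"response": data["choices"][0]["message"]["content"]}
```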
### Inference (Port 8080)

- Purpose: Runs the GPT-OSS 20B model
- Engine: llama.cpp server
- Model: `gpt-oss-20b-Q4_K_S.gguf` (quantized)
- Endpoints:
  - `/v1/completions` - Used with Harmony format
  - `/v1/chat/completions` - Standard chat format
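
For debugging, it can help to query the inference service directly and bypass the router. A sketch against the standard chat endpoint (the payload follows the OpenAI-compatible shape accepted by llama.cpp's server; `max_tokens` is an optional knob):

```python
# Query the llama.cpp server directly on port 8080, bypassing the router.
# Payload follows the OpenAI-compatible /v1/chat/completions shape.
import httpx

resp = httpx.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128,
    },
    timeout=120.0,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```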
## Environment Variables

### Router

- `HARMONY_ENABLED` - Enable/disable Harmony format (default: `true`)
- `HARMONY_REASONING_EFFORT` - Reasoning depth: `low`/`medium`/`high` (default: `low`)
- `INFERENCE_URL` - Inference service URL (default: `http://inference:8080`)
- `LOG_LEVEL` - Logging level (default: `DEBUG`)
### Inference

- `MODEL_PATH` - Path to the GGUF model file
- `HOST` - Server host (default: `0.0.0.0`)
- `PORT` - Server port (default: `8080`)
- `CONTEXT_SIZE` - Context window size (default: `4096`)
- `THREADS` - CPU threads (`0` = auto)
- `GPU_LAYERS` - GPU layers for acceleration (`0` = CPU only)
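
Read together, the router variables map naturally onto a small settings module. A sketch of what `router/config.py` might look like, using the documented defaults (the real file may be organized differently):

```python
# Illustrative sketch of reading the documented router settings; the actual
# router/config.py may differ in structure.
import os

HARMONY_ENABLED = os.getenv("HARMONY_ENABLED", "true").lower() == "true"
HARMONY_REASONING_EFFORT = os.getenv("HARMONY_REASONING_EFFORT", "low")  # low/medium/high
INFERENCE_URL = os.getenv("INFERENCE_URL", "http://inference:8080")
LOG_LEVEL = os.getenv("LOG_LEVEL", "DEBUG")
```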
## Service Management

```bash
# Stop services
docker-compose down
# Rebuild router (after code changes)
docker-compose build router
# Restart services
docker-compose up -d
```

```bash
# All services
docker-compose logs -f
# Specific service
docker-compose logs -f router
docker-compose logs -f inference
```

```bash
# Complete cleanup and restart
docker-compose down -v
docker-compose build --no-cache
docker-compose up -d
```

## Harmony Format

The backend uses the OpenAI Harmony format to improve GPT-OSS model responses:
- Enabled: Model receives structured conversation context
- Disabled: Standard chat completion format
Harmony provides:
- Better reasoning with analysis channels
- Cleaner final responses
- Mobile-optimized brevity
- Structured token handling
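
Conceptually, when Harmony is enabled the router renders the conversation into a token-structured prompt for `/v1/completions` and keeps only the model's final channel when cleaning the response. A rough, assumption-laden illustration of that idea (not the actual `harmony_service.py`; the exact token layout is defined by the Harmony spec):

```python
# Rough illustration of Harmony-style structure: messages wrapped in special
# tokens, with only the "final" channel kept when parsing model output.
# Assumption-laden sketch; not the actual harmony_service.py logic.
import re


def render_prompt(system: str, user: str) -> str:
    # Structured conversation context sent to /v1/completions when Harmony is on.
    return (
        f"<|start|>system<|message|>{system}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        "<|start|>assistant"
    )


def extract_final(raw: str) -> str:
    # Drop analysis-channel reasoning; return only the final-channel text.
    m = re.search(
        r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
        raw,
        re.S,
    )
    return m.group(1).strip() if m else raw.strip()
```

With `HARMONY_ENABLED=false`, both steps are skipped and a plain chat completion is sent instead.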
## Troubleshooting

```bash
# Check logs
docker-compose logs router
docker-compose logs inference
# Verify health
docker-compose ps
```

```bash
# Check if the model file exists
docker-compose exec inference ls -la /models/
# Check inference logs
docker-compose logs inference | grep -i error
```

```bash
# Rebuild router with fresh dependencies
docker-compose build --no-cache router
```

```bash
# Check if ports are in use
lsof -i :8000
lsof -i :8080
# Use different ports in docker-compose.yml if needed
```

## Testing Harmony

```bash
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Explain quantum computing in one sentence"}'# Modify docker-compose.yml: HARMONY_ENABLED=false
docker-compose up -d router
# Test again
curl -X POST http://localhost:8000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Explain quantum computing in one sentence"}'backend/
├── docker-compose.yml # Service orchestration
├── router/
│ ├── Dockerfile # Router container
│ ├── main.py # FastAPI application
│ ├── harmony_service.py # Harmony format handler
│ ├── config.py # Configuration
│ └── pyproject.toml # Dependencies
└── inference/
├── Dockerfile # Inference container
└── model/ # Model files (mounted)
## Production Considerations

- Add authentication
- Implement streaming responses
- Add request caching
- Set up monitoring/metrics
- Configure HTTPS with nginx
## Known Issues

- 4CST in CET doesn't work in prod but does in dev
- Follow-up on exercise works in prod but not in dev
- This likely has something to do with the Harmony main prompt