Enterprise-Grade OpenAI-Compatible Proxy for Cost Optimization and Operational Excellence
CCProxy is a high-performance, production-ready proxy service that enables organizations to leverage multiple AI model providers through a unified OpenAI-compatible interface. By implementing intelligent caching, request deduplication, and provider abstraction, CCProxy delivers significant cost reductions while maintaining enterprise-grade reliability and security.
Current provider pricing shows substantial cost disparities across AI models. OpenAI GPT-5 and xAI Grok models are significantly cheaper than Anthropic Claude Opus 4.1 (approximately $11.25 versus $90 per million tokens, summing input and output rates). CCProxy addresses this challenge by:
- Eliminating Duplicate API Calls: Intelligent caching prevents redundant requests
- Optimizing Transport Efficiency: HTTP/2 and connection pooling reduce overhead
- Enabling Provider Flexibility: Seamlessly switch between providers without code changes
- Standardizing Integration: Single API interface for all supported models
| Model | Input Tokens ($/1M) | Output Tokens ($/1M) |
|---|---|---|
| OpenAI: GPT-5 | $1.25 | $10.00 |
| Anthropic: Claude Opus 4.1 | $15.00 | $75.00 |
| Anthropic: Claude Sonnet 4.5 (≤200K) | $3.00 | $15.00 |
| Anthropic: Claude Sonnet 4.5 (>200K) | $6.00 | $22.50 |
| xAI: Grok 4 Fast (≤128K) | $0.20 | $0.50 |
| xAI: Grok 4 Fast (>128K) | $0.50 | $1.00 |
- GPT-5 input and output rates are confirmed via Wired, OpenAI's own API pricing page, and TechCrunch
- Claude Opus 4.1 pricing is stated directly on Anthropic's API pricing page
- Claude Sonnet 4.5 has tiered pricing based on context length (≤200K tokens vs >200K tokens)
- Grok Code Fast 1 pricing is from xAI's official OpenRouter listing
- Grok 4 Fast pricing is from xAI's official OpenRouter listing
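The "$11.25 versus $90" comparison above is simply the sum of a model's input and output rates from the pricing table. A quick illustration of the arithmetic:

```python
# Illustrative arithmetic only: rates are taken from the pricing table above,
# expressed as (input $/1M tokens, output $/1M tokens).
PRICING = {
    "gpt-5": (1.25, 10.00),
    "claude-opus-4.1": (15.00, 75.00),
    "grok-4-fast (<=128K)": (0.20, 0.50),
}

def combined_rate(model: str) -> float:
    """Input rate plus output rate, per million tokens."""
    input_rate, output_rate = PRICING[model]
    return input_rate + output_rate

for model in PRICING:
    print(f"{model}: ${combined_rate(model):.2f} per 1M input + 1M output tokens")
# gpt-5: $11.25, claude-opus-4.1: $90.00, grok-4-fast (<=128K): $0.70
```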
CCProxy enforces maximum output token limits for supported models:
| Model | Context Window | Max Output Tokens |
|---|---|---|
| o3 | 200,000 | 100,000 |
| o3-2025-04-16 | 200,000 | 100,000 |
| o4-mini | 128,000 | 100,000 |
| gpt-5-2025-08-07 | 400,000 | 128,000 |
| gpt-5 | 400,000 | 128,000 |
| gpt-5-mini-2025-08-07 | 400,000 | 128,000 |
| gpt-5-mini | 400,000 | 128,000 |
| deepseek-reasoner | 163,840 | 65,536 |
| deepseek-chat | 163,840 | 8,192 |
| x-ai/grok-code-fast-1 | 256,000 | 10,000 |
| x-ai/grok-4-fast | 2,000,000 | 30,720 |
Note: Models not listed in this table use their default maximum output token limits.
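As a minimal sketch of what enforcing these caps looks like, the snippet below clamps a client's requested max_tokens to the documented ceiling; the dictionary and function are illustrative, not CCProxy's actual data structures:

```python
# Hypothetical illustration: a subset of the limits from the table above.
MAX_OUTPUT_TOKENS = {
    "gpt-5": 128_000,
    "gpt-5-mini": 128_000,
    "o3": 100_000,
    "deepseek-chat": 8_192,
    "x-ai/grok-4-fast": 30_720,
}

def clamp_max_tokens(model: str, requested: int) -> int:
    """Clamp the requested output budget to the model's ceiling, if one is listed."""
    limit = MAX_OUTPUT_TOKENS.get(model)
    return min(requested, limit) if limit is not None else requested

assert clamp_max_tokens("gpt-5", 200_000) == 128_000       # capped to the table value
assert clamp_max_tokens("unlisted-model", 2_048) == 2_048  # unlisted models keep their default limit
```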
CCProxy includes high-performance HTTP client optimizations for faster OpenAI API communication:
- HTTP/2 Support: Enabled by default for request multiplexing
- Enhanced Connection Pooling: 50 keepalive connections, 500 max connections
- Compression: Supports gzip, deflate, and Brotli
- Smart Retries: Automatic retry with exponential backoff
- Response Caching: Prevents duplicate API calls and handles timeouts
- Async Processing: Full async/await architecture with ThreadPoolExecutor for CPU-bound operations
- Parallel Message Conversion: Concurrent processing of message batches for reduced latency
- Non-blocking I/O: Async streaming with httpx for improved throughput
Expected performance improvements:
- 30-50% faster single-request latency
- 2-3x better throughput for concurrent requests
- Reduced connection overhead with persistent connections
- 40% reduction in message conversion time via async parallelization
- Near-zero blocking on I/O operations with full async pipeline
See HTTP_OPTIMIZATION.md for details.
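For orientation, an httpx client configured along the lines listed above could look like the sketch below; the exact values and wiring CCProxy uses live in HTTP_OPTIMIZATION.md and the infrastructure layer, so treat this as illustrative rather than the actual implementation:

```python
import httpx

def build_http_client() -> httpx.AsyncClient:
    """Sketch of an HTTP/2 client with pooling and compression, per the notes above."""
    return httpx.AsyncClient(
        http2=True,                                        # request multiplexing (requires the h2 extra)
        limits=httpx.Limits(
            max_keepalive_connections=50,                  # keepalive pool size from the list above
            max_connections=500,                           # overall connection cap
            keepalive_expiry=120.0,                        # seconds; matches the 120s keepalive noted below
        ),
        headers={"Accept-Encoding": "gzip, deflate, br"},  # gzip, deflate, and Brotli
        timeout=httpx.Timeout(60.0, connect=5.0),          # illustrative timeouts
    )
```

Smart retries and response caching are layered on top of this client in the application layer rather than being httpx settings; see the architecture diagram below.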
CCProxy follows Clean Architecture (Hexagonal Architecture) principles with clear separation of concerns:
```mermaid
graph TB
subgraph "External Clients"
Client[Claude Code / API Clients]
end
subgraph "Interface Layer"
HTTP[FastAPI HTTP Interface]
Routes[Route Handlers]
MW[Middleware Chain]
Stream[SSE Streaming]
Guard[Input Guardrails]
end
subgraph "Application Layer"
Conv[Message Converters]
Cache[Response Cache]
Token[Tokenizer Service]
Model[Model Selection]
Valid[Request Validator]
Error[Error Tracker]
end
subgraph "Domain Layer"
DModel[Domain Models]
DExc[Domain Exceptions]
BLogic[Core Business Logic]
end
subgraph "Infrastructure Layer"
Provider[Provider Abstraction]
OpenAI[OpenAI Provider]
HTTP2[HTTP/2 Client]
Pool[Connection Pool]
end
subgraph "External Services"
OAPI[OpenAI API]
OR[OpenRouter API]
end
subgraph "Configuration & Monitoring"
Config[Settings/Config]
Log[JSON Logging]
Monitor[Metrics & Health]
end
Client -->|Anthropic Messages API| HTTP
HTTP --> Routes
Routes --> MW
MW --> Guard
Guard --> Conv
Conv --> Cache
Conv --> Token
Conv --> Model
Conv --> Valid
Conv --> Error
Conv --> DModel
Error --> DExc
Model --> BLogic
Conv --> Provider
Provider --> OpenAI
OpenAI --> HTTP2
HTTP2 --> Pool
Pool --> OAPI
Pool --> OR
Routes --> Stream
Stream --> Provider
Config -.->|Inject| HTTP
Log -.->|Track| MW
Monitor -.->|Observe| Cache
Monitor -.->|Health| HTTP
style Client fill:#e1f5fe
style HTTP fill:#fff3e0
style Routes fill:#fff3e0
style MW fill:#fff3e0
style Stream fill:#fff3e0
style Guard fill:#fff3e0
style Conv fill:#f3e5f5
style Cache fill:#f3e5f5
style Token fill:#f3e5f5
style Model fill:#f3e5f5
style Valid fill:#f3e5f5
style Error fill:#f3e5f5
style DModel fill:#e8f5e9
style DExc fill:#e8f5e9
style BLogic fill:#e8f5e9
style Provider fill:#fce4ec
style OpenAI fill:#fce4ec
style HTTP2 fill:#fce4ec
style Pool fill:#fce4ec
style OAPI fill:#ffebee
style OR fill:#ffebee
style Config fill:#f5f5f5
style Log fill:#f5f5f5
style Monitor fill:#f5f5f5
```
Domain Layer:
- Core Business Logic: Pure business rules independent of external concerns
- Domain Models: Core entities and data structures
- Domain Exceptions: Business-specific error handling
Application Layer:
- Use Cases: Orchestrates domain logic and infrastructure
- Message Conversion: Anthropic ↔ OpenAI format translation
- Caching Strategy: Response caching with de-duplication
- Token Management: Async token counting with TTL cache (300s)
- Model Mapping: Routes requests to appropriate models (Opus/Sonnet → BIG, Haiku → SMALL)
- Request Validation: Cryptographic hashing with LRU cache (10k capacity)
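The caching and request-validation entries above pair a cryptographic hash of the request with bounded, time-limited caches. A minimal sketch of that idea (class and method names are illustrative, not CCProxy's actual code):

```python
from __future__ import annotations

import hashlib
import json
import time
from collections import OrderedDict

class ResponseCache:
    """Toy LRU + TTL cache keyed by a SHA-256 hash of the request payload."""

    def __init__(self, capacity: int = 10_000, ttl: float = 300.0):
        self._store: OrderedDict[str, tuple[float, dict]] = OrderedDict()
        self._capacity = capacity
        self._ttl = ttl

    @staticmethod
    def key(request_body: dict) -> str:
        # Canonical JSON so logically identical requests hash identically.
        canonical = json.dumps(request_body, sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()

    def get(self, key: str) -> dict | None:
        entry = self._store.get(key)
        if entry is None or time.monotonic() - entry[0] > self._ttl:
            return None
        self._store.move_to_end(key)            # refresh LRU position
        return entry[1]

    def put(self, key: str, response: dict) -> None:
        self._store[key] = (time.monotonic(), response)
        self._store.move_to_end(key)
        if len(self._store) > self._capacity:   # evict the least recently used entry
            self._store.popitem(last=False)
```

The 300s TTL and 10k capacity defaults mirror the figures quoted in the list above.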
Infrastructure Layer:
- Provider Integration: OpenAI/OpenRouter API communication
- HTTP/2 Client: High-performance connection pooling (500 connections, 120s keepalive)
- Circuit Breaker: Fault tolerance and resilience patterns
- External Services: Handles all third-party integrations
Interface Layer:
- HTTP API: FastAPI application with dependency injection
- Route Handlers: Request/response processing
- SSE Streaming: Real-time response streaming
- Middleware: Request tracing, logging, error handling
- Input Validation: Security guardrails and sanitization
Configuration & Monitoring:
- Configuration: Environment-based settings with Pydantic validation
- Logging: Structured JSON logging with request correlation
- Monitoring: Performance metrics, health checks, cache statistics
- Error Tracking: Centralized error monitoring and alerting
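To make the provider abstraction concrete, one way to express it is a typing.Protocol that both the OpenAI and OpenRouter providers satisfy; the method names below are assumptions for illustration, not CCProxy's actual interface:

```python
from typing import AsyncIterator, Protocol

class ChatProvider(Protocol):
    """Minimal surface a backing provider must offer to the application layer."""

    async def complete(self, request: dict) -> dict:
        """Return a full chat completion for an already-converted OpenAI-format request."""
        ...

    def stream(self, request: dict) -> AsyncIterator[dict]:
        """Yield completion chunks suitable for SSE streaming."""
        ...
```

Because the rest of the service depends only on this abstraction, swapping OpenAI for OpenRouter (or adding another OpenAI-compatible backend) does not touch the route handlers or converters.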
- Create your environment file from the template:

```bash
cp .env.example .env
# edit .env to set OPENAI_API_KEY, BIG_MODEL_NAME, SMALL_MODEL_NAME
```

- Install Python dependencies into an isolated environment using uv:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv
. .venv/bin/activate
uv pip install -r requirements.txt
```

- Start the server (pure Python with Uvicorn):

```bash
./run-ccproxy.sh
```

For local development, you can set `IS_LOCAL_DEPLOYMENT=True` in your `.env` file to use a single worker process for reduced resource usage.

- Point your Anthropic client at the proxy:

```bash
export ANTHROPIC_BASE_URL=http://localhost:11434
```

Then start your coding session with Claude Code:

```bash
claude
```
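If you are calling the proxy from your own code rather than through Claude Code, a minimal sketch using the official anthropic Python SDK (assuming the proxy serves the standard Anthropic Messages endpoint on the port configured above, and that the Claude model alias is mapped to BIG_MODEL_NAME as described in the architecture section):

```python
from anthropic import Anthropic

# Point the SDK at CCProxy instead of api.anthropic.com; the upstream provider
# key lives in the proxy's .env, so the value passed here may be a placeholder
# depending on your setup.
client = Anthropic(base_url="http://localhost:11434", api_key="placeholder")

response = client.messages.create(
    model="claude-sonnet-4-5",   # assumed alias, mapped by CCProxy to BIG_MODEL_NAME
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this repository."}],
)
print(response.content[0].text)
```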
Key environment variables:
- `OPENAI_API_KEY`: Your OpenAI API key (or use `OPENROUTER_API_KEY`)
- `BIG_MODEL_NAME`: The OpenAI model to use for large Anthropic models (e.g., `gpt-5-2025-08-07`)
- `SMALL_MODEL_NAME`: The OpenAI model to use for small Anthropic models (e.g., `gpt-5-mini-2025-08-07`)
- `IS_LOCAL_DEPLOYMENT`: Set to `True` to use a single worker process for local development (default: `False`)
- `HOST`: Server host (default: `127.0.0.1`)
- `PORT`: Server port (default: `11434`)
- `LOG_LEVEL`: Logging level (default: `INFO`)
- `OPENAI_BASE_URL`: OpenAI API base URL (default: `https://api.openai.com/v1`)
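Per the architecture notes, settings are environment-based and validated with Pydantic. A sketch of how such a settings class could look, with field names mirroring the variables above (the actual settings module in CCProxy may be organized differently):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    """Illustrative env-backed settings; names and defaults follow the list above."""

    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str                  # required
    big_model_name: str                  # e.g. "gpt-5-2025-08-07"
    small_model_name: str                # e.g. "gpt-5-mini-2025-08-07"
    is_local_deployment: bool = False
    host: str = "127.0.0.1"
    port: int = 11434
    log_level: str = "INFO"
    openai_base_url: str = "https://api.openai.com/v1"

settings = Settings()  # raises a validation error if required variables are missing
```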
CCWorkforce Engineers