Mini Cloud Control Plane

A simplified cloud control plane demonstrating asynchronous project lifecycle management — similar to how platforms like Supabase provision backend resources on demand. Built from scratch with a TypeScript API, Go worker, PostgreSQL as both the data store and message broker, and a React dashboard.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          Docker Network                             │
│                                                                     │
│  ┌───────────────┐   HTTP    ┌─────────────────────────────────┐   │
│  │   Browser     │ ────────► │    React Frontend  :5173        │   │
│  │ localhost:5173│ ◄──────── │    Vite dev server              │   │
│  └───────────────┘           │    /api/* → proxy               │   │
│                              └────────────┬────────────────────┘   │
│                                           │ proxy /api → :3000      │
│                              ┌────────────▼────────────────────┐   │
│  ┌───────────────┐           │   TypeScript API   :3000        │   │
│  │  curl / tools │ ─────────►│   Node.js + Express             │   │
│  │ localhost:3001│           │                                 │   │
│  └───────────────┘           │  POST /projects                 │   │
│                              │    INSERT row (status=creating) │   │
│                              │    pg_notify("provisioning_jobs")│   │
│                              │  GET  /projects[/:id]           │   │
│                              │  POST /projects/:id/retry       │   │
│                              │  GET  /metrics   GET /health    │   │
│                              └──────────────┬──────────────────┘   │
│                                             │ SQL + NOTIFY          │
│                              ┌──────────────▼──────────────────┐   │
│                              │     PostgreSQL 16   :5432        │   │
│                              │  projects table                 │   │
│                              │  lifecycle_events table         │   │
│                              │  LISTEN/NOTIFY channel          │   │
│                              └──────────────┬──────────────────┘   │
│                                             │ LISTEN                │
│                              ┌──────────────▼──────────────────┐   │
│                              │      Go Worker                  │   │
│                              │  pq.Listener (dedicated conn)   │   │
│                              │  Claim job atomically           │   │
│                              │  Simulate 2–5s work             │   │
│                              │  80% → ready / 20% → failed     │   │
│                              │  Record lifecycle_events        │   │
│                              └─────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Request flow — creating a project

1.  Browser sends POST /api/projects
2.  Vite proxy strips /api, forwards to API :3000
3.  API inserts row: { status: "creating" }
4.  API records lifecycle event: none → creating
5.  API calls pg_notify("provisioning_jobs", '{"project_id":"<uuid>"}')
6.  API returns 201 immediately — no waiting for the worker
7.  Worker receives NOTIFY on its dedicated listener connection
8.  Worker claims the job atomically:
    UPDATE projects SET status='provisioning' WHERE id=$1 AND status='creating'
    → if 0 rows are affected, another worker already claimed it → skip (idempotency guard)
9.  Worker sleeps 2–5 seconds (simulated provisioning)
10. Worker writes: ready (80%) or failed (20%) + error_reason
11. Worker records lifecycle event: provisioning → ready|failed
12. Browser poll picks up the new status
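
Steps 3–6 can be sketched as a single handler transaction. This is an illustrative sketch, not the repo's actual code: the client type is a minimal stand-in for a node-postgres PoolClient, and the helper names (createProject, notifyPayload) are hypothetical; table, column, and channel names follow this README.

```typescript
// Minimal structural client type so the sketch stands alone; in the real
// API this would be a node-postgres (pg) PoolClient.
type QueryResult = { rows: any[] };
type PgClient = { query: (sql: string, params?: unknown[]) => Promise<QueryResult> };

// The notification payload from step 5: {"project_id":"<uuid>"}
function notifyPayload(projectId: string): string {
  return JSON.stringify({ project_id: projectId });
}

async function createProject(client: PgClient, name: string) {
  await client.query("BEGIN");
  try {
    // Step 3: insert the row with status 'creating'
    const { rows } = await client.query(
      "INSERT INTO projects (name, status) VALUES ($1, 'creating') RETURNING *",
      [name]
    );
    const project = rows[0];
    // Step 4: record the lifecycle event none → creating
    await client.query(
      "INSERT INTO lifecycle_events (project_id, previous_state, new_state) " +
        "VALUES ($1, NULL, 'creating')",
      [project.id]
    );
    // Step 5: fire the job notification inside the same transaction;
    // if the transaction rolls back, PostgreSQL discards the notification.
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      notifyPayload(project.id),
    ]);
    await client.query("COMMIT");
    return project; // Step 6: respond 201 immediately; the worker takes over
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Firing pg_notify inside the transaction is what makes step 5 safe: a rollback discards both the row and the notification together.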

Stack

Component      Technology
Frontend       React 18 + Vite + TypeScript
API            Node.js 20 + Express + TypeScript
Worker         Go 1.22 + lib/pq
Queue          PostgreSQL pg_notify / LISTEN
Database       PostgreSQL 16
Orchestration  Docker Compose

Project Lifecycle State Machine

                     pg_notify
         ┌─────────────────────────────────────┐
         │                                     │
   ┌─────▼──────┐    Worker claims   ┌─────────┴────────┐
   │  creating  │ ────────────────► │   provisioning   │
   └────────────┘                   └────────┬─────────┘
                                             │
                          ┌──────────────────┤
                          │ 80%              │ 20%
                 ┌────────▼──────┐  ┌────────▼──────┐
                 │    ready      │  │    failed     │
                 └───────────────┘  └───────┬───────┘
                                            │
                                     POST /retry
                                            │
                                    (resets to creating,
                                     inside same TX as
                                     pg_notify — atomic)
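
The diagram above can be captured as a small transition table. The sketch below is illustrative only: the real system enforces transitions with conditional UPDATEs in the database, not an in-process map.

```typescript
// Illustrative guard table for the lifecycle state machine above.
type Status = "creating" | "provisioning" | "ready" | "failed";

const transitions: Record<Status, Status[]> = {
  creating: ["provisioning"],        // worker claims the job
  provisioning: ["ready", "failed"], // 80% / 20% simulated outcome
  ready: [],                         // terminal
  failed: ["creating"],              // POST /retry re-queues the project
};

function canTransition(from: Status, to: Status): boolean {
  return transitions[from].includes(to);
}
```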

Quick Start

# 1. Copy environment file
cp .env.example .env

# 2. Build and start all four services
docker compose up --build

# 3. Open the dashboard
open http://localhost:5173

# 4. Or hit the API directly
curl -s http://localhost:3001/health | jq

API Reference

All endpoints are available at http://localhost:3001 (host) or http://api:3000 (inside Docker).

POST /projects

Create a new project. Returns immediately (status: creating); provisioning is async.

curl -s -X POST http://localhost:3001/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}' | jq

Response 201:

{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "name": "my-project",
  "status": "creating",
  "error_reason": null,
  "created_at": "2024-01-01T00:00:00.000Z",
  "updated_at": "2024-01-01T00:00:00.000Z"
}

GET /projects

List all projects. Optional ?status= filter.

curl -s http://localhost:3001/projects | jq
curl -s "http://localhost:3001/projects?status=failed" | jq

GET /projects/:id

Get a single project by UUID.

curl -s http://localhost:3001/projects/<uuid> | jq

POST /projects/:id/retry

Re-queue a failed project. Only works when status === "failed". Returns 409 otherwise.

curl -s -X POST http://localhost:3001/projects/<uuid>/retry | jq

GET /metrics

Live counts by status.

curl -s http://localhost:3001/metrics | jq

Response:

{
  "counts_by_status": { "creating": 0, "provisioning": 1, "ready": 7, "failed": 2 },
  "failed_total": 2
}

GET /health

DB connectivity check. Used by Docker Compose health checks.

curl -s http://localhost:3001/health | jq

Dashboard (Frontend)

The React UI at http://localhost:5173 has three tabs:

Tab       What it shows
Projects  Create projects, watch live status transitions, retry failures
Metrics   Success rate, avg provision time, status breakdown bar, failures table
Settings  Poll interval (1 s–30 s), status filter, live API/DB health, stack info

Failure Scenarios

Worker crashes mid-provisioning

A project stuck in provisioning with no worker running will remain there indefinitely. There is no heartbeat/timeout mechanism in this implementation — a real system would add a background sweeper that resets stale provisioning rows back to creating after a configurable timeout.
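
Such a sweeper is not implemented in this repo, but a minimal sketch might look like the following, assuming a 5-minute default timeout and the schema described above. The client type is a stand-in for a node-postgres client, and the helper names are hypothetical.

```typescript
// Hypothetical sweeper, NOT part of this repo: resets rows stuck in
// 'provisioning' longer than `timeoutMinutes` back to 'creating', then
// re-notifies the worker channel so they get picked up again.
type SweepClient = { query: (sql: string, params?: unknown[]) => Promise<{ rows: any[] }> };

function sweepSQL(timeoutMinutes: number): string {
  // timeoutMinutes comes from config, not user input, so interpolation is safe here
  return (
    "UPDATE projects SET status = 'creating', updated_at = now() " +
    "WHERE status = 'provisioning' " +
    `AND updated_at < now() - interval '${timeoutMinutes} minutes' ` +
    "RETURNING id"
  );
}

async function sweepStale(client: SweepClient, timeoutMinutes = 5): Promise<string[]> {
  const { rows } = await client.query(sweepSQL(timeoutMinutes));
  for (const row of rows) {
    // Re-enqueue each reset project on the same channel the API uses
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      JSON.stringify({ project_id: row.id }),
    ]);
  }
  return rows.map((r) => r.id);
}
```

Run on an interval (e.g., every minute) alongside the workers; the conditional WHERE clause makes concurrent sweepers harmless.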

PostgreSQL goes down while the API is running

The API uses a connection pool (pg.Pool). Requests that hit during the outage return 500. The startup loop retries the initial DB connection up to 15 times (2 s apart) before exiting. Once Postgres recovers, the pool reconnects automatically.

PostgreSQL goes down while the worker is running

pq.Listener has built-in reconnection with configurable min/max backoff (10 s–60 s). The outer ListenAndProcess loop adds an additional 5 s pause before restarting the listener. Any pg_notify events fired during the outage are lost — PostgreSQL does not buffer undelivered notifications. Projects in creating at the time of the outage will stay creating until manually retried.

Duplicate notifications for the same project

Can happen if the API retries a pg_notify or if two workers are running. The worker's atomic claim guard (UPDATE ... WHERE status='creating') ensures only one worker transitions the project. The second sees rowsAffected=0 and skips silently.
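
The guard's decision logic can be illustrated in miniature. In the real worker the check-and-set is a single conditional UPDATE, so atomicity comes from the database; this in-memory stand-in only shows the branch each worker takes.

```typescript
// In-memory stand-in for the worker's atomic claim. The real guard is a
// single conditional UPDATE (... WHERE status = 'creating'); atomicity
// comes from PostgreSQL, not from this object mutation.
type Project = { id: string; status: string };

function claim(project: Project): boolean {
  if (project.status !== "creating") {
    return false; // rowsAffected = 0: another worker got here first, skip
  }
  project.status = "provisioning";
  return true;
}
```

The first caller wins; every later caller sees a non-creating status and backs off silently.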

API container restarts after pg_notify but before response

The INSERT and the pg_notify commit in the same transaction, so either both take effect or neither does. If the container crashes after committing but before sending the HTTP response, the client sees a network error, but the worker still processes the project. The client can poll GET /projects to confirm the state.

Retry transaction rollback

POST /projects/:id/retry runs the status reset and the pg_notify inside a single transaction. If the transaction rolls back (e.g., DB error), PostgreSQL also discards the notification. No ghost jobs are enqueued.
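
A sketch of that handler, where the client type is a stand-in for a node-postgres client, the helper name is hypothetical, and the success status code is illustrative:

```typescript
// Illustrative retry handler: the status reset and the pg_notify share one
// transaction, so a rollback also discards the notification (no ghost jobs).
type RetryClient = { query: (sql: string, params?: unknown[]) => Promise<{ rowCount: number }> };

async function retryProject(client: RetryClient, id: string): Promise<number> {
  await client.query("BEGIN");
  try {
    const { rowCount } = await client.query(
      "UPDATE projects SET status = 'creating', error_reason = NULL " +
        "WHERE id = $1 AND status = 'failed'",
      [id]
    );
    if (rowCount === 0) {
      await client.query("ROLLBACK");
      return 409; // project is not in 'failed': nothing to retry
    }
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      JSON.stringify({ project_id: id }),
    ]);
    await client.query("COMMIT");
    return 200; // success code illustrative
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```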


Trade-offs

PostgreSQL as a queue vs. a dedicated broker (Kafka, RabbitMQ, SQS)

Chose PostgreSQL because:

  • Zero extra infrastructure — the queue is the same DB that stores state, so the notification and the row update are always consistent.
  • pg_notify fires inside transactions; if the transaction rolls back, the notification is discarded — free at-most-once delivery semantics.
  • Perfect for low-to-medium throughput (thousands of jobs/day).

Costs:

  • Notifications are not persisted. If no listener is connected when pg_notify fires, the message is dropped. A dedicated broker (Kafka, SQS) stores messages durably.
  • LISTEN uses a dedicated connection per worker — at very high worker counts this strains Postgres connection limits.
  • No backpressure or flow control; a burst of notifications floods workers immediately.

Polling vs. WebSockets / SSE on the frontend

Chose polling because:

  • Simple to implement and debug; no persistent connections to manage.
  • Adjustable interval (1–30 s) via the Settings tab.

Costs:

  • Even at 2 s polling, each browser tab sends 30 requests/min to the API. WebSockets or SSE would push updates only on changes, eliminating redundant requests.
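
A minimal polling loop in this spirit, with the fetch function injected so the sketch stands alone (names hypothetical):

```typescript
// Illustrative polling loop like the dashboard's, with the fetcher injected
// so it can be exercised without a running API.
type Fetcher = () => Promise<unknown>;

function clampInterval(ms: number): number {
  // The Settings tab allows 1–30 s
  return Math.min(30_000, Math.max(1_000, ms));
}

function startPolling(fetchProjects: Fetcher, onData: (d: unknown) => void, intervalMs: number) {
  const tick = async () => {
    try {
      onData(await fetchProjects());
    } catch {
      // ignore transient errors; the next tick retries
    }
  };
  void tick(); // fetch once immediately, then on the interval
  const handle = setInterval(tick, clampInterval(intervalMs));
  return () => clearInterval(handle); // returns a stop function
}
```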

Simulation vs. real provisioning

The worker's 2–5 s sleep and random outcome are intentional simplifications. In a real system this would be replaced by calls to a cloud provider SDK (e.g., AWS SDK, GCP client), Terraform, or Kubernetes API.


Scalability Notes

Horizontal API scaling

The API is stateless — multiple replicas behind a load balancer work without coordination. Each replica has its own pool connection to Postgres. The only shared state is the database.

Horizontal worker scaling

Multiple worker instances can run simultaneously. The idempotency guard (UPDATE ... WHERE status='creating') ensures each project is claimed by at most one worker. Adding workers increases throughput roughly linearly, up to Postgres connection limits.

Database bottlenecks

At high project creation rates:

  • projects and lifecycle_events grow unboundedly. Add a data-retention job to archive old rows.
  • The GROUP BY status query in GET /metrics does a full table scan. Add a materialized counter (e.g., a project_counts table updated by triggers) for O(1) metrics.
  • pg_notify payload is limited to 8 KB and notifications are not queued when no listener is connected. Use a pending_jobs table as a durable outbox for high-reliability requirements.
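
The outbox variant mentioned in the last bullet could be sketched as two statements. The pending_jobs table is assumed here; it does not exist in this repo.

```typescript
// Durable-outbox sketch (hypothetical pending_jobs table): the API inserts
// into pending_jobs in the same transaction as the projects row, so jobs
// survive listener outages; workers poll instead of (or as a backstop to)
// LISTEN.
const enqueueSQL = "INSERT INTO pending_jobs (project_id) VALUES ($1)";

// FOR UPDATE SKIP LOCKED lets many workers poll concurrently without
// blocking on, or double-claiming, the same row.
const claimSQL =
  "DELETE FROM pending_jobs WHERE id = (" +
  "SELECT id FROM pending_jobs ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED" +
  ") RETURNING project_id";
```

Deleting the claimed row inside the worker's transaction gives at-least-once processing; the existing status guard then deduplicates.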

Connection limits

PostgreSQL defaults to 100 connections. The API pool uses up to 10. Each worker uses 1 query pool connection + 1 listener connection. With the default config you can run ~40 workers before hitting limits. Use PgBouncer in transaction-pooling mode to multiplex many app connections over fewer server connections.


Observability Strategy

Structured logging (current)

Every component emits JSON logs:

  • API — structured JSON via a custom log() helper. Every request/response includes request_id, method, path, status, duration_ms.
  • Worker — Go log/slog (built-in since Go 1.21). Every state transition logs project_id, worker_id, previous_state, new_state.
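
The API's log() helper itself is not shown in this README, but a minimal version consistent with the fields listed above might look like:

```typescript
// Sketch of a structured JSON log helper; field names follow the list
// above, the implementation is illustrative.
function log(fields: Record<string, unknown>): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...fields });
  console.log(line);
  return line;
}

// Example request log entry:
// log({ level: "info", request_id: "req-1", method: "GET",
//       path: "/projects", status: 200, duration_ms: 12 });
```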

View live logs:

docker compose logs -f api
docker compose logs -f worker

What to add for production

  • Metrics (Prometheus + Grafana): projects_created_total, a provisioning_duration_seconds histogram, a projects_by_status gauge, and API request rate/latency/error rate
  • Tracing (OpenTelemetry + Jaeger): trace the full path (HTTP request → DB insert → pg_notify → worker LISTEN → final UPDATE) and measure each span
  • Alerting (Grafana Alertmanager): alert when the failed rate exceeds a threshold, when provisioning queue depth grows, or when API p99 latency spikes
  • Error tracking (Sentry): capture unhandled exceptions in the API and worker with full stack traces
  • Health checks (Kubernetes liveness probes): GET /health returns { status, db }; wire it into readiness and liveness probes

Lifecycle events as an audit log

The lifecycle_events table records every state transition with a timestamp:

SELECT project_id, previous_state, new_state, occurred_at
FROM lifecycle_events
ORDER BY occurred_at DESC
LIMIT 50;

This can feed a real-time audit log UI, or be streamed to a data warehouse via Debezium CDC for analytics.


Development

# View lifecycle events in the DB
docker compose exec postgres psql -U clouduser -d clouddb \
  -c "SELECT project_id, previous_state, new_state, occurred_at \
      FROM lifecycle_events ORDER BY occurred_at DESC LIMIT 20;"

# Run only the database
docker compose up postgres

# Run API locally (outside Docker)
cd api && npm install && npm run dev

# Run worker locally (outside Docker)
cd worker && go run .

Stopping the Stack

docker compose down

# Also remove the database volume (destroys all data):
docker compose down -v
