Mini Cloud Control Plane

A simplified cloud control plane demonstrating asynchronous project lifecycle management — similar to how platforms like Supabase provision backend resources on demand. Built from scratch with a TypeScript API, Go worker, PostgreSQL as both the data store and message broker, and a React dashboard.


Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          Docker Network                             │
│                                                                     │
│  ┌───────────────┐   HTTP    ┌─────────────────────────────────┐   │
│  │   Browser     │ ────────► │    React Frontend  :5173        │   │
│  │ localhost:5173│ ◄──────── │    Vite dev server              │   │
│  └───────────────┘           │    /api/* → proxy               │   │
│                              └────────────┬────────────────────┘   │
│                                           │ proxy /api → :3000      │
│                              ┌────────────▼────────────────────┐   │
│  ┌───────────────┐           │   TypeScript API   :3000        │   │
│  │  curl / tools │ ─────────►│   Node.js + Express             │   │
│  │ localhost:3001│           │                                 │   │
│  └───────────────┘           │  POST /projects                 │   │
│                              │    INSERT row (status=creating) │   │
│                              │    pg_notify("provisioning_jobs")│   │
│                              │  GET  /projects[/:id]           │   │
│                              │  POST /projects/:id/retry       │   │
│                              │  GET  /metrics   GET /health    │   │
│                              └──────────────┬──────────────────┘   │
│                                             │ SQL + NOTIFY          │
│                              ┌──────────────▼──────────────────┐   │
│                              │     PostgreSQL 16   :5432        │   │
│                              │  projects table                 │   │
│                              │  lifecycle_events table         │   │
│                              │  LISTEN/NOTIFY channel          │   │
│                              └──────────────┬──────────────────┘   │
│                                             │ LISTEN                │
│                              ┌──────────────▼──────────────────┐   │
│                              │      Go Worker                  │   │
│                              │  pq.Listener (dedicated conn)   │   │
│                              │  Claim job atomically           │   │
│                              │  Simulate 2–5s work             │   │
│                              │  80% → ready / 20% → failed     │   │
│                              │  Record lifecycle_events        │   │
│                              └─────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────┘

Request flow — creating a project

1.  Browser sends POST /api/projects
2.  Vite proxy strips /api, forwards to API :3000
3.  API inserts row: { status: "creating" }
4.  API records lifecycle event: none → creating
5.  API calls pg_notify("provisioning_jobs", '{"project_id":"<uuid>"}')
6.  API returns 201 immediately — no waiting for the worker
7.  Worker receives NOTIFY on its dedicated listener connection
8.  Worker claims the job atomically:
    UPDATE projects SET status='provisioning' WHERE id=$1 AND status='creating'
    → if 0 rows are affected, another worker already claimed it → skip (idempotency guard)
9.  Worker sleeps 2–5 seconds (simulated provisioning)
10. Worker writes: ready (80%) or failed (20%) + error_reason
11. Worker records lifecycle event: provisioning → ready|failed
12. Browser poll picks up the new status
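
Steps 3–6 can be sketched as a single handler transaction. This is an illustrative sketch, not the repo's actual code: the client type is a minimal stand-in for a node-postgres PoolClient, and the helper names (createProject, notifyPayload) are hypothetical; table, column, and channel names follow this README.

```typescript
// Minimal structural client type so the sketch stands alone; in the real
// API this would be a node-postgres (pg) PoolClient.
type QueryResult = { rows: any[] };
type PgClient = { query: (sql: string, params?: unknown[]) => Promise<QueryResult> };

// The notification payload from step 5: {"project_id":"<uuid>"}
function notifyPayload(projectId: string): string {
  return JSON.stringify({ project_id: projectId });
}

async function createProject(client: PgClient, name: string) {
  await client.query("BEGIN");
  try {
    // Step 3: insert the row with status 'creating'
    const { rows } = await client.query(
      "INSERT INTO projects (name, status) VALUES ($1, 'creating') RETURNING *",
      [name]
    );
    const project = rows[0];
    // Step 4: record the lifecycle event none → creating
    await client.query(
      "INSERT INTO lifecycle_events (project_id, previous_state, new_state) " +
        "VALUES ($1, NULL, 'creating')",
      [project.id]
    );
    // Step 5: fire the job notification inside the same transaction;
    // if the transaction rolls back, PostgreSQL discards the notification.
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      notifyPayload(project.id),
    ]);
    await client.query("COMMIT");
    return project; // Step 6: respond 201 immediately; the worker takes over
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Firing pg_notify inside the transaction is what makes step 5 safe: a rollback discards both the row and the notification together.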

Stack

Component      Technology
Frontend       React 18 + Vite + TypeScript
API            Node.js 20 + Express + TypeScript
Worker         Go 1.22 + lib/pq
Queue          PostgreSQL pg_notify / LISTEN
Database       PostgreSQL 16
Orchestration  Docker Compose

Project Lifecycle State Machine

                     pg_notify
         ┌─────────────────────────────────────┐
         │                                     │
   ┌─────▼──────┐    Worker claims   ┌─────────┴────────┐
   │  creating  │ ────────────────► │   provisioning   │
   └────────────┘                   └────────┬─────────┘
                                             │
                          ┌──────────────────┤
                          │ 80%              │ 20%
                 ┌────────▼──────┐  ┌────────▼──────┐
                 │    ready      │  │    failed     │
                 └───────────────┘  └───────┬───────┘
                                            │
                                     POST /retry
                                            │
                                    (resets to creating,
                                     inside same TX as
                                     pg_notify — atomic)
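
The diagram above can be captured as a small transition table. The sketch below is illustrative only: the real system enforces transitions with conditional UPDATEs in the database, not an in-process map.

```typescript
// Illustrative guard table for the lifecycle state machine above.
type Status = "creating" | "provisioning" | "ready" | "failed";

const transitions: Record<Status, Status[]> = {
  creating: ["provisioning"],        // worker claims the job
  provisioning: ["ready", "failed"], // 80% / 20% simulated outcome
  ready: [],                         // terminal
  failed: ["creating"],              // POST /retry re-queues the project
};

function canTransition(from: Status, to: Status): boolean {
  return transitions[from].includes(to);
}
```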

Quick Start

# 1. Copy environment file
cp .env.example .env

# 2. Build and start all four services
docker compose up --build

# 3. Open the dashboard
open http://localhost:5173

# 4. Or hit the API directly
curl -s http://localhost:3001/health | jq

API Reference

All endpoints are available at http://localhost:3001 (host) or http://api:3000 (inside Docker).

POST /projects

Create a new project. Returns immediately (status: creating); provisioning is async.

curl -s -X POST http://localhost:3001/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "my-project"}' | jq

Response 201:

{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "name": "my-project",
  "status": "creating",
  "error_reason": null,
  "created_at": "2024-01-01T00:00:00.000Z",
  "updated_at": "2024-01-01T00:00:00.000Z"
}

GET /projects

List all projects. Optional ?status= filter.

curl -s http://localhost:3001/projects | jq
curl -s "http://localhost:3001/projects?status=failed" | jq

GET /projects/:id

Get a single project by UUID.

curl -s http://localhost:3001/projects/<uuid> | jq

POST /projects/:id/retry

Re-queue a failed project. Only works when status === "failed". Returns 409 otherwise.

curl -s -X POST http://localhost:3001/projects/<uuid>/retry | jq

GET /metrics

Live counts by status.

curl -s http://localhost:3001/metrics | jq

Response:

{
  "counts_by_status": { "creating": 0, "provisioning": 1, "ready": 7, "failed": 2 },
  "failed_total": 2
}

GET /health

DB connectivity check. Used by Docker Compose health checks.

curl -s http://localhost:3001/health | jq

Dashboard (Frontend)

The React UI at http://localhost:5173 has three tabs:

Tab       What it shows
Projects  Create projects, watch live status transitions, retry failures
Metrics   Success rate, avg provision time, status breakdown bar, failures table
Settings  Poll interval (1 s–30 s), status filter, live API/DB health, stack info

Failure Scenarios

Worker crashes mid-provisioning

A project stuck in provisioning with no worker running will remain there indefinitely. There is no heartbeat/timeout mechanism in this implementation — a real system would add a background sweeper that resets stale provisioning rows back to creating after a configurable timeout.
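
Such a sweeper is not implemented in this repo, but a minimal sketch might look like the following, assuming a 5-minute default timeout and the schema described above. The client type is a stand-in for a node-postgres client, and the helper names are hypothetical.

```typescript
// Hypothetical sweeper, NOT part of this repo: resets rows stuck in
// 'provisioning' longer than `timeoutMinutes` back to 'creating', then
// re-notifies the worker channel so they get picked up again.
type SweepClient = { query: (sql: string, params?: unknown[]) => Promise<{ rows: any[] }> };

function sweepSQL(timeoutMinutes: number): string {
  // timeoutMinutes comes from config, not user input, so interpolation is safe here
  return (
    "UPDATE projects SET status = 'creating', updated_at = now() " +
    "WHERE status = 'provisioning' " +
    `AND updated_at < now() - interval '${timeoutMinutes} minutes' ` +
    "RETURNING id"
  );
}

async function sweepStale(client: SweepClient, timeoutMinutes = 5): Promise<string[]> {
  const { rows } = await client.query(sweepSQL(timeoutMinutes));
  for (const row of rows) {
    // Re-enqueue each reset project on the same channel the API uses
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      JSON.stringify({ project_id: row.id }),
    ]);
  }
  return rows.map((r) => r.id);
}
```

Run on an interval (e.g., every minute) alongside the workers; the conditional WHERE clause makes concurrent sweepers harmless.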

PostgreSQL goes down while the API is running

The API uses a connection pool (pg.Pool). Requests that hit during the outage return 500. The startup loop retries the initial DB connection up to 15 times (2 s apart) before exiting. Once Postgres recovers, the pool reconnects automatically.

PostgreSQL goes down while the worker is running

pq.Listener has built-in reconnection with configurable min/max backoff (10 s–60 s). The outer ListenAndProcess loop adds an additional 5 s pause before restarting the listener. Any pg_notify events fired during the outage are lost — PostgreSQL does not buffer undelivered notifications. Projects in creating at the time of the outage will stay creating until manually retried.

Duplicate notifications for the same project

Can happen if the API retries a pg_notify or if two workers are running. The worker's atomic claim guard (UPDATE ... WHERE status='creating') ensures only one worker transitions the project. The second sees rowsAffected=0 and skips silently.
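
The guard's decision logic can be illustrated in miniature. In the real worker the check-and-set is a single conditional UPDATE, so atomicity comes from the database; this in-memory stand-in only shows the branch each worker takes.

```typescript
// In-memory stand-in for the worker's atomic claim. The real guard is a
// single conditional UPDATE (... WHERE status = 'creating'); atomicity
// comes from PostgreSQL, not from this object mutation.
type Project = { id: string; status: string };

function claim(project: Project): boolean {
  if (project.status !== "creating") {
    return false; // rowsAffected = 0: another worker got here first, skip
  }
  project.status = "provisioning";
  return true;
}
```

The first caller wins; every later caller sees a non-creating status and backs off silently.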

API container restarts after pg_notify but before response

The INSERT and the pg_notify commit in the same transaction, so either both take effect or neither does. If the container crashes after committing but before sending the HTTP response, the client sees a network error, but the worker still processes the project. The client can poll GET /projects to confirm the state.

Retry transaction rollback

POST /projects/:id/retry runs the status reset and the pg_notify inside a single transaction. If the transaction rolls back (e.g., DB error), PostgreSQL also discards the notification. No ghost jobs are enqueued.
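
A sketch of that handler, where the client type is a stand-in for a node-postgres client, the helper name is hypothetical, and the success status code is illustrative:

```typescript
// Illustrative retry handler: the status reset and the pg_notify share one
// transaction, so a rollback also discards the notification (no ghost jobs).
type RetryClient = { query: (sql: string, params?: unknown[]) => Promise<{ rowCount: number }> };

async function retryProject(client: RetryClient, id: string): Promise<number> {
  await client.query("BEGIN");
  try {
    const { rowCount } = await client.query(
      "UPDATE projects SET status = 'creating', error_reason = NULL " +
        "WHERE id = $1 AND status = 'failed'",
      [id]
    );
    if (rowCount === 0) {
      await client.query("ROLLBACK");
      return 409; // project is not in 'failed': nothing to retry
    }
    await client.query("SELECT pg_notify('provisioning_jobs', $1)", [
      JSON.stringify({ project_id: id }),
    ]);
    await client.query("COMMIT");
    return 200; // success code illustrative
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```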


Trade-offs

PostgreSQL as a queue vs. a dedicated broker (Kafka, RabbitMQ, SQS)

Chose PostgreSQL because:

  • Zero extra infrastructure — the queue is the same DB that stores state, so the notification and the row update are always consistent.
  • pg_notify fires inside transactions; if the transaction rolls back, the notification is discarded — free at-most-once delivery semantics.
  • Perfect for low-to-medium throughput (thousands of jobs/day).

Costs:

  • Notifications are not persisted. If no listener is connected when pg_notify fires, the message is dropped. A dedicated broker (Kafka, SQS) stores messages durably.
  • LISTEN uses a dedicated connection per worker — at very high worker counts this strains Postgres connection limits.
  • No backpressure or flow control; a burst of notifications floods workers immediately.

Polling vs. WebSockets / SSE on the frontend

Chose polling because:

  • Simple to implement and debug; no persistent connections to manage.
  • Adjustable interval (1–30 s) via the Settings tab.

Costs:

  • Even at 2 s polling, each browser tab sends 30 requests/min to the API. WebSockets or SSE would push updates only on changes, eliminating redundant requests.
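
A minimal polling loop in this spirit, with the fetch function injected so the sketch stands alone (names hypothetical):

```typescript
// Illustrative polling loop like the dashboard's, with the fetcher injected
// so it can be exercised without a running API.
type Fetcher = () => Promise<unknown>;

function clampInterval(ms: number): number {
  // The Settings tab allows 1–30 s
  return Math.min(30_000, Math.max(1_000, ms));
}

function startPolling(fetchProjects: Fetcher, onData: (d: unknown) => void, intervalMs: number) {
  const tick = async () => {
    try {
      onData(await fetchProjects());
    } catch {
      // ignore transient errors; the next tick retries
    }
  };
  void tick(); // fetch once immediately, then on the interval
  const handle = setInterval(tick, clampInterval(intervalMs));
  return () => clearInterval(handle); // returns a stop function
}
```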

Simulation vs. real provisioning

The worker's 2–5 s sleep and random outcome are intentional simplifications. In a real system this would be replaced by calls to a cloud provider SDK (e.g., AWS SDK, GCP client), Terraform, or Kubernetes API.


Scalability Notes

Horizontal API scaling

The API is stateless — multiple replicas behind a load balancer work without coordination. Each replica has its own pool connection to Postgres. The only shared state is the database.

Horizontal worker scaling

Multiple worker instances can run simultaneously. The idempotency guard (UPDATE ... WHERE status='creating') ensures each project is claimed by at most one worker. Adding workers increases throughput roughly linearly, up to Postgres connection limits.

Database bottlenecks

At high project creation rates:

  • projects and lifecycle_events grow unboundedly. Add a data-retention job to archive old rows.
  • The GROUP BY status query in GET /metrics does a full table scan. Add a materialized counter (e.g., a project_counts table updated by triggers) for O(1) metrics.
  • pg_notify payload is limited to 8 KB and notifications are not queued when no listener is connected. Use a pending_jobs table as a durable outbox for high-reliability requirements.
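
The outbox variant mentioned in the last bullet could be sketched as two statements. The pending_jobs table is assumed here; it does not exist in this repo.

```typescript
// Durable-outbox sketch (hypothetical pending_jobs table): the API inserts
// into pending_jobs in the same transaction as the projects row, so jobs
// survive listener outages; workers poll instead of (or as a backstop to)
// LISTEN.
const enqueueSQL = "INSERT INTO pending_jobs (project_id) VALUES ($1)";

// FOR UPDATE SKIP LOCKED lets many workers poll concurrently without
// blocking on, or double-claiming, the same row.
const claimSQL =
  "DELETE FROM pending_jobs WHERE id = (" +
  "SELECT id FROM pending_jobs ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED" +
  ") RETURNING project_id";
```

Deleting the claimed row inside the worker's transaction gives at-least-once processing; the existing status guard then deduplicates.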

Connection limits

PostgreSQL defaults to 100 connections. The API pool uses up to 10. Each worker uses 1 query pool connection + 1 listener connection. With the default config you can run ~40 workers before hitting limits. Use PgBouncer in transaction-pooling mode to multiplex many app connections over fewer server connections.


Observability Strategy

Structured logging (current)

Every component emits JSON logs:

  • API — structured JSON via a custom log() helper. Every request/response includes request_id, method, path, status, duration_ms.
  • Worker — Go log/slog (built-in since Go 1.21). Every state transition logs project_id, worker_id, previous_state, new_state.
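
The API's log() helper itself is not shown in this README, but a minimal version consistent with the fields listed above might look like:

```typescript
// Sketch of a structured JSON log helper; field names follow the list
// above, the implementation is illustrative.
function log(fields: Record<string, unknown>): string {
  const line = JSON.stringify({ ts: new Date().toISOString(), ...fields });
  console.log(line);
  return line;
}

// Example request log entry:
// log({ level: "info", request_id: "req-1", method: "GET",
//       path: "/projects", status: 200, duration_ms: 12 });
```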

View live logs:

docker compose logs -f api
docker compose logs -f worker

What to add for production

  • Metrics (Prometheus + Grafana): projects_created_total, a provisioning_duration_seconds histogram, a projects_by_status gauge, and API request rate/latency/error rate
  • Tracing (OpenTelemetry + Jaeger): trace the full path (HTTP request → DB insert → pg_notify → worker LISTEN → final UPDATE) and measure each span
  • Alerting (Grafana Alertmanager): alert when the failed rate exceeds a threshold, when provisioning queue depth grows, or when API p99 latency spikes
  • Error tracking (Sentry): capture unhandled exceptions in the API and worker with full stack traces
  • Health checks (Kubernetes liveness probes): GET /health returns { status, db }; wire it into readiness and liveness probes

Lifecycle events as an audit log

The lifecycle_events table records every state transition with a timestamp:

SELECT project_id, previous_state, new_state, occurred_at
FROM lifecycle_events
ORDER BY occurred_at DESC
LIMIT 50;

This can feed a real-time audit log UI, or be streamed to a data warehouse via Debezium CDC for analytics.


Development

# View lifecycle events in the DB
docker compose exec postgres psql -U clouduser -d clouddb \
  -c "SELECT project_id, previous_state, new_state, occurred_at \
      FROM lifecycle_events ORDER BY occurred_at DESC LIMIT 20;"

# Run only the database
docker compose up postgres

# Run API locally (outside Docker)
cd api && npm install && npm run dev

# Run worker locally (outside Docker)
cd worker && go run .

Stopping the Stack

docker compose down

# Also remove the database volume (destroys all data):
docker compose down -v
