Terminal + AI Workspace for Disaster Recovery, Cloud Monitoring, Site-Down Assistant & DDoS Safeguard (Local-First, Enterprise-Ready)
("second brother" of CloudDeploy - same architecture, new mission: restore service fast, safely, and auditably.)
If you've ever lost hours during an outage because logs are scattered, tools are inconsistent, approvals are unclear, or everyone is guessing, CloudRecovery is for you.
CloudRecovery is a recovery workspace that runs your real ops/DR CLIs in a browser (left panel), while an AI SRE copilot (right panel) consumes sanitized, live signals (alerts/events/logs/synthetics) and turns chaos into an executable, policy-guarded recovery plan, with always-on monitoring agents and autopilot modes designed for safe MTTR reduction.
⭐ If CloudRecovery saves you even one incident, please star the repo.
- 🖥️ Real Terminal in the Browser (PTY-backed, not fake logs)
- 🔄 Live Streaming Output + prompt detection (CloudDeploy DNA)
- 🤖 AI Copilot reads sanitized terminal tail + incident signals
- 🎯 Plan → Approve → Execute recovery workflow (commands are never executed silently)
- 🧰 MCP Tool Server (same tool layer powers UI + agents, no duplicated automation)
- 🧾 Audit-Friendly UX: timeline, evidence snapshots, approvals, post-incident summary
- 🟥 OpenShift (OCP) Support: watch events/pods, rollout actions, safe restarts/rollback (policy-gated)
- ☁️ Hybrid Estate Support: OpenShift + Oracle instances + EC2 instances
- 🧑‍✈️ Human-in-the-loop by default (prod-safe), Autopilot when enabled
- 🕐 24/7 Monitoring via Linux Agent daemon (systemd service)
- 🌐 Site-Down Assistant: DNS/TLS/HTTP triage + Docker/K8s quick hints
- 🛡️ Emergency DDoS Monitor (observe-only): top talkers + SYN flood hints + latency/5xx triggers
- 🦠 Ransomware & Integrity Watch (heuristic): suspicious file extensions + high CPU + auth hints
- 🔐 Cloud Identity & Security Hygiene (heuristic): IMDS exposure + risky env vars + K8s SA token checks
- 🧪 Production-grade interactive monitor script: `scripts/monitor_anything.sh` with Docker/K8s listing + mode selection
CloudRecovery combines four things into one workflow:

- A real terminal workspace:
  - Runs a real PTY-backed terminal session in your browser
  - Streams output live
  - Detects interactive prompts & steps
  - Shows Assistant / Summary / Issues in a clean enterprise UI
- An AI SRE copilot:
  - Reads redacted terminal output + evidence (redaction by default)
  - Explains what's happening in plain language
  - Produces ranked hypotheses
  - Generates executable plans and runbooks
  - Helps troubleshoot failures with safe, actionable steps
- An MCP tool server:
  - Exposes terminal + recovery actions as tools (stdio MCP)
  - Enables external orchestrators/agents to observe and act (policy-guarded)
  - Same tool layer powers UI autopilot
- A 24/7 monitoring agent:
  - A daemon installed on Linux hosts (systemd)
  - Continuously collects health + OpenShift signals + synthetics
  - Pushes evidence to the control plane
  - (When enabled) executes approved runbooks under policy gates
- 👩‍💻 Faster onboarding: same recovery UX across engineers and environments
- 🔥 Lower MTTR: less "where do I look?" time; evidence is pulled automatically
- 🧾 Audit-ready: evidence + actions + approvals + timeline export
- 🛡️ Safe automation: policies + risk labels + approvals + two-person gates
- 🧩 Extensible: add providers, WAF/CDN connectors, runbook packs, and policy packs
- 🔒 Local-first / Bastion-friendly: run on an ops workstation, jump host, or hardened runner
The control plane:

- Hosts the terminal workspace + AI copilot
- Receives evidence from agents (and local scripts)
- Streams evidence via WebSocket: `/ws/signals`
- Agent APIs:
  - `POST /api/agent/heartbeat`
  - `POST /api/agent/evidence`
  - `GET /api/agent/commands` (poll channel; can be upgraded to WS)
  - `POST /api/agent/command` (enqueue)
  - `GET /api/evidence/tail`
- Health endpoint: `GET /health`
- Session controls (recommended for production):
  - `POST /api/session/stop`
  - `POST /api/autopilot/disable`
  - `GET /api/session/status`
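A minimal sketch of how an agent or script might push a heartbeat and a piece of evidence with `curl`; the `Authorization` header and the JSON field names are assumptions for illustration, not a documented schema:

```bash
CP="https://cloudrecovery-control-plane.example.com"
TOKEN="REPLACE_WITH_SHARED_SECRET"

# Heartbeat (assumed payload shape)
curl -sS -X POST "$CP/api/agent/heartbeat" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"agent_id":"agent-ocp-prod-1","status":"ok"}'

# Evidence push (assumed payload shape; source/severity values match the ones documented below)
curl -sS -X POST "$CP/api/agent/evidence" \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"agent_id":"agent-ocp-prod-1","source":"agent:host","severity":"warning","message":"disk usage 91% on /var"}'
```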
The Linux agent:

- Collectors:
  - `agent:host` (CPU/mem/disk)
  - `agent:ocp` (events/pods, CrashLoopBackOff detection)
  - `synthetics` (DNS/TLS/HTTP checks when configured)
- Pushes evidence to the control plane continuously
- (Optional) executes safe runbooks when autopilot is enabled and policy allows
CloudRecovery ships a production-grade interactive script (`scripts/monitor_anything.sh`) that:
- Lists running Docker containers and lets the user select one
- Lists Kubernetes namespaces/deployments and lets the user select targets
- Includes Site-Down Assistant and Emergency DDoS Monitor (observe-only)
- Can optionally push evidence to the control plane using env vars
```bash
pip install cloudrecovery
```

CloudRecovery runs locally and uses your system tools (oc / kubectl / cloud CLIs / SSH / etc.).
No vendor lock-in: the AI provider is configurable.
- Python 3.11+
- macOS / Linux recommended (PTY-based runner)
- Windows supported via WSL2 (recommended)
- `oc` installed and available in PATH
- kubeconfig present for the runtime user (control plane runner or agent)
- Agent installed on Linux hosts where you want system-level telemetry
- systemd available
Run the Web Workspace (Terminal + AI):
```bash
cloudrecovery ui --cmd bash --host 127.0.0.1 --port 8787
```

Open: http://127.0.0.1:8787

Health check:

```bash
curl http://127.0.0.1:8787/health
```

Tip: you can run any interactive CLI wizard; prompt detection is pluggable.
If your repo includes `scripts/monitor_anything.sh`:

```bash
chmod +x scripts/monitor_anything.sh
./scripts/monitor_anything.sh
```

Run it inside the web workspace:

```bash
cloudrecovery ui --cmd ./scripts/monitor_anything.sh --host 127.0.0.1 --port 8787
```

Optionally push evidence to the control plane:

```bash
export CLOUDRECOVERY_CONTROL_PLANE="https://cloudrecovery.example.com"
export CLOUDRECOVERY_AGENT_TOKEN="REPLACE"
export CLOUDRECOVERY_AGENT_ID="monitor-wizard-1"
export CLOUDRECOVERY_EMIT_EVIDENCE="1"
./scripts/monitor_anything.sh
```

The script is local-first and observe-only by default (no automatic remediation).
```bash
sudo mkdir -p /etc/cloudrecovery
sudo cp cloudrecovery/agent/agent.yaml.example /etc/cloudrecovery/agent.yaml
sudo nano /etc/cloudrecovery/agent.yaml
```

Example:

```yaml
agent_id: "agent-ocp-prod-1"
control_plane_url: "https://cloudrecovery-control-plane.example.com"
token: "REPLACE_WITH_SHARED_SECRET"
env: "prod"
autopilot_enabled: false
synthetics_url: "https://your-service.example.com/health"
poll_interval_s: 15.0
openshift_enabled: true
host_enabled: true
```

Install and start the systemd unit:

```bash
sudo cp cloudrecovery/agent/systemd/cloudrecovery-agent.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now cloudrecovery-agent
sudo systemctl status cloudrecovery-agent
journalctl -u cloudrecovery-agent -f
```

The control plane supports a shared token (upgrade to mTLS later).
Set on the control plane host:

```bash
export CLOUDRECOVERY_AGENT_TOKEN="REPLACE_WITH_SHARED_SECRET"
cloudrecovery ui --cmd bash --host 0.0.0.0 --port 8787
```

Agent config must match:

```yaml
token: "REPLACE_WITH_SHARED_SECRET"
```

CloudRecovery adds OpenShift MCP tools through `oc`:
Read-only tools:

- `ocp.get_pods`
- `ocp.get_events`
- `ocp.rollout_status`
- `ocp.list_namespaces`

Mutating tools:

- `ocp.rollout_restart` (medium risk)
- `ocp.scale_deployment` (medium risk)
- `ocp.rollout_undo` (high risk, typically two-person in prod)

In prod, mutating actions default to approval required.
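For instance, an external agent could invoke one of the read-only tools over stdio using the same JSON shape as the `cli.read` example later in this README (the `namespace` argument here is an illustrative assumption):

```bash
echo '{"id":"2","tool":"ocp.get_pods","args":{"namespace":"my-app"}}' \
  | cloudrecovery mcp --cmd bash
```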
CloudRecovery ships built-in checks:
- DNS resolution
- TLS handshake
- HTTP status + latency
Run via API:
```bash
curl -X POST http://127.0.0.1:8787/api/synthetics/check \
  -H 'Content-Type: application/json' \
  -d '{"url":"https://example.com/health"}'
```

Agents can run synthetics continuously if `synthetics_url` is set in the agent config.
Use this when your service is "down" and you need structured evidence fast:

- Distinguishes DNS failure vs TLS failure vs connect failure vs HTTP 5xx/4xx
- Optional quick hints from:
  - Docker container state/health
  - Kubernetes "bad pod" counts (CrashLoopBackOff, ImagePullBackOff, Pending)

Outputs explicit triggers like:

- `trigger=dns_fail`
- `trigger=tls_fail`
- `trigger=connect_fail`
- `trigger=http_5xx`
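A minimal sketch of how this kind of triage can be derived from `curl` exit codes and HTTP status alone (illustrative only; the shipped Site-Down Assistant does more, and its exact logic may differ):

```bash
#!/usr/bin/env bash
# Illustrative triage sketch, not the shipped script: map a failing URL
# to the same explicit trigger labels the Site-Down Assistant emits.
URL="${1:-https://example.com/health}"

code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$URL")
rc=$?

case "$rc" in
  6)     echo "trigger=dns_fail" ;;        # curl: could not resolve host
  7)     echo "trigger=connect_fail" ;;    # curl: failed to connect
  35|60) echo "trigger=tls_fail" ;;        # curl: TLS handshake / certificate problem
  0)
    if [ "$code" -ge 500 ]; then
      echo "trigger=http_5xx status=$code"
    else
      echo "ok status=$code"
    fi
    ;;
  *)     echo "trigger=connect_fail rc=$rc" ;;
esac
```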
Designed for "is this a DDoS?" triage without making changes:

- HTTP latency + 5xx symptoms
- SYN-RECV state count hint (Linux best-effort)
- conntrack top destination ports (Linux best-effort)
- top talkers from origin access logs (nginx/apache, best-effort)
- emits an AI-friendly `next_checks` hint line (WAF, rate limits, bot score, autoscaling, LB health, top URLs)

This does not block traffic. It's a safe triage tool that helps responders decide the next action.
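For a rough idea of the underlying observe-only signals, here are standard Linux commands that approximate them (illustrative; the log path and exact commands are assumptions, and the shipped monitor may differ):

```bash
# SYN-RECV count hint (half-open connections, best-effort)
ss -Hn state syn-recv | wc -l

# conntrack top destination ports (requires conntrack-tools and root)
sudo conntrack -L 2>/dev/null | grep -o 'dport=[0-9]*' | sort | uniq -c | sort -rn | head

# Top talkers from an nginx access log (assumed default log path)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head
```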
Runbooks live in `cloudrecovery/runbooks/packs/`. Included examples:

- `crashloopbackoff_openshift.yaml`
- `site_down_basic.yaml`
Runbooks define:
- triggers (what incident symptom they address)
- steps (actions/commands)
- gates (verification)
- rollback steps (if needed)
Autopilot executes runbooks (not freeform LLM commands) in production setups.
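As a sketch of what such a pack could look like, here is an illustrative YAML with assumed field names (the shipped packs define their own schema; treat this as a conceptual example only):

```yaml
# Illustrative only: field names are assumptions, not the shipped schema.
name: crashloopbackoff_openshift_example
triggers:
  - symptom: "CrashLoopBackOff detected in namespace"
steps:
  - id: collect-events
    risk: read_only
    command: "oc get events -n {{ namespace }} --sort-by=.lastTimestamp | tail -n 50"
  - id: restart-deployment
    risk: medium
    requires_approval: true        # mutating step: approval-gated in prod
    command: "oc rollout restart deployment/{{ deployment }} -n {{ namespace }}"
gates:
  - id: verify-rollout
    command: "oc rollout status deployment/{{ deployment }} -n {{ namespace }} --timeout=120s"
rollback:
  - id: undo-rollout
    risk: high
    requires_approval: true
    command: "oc rollout undo deployment/{{ deployment }} -n {{ namespace }}"
```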
CloudRecovery keeps CloudDeploy's autopilot behavior and adds incident-grade autopilot modes:

- Observe-only:
  - evidence collection only
  - read-only commands
  - no state-changing actions
- Guarded execution:
  - executes pre-approved runbook steps
  - pauses at policy gates
  - requires approvals for mutating steps in prod
- Full autopilot:
  - fast iteration mode
  - still validated by policy engine
  - enable only in explicitly configured environments
Terminal logs sent to the AI are sanitized (`cloudrecovery/redact.py`):

- masks API keys/tokens/passwords
- masks Bearer tokens
- can optionally redact `.env` values while keeping keys
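A toy illustration of the idea, not `cloudrecovery/redact.py` itself: mask anything that looks like a Bearer token or a `KEY=value` secret before the text leaves the box:

```bash
# Toy illustration only; cloudrecovery/redact.py is the real implementation.
redact() {
  sed -E \
    -e 's/(Bearer )[A-Za-z0-9._-]+/\1****/g' \
    -e 's/((API_KEY|TOKEN|PASSWORD|SECRET)[A-Z_]*=)[^[:space:]]+/\1****/g'
}

echo 'curl -H "Authorization: Bearer abc123" ; export DB_PASSWORD=hunter2' | redact
# -> curl -H "Authorization: Bearer ****" ; export DB_PASSWORD=****
```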
Commands and recovery actions are validated server-side by the policy engine:

- terminal command validation (`cloudrecovery/mcp/policy.py`)
- recovery action validation (`cloudrecovery/mcp/action_policy.py`)
- environment packs:
  - `cloudrecovery/policy/packs/prod.yaml`
  - `cloudrecovery/policy/packs/staging.yaml`
You run CloudRecovery locally / on a bastion / on a hardened recovery runner:
- no credential harvesting
- no remote terminal execution layer required
- commands execute in your PTY (you see them typing)
Manifest: `deploy/openshift/cloudrecovery-control-plane.yaml`

Apply:

```bash
oc apply -f deploy/openshift/cloudrecovery-control-plane.yaml
```

Before applying:

- replace `REPLACE_IMAGE`
- create secret `cloudrecovery-secrets` with key `agent_token`
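One way to create that secret with `oc` (run it in the project you deploy into; the token value is a placeholder):

```bash
oc create secret generic cloudrecovery-secrets \
  --from-literal=agent_token="REPLACE_WITH_SHARED_SECRET"
```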
CloudRecovery can run as a tool server for external agents/orchestrators:

```bash
cloudrecovery mcp --cmd bash
```

Example tool call:

```bash
echo '{"id":"1","tool":"cli.read","args":{"tail_chars":1200,"redact":true}}' \
  | cloudrecovery mcp --cmd bash
```

CloudRecovery uses `cloudrecovery/llm/llm_provider.py` and supports:
- watsonx.ai (default)
- OpenAI
- Claude (Anthropic)
- Ollama (local)
Example (watsonx.ai):
```bash
export GITPILOT_PROVIDER=watsonx
export WATSONX_API_KEY="YOUR_KEY"
export WATSONX_PROJECT_ID="YOUR_PROJECT_ID"
export WATSONX_BASE_URL="https://us-south.ml.cloud.ibm.com"
export GITPILOT_WATSONX_MODEL="ibm/granite-3-8b-instruct"
```

CloudRecovery includes CI/CD health checks via `.github/workflows/health-check.yml`.
What's tested:

- ✅ Server startup and health endpoint (`/health`)
- ✅ Agent authentication (token security)
- ✅ MCP tool registration (session, cli, policy tools)
- ✅ Policy engine (blocks dangerous commands, allows safe ones)
- ✅ Redaction functionality (masks secrets/API keys)
- ✅ Runbook discovery and schema validation
- ✅ Production readiness checks (required files, security configs)

Triggers:

- On push to `main` or `claude/**` branches
- On pull requests to `main`
- Every 6 hours (scheduled)
- Manual workflow dispatch
Run locally:

```bash
curl http://127.0.0.1:8787/health
pytest tests/ -v
make lint
```

Live signals and monitoring:

- Live WebSocket feed of incidents, alerts, health metrics
- Agent heartbeats every 15 seconds (configurable)
- Severity levels: `info`, `warning`, `critical`
- Sources: `agent:host`, `agent:ocp`, `synthetics`, `monitor_wizard`
- CPU, memory, disk usage tracking
- OpenShift pod status (CrashLoopBackOff detection)
- Synthetic checks (DNS, TLS, HTTP latency)
- Automatic buffering during network outages (agent-side)
- Terminal output (left panel)
- AI copilot analysis (right panel)
- Live evidence timeline with timestamps
- Autopilot execution status
- Policy-guarded automation (validates commands before execution)
- Redaction by default (never sends secrets to LLMs)
- Approval gates (mutating actions require human approval in prod)
- Rollback support (runbooks include rollback steps)
- Audit trail (timeline export for post-incident review)
CloudRecovery is designed to be extended with notifications.
```python
# Example integration point (not included by default)
async def send_admin_alert(incident, admin_emails):
    """
    Send email/Slack notification when critical incidents are detected.
    Include link to monitoring dashboard for real-time oversight.
    """
    if incident.severity == "critical":
        dashboard_link = f"https://cloudrecovery.example.com/?incident={incident.incident_id}"
        # send via SMTP/SendGrid/Slack webhook
```

Environment variables for notifications:
```bash
export CLOUDRECOVERY_SMTP_HOST="smtp.example.com"
export CLOUDRECOVERY_SMTP_PORT="587"
export CLOUDRECOVERY_SMTP_USER="alerts@example.com"
export CLOUDRECOVERY_SMTP_PASSWORD="***"
export CLOUDRECOVERY_ADMIN_EMAILS="admin1@example.com,admin2@example.com"

# Slack webhook (alternative)
export CLOUDRECOVERY_SLACK_WEBHOOK="https://hooks.slack.com/services/..."
```
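With the webhook set, a notification hook could post to Slack using the standard incoming-webhook payload (the message text is just an example):

```bash
curl -sS -X POST -H 'Content-Type: application/json' \
  -d '{"text":"CloudRecovery: critical incident detected - see the monitoring dashboard"}' \
  "$CLOUDRECOVERY_SLACK_WEBHOOK"
```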
To expose the workspace beyond localhost:

```bash
cloudrecovery ui --cmd bash --host 0.0.0.0 --port 8787
# Put behind SSO/MFA/auth proxy in production.
```

Via API (if implemented in your control plane):
```bash
curl -X POST http://127.0.0.1:8787/api/session/stop
curl -X POST http://127.0.0.1:8787/api/autopilot/disable
curl http://127.0.0.1:8787/api/session/status
```

Via Web UI:
- "Stop Autopilot"
- "Terminate Session"
- Full audit trail of actions
- Agent authentication configured (`CLOUDRECOVERY_AGENT_TOKEN`)
- Production policy pack active (`cloudrecovery/policy/packs/prod.yaml`)
- HTTPS enabled (reverse proxy: nginx/Caddy)
- Notification integrations configured (email/Slack)
- Runbooks tested in staging first
- Admin access controls (SSO/MFA recommended)
- Evidence retention policy defined (GDPR/compliance)
- Incident response playbook (escalation ownership)
- Health checks enabled (scheduled CI)
```bash
make sync
make test
make lint
```

Run the UI:

```bash
cloudrecovery ui --cmd bash
```

PRs welcome for:
- OpenShift enhancements (RBAC, API-watch collectors)
- new runbook packs (DR failover, DB restore, DDoS edge response)
- enterprise policy packs (two-person approvals, blast-radius rules)
- UI improvements (signals dashboard, timeline export)
- new MCP tools (WAF/CDN, DNS, monitoring adapters)
Guidelines:
- safe-by-default automation
- never leak secrets; respect redaction
- validate all actions server-side
- keep mutating actions explicit and auditable
If you hit a tricky incident edge-case:
- capture sanitized logs (Export Logs button)
- open an issue with evidence + terminal tail
- propose a new runbook pack for the scenario

⭐ If CloudRecovery helps your team recover faster, please star the repo.

Apache 2.0 - see LICENSE.
- ✅ 24/7 Linux Agent daemon (systemd)
- ✅ Evidence store + live signals WebSocket (`/ws/signals`)
- ✅ OpenShift monitoring + safe recovery actions (policy-gated)
- ✅ Synthetics checks (DNS/TLS/HTTP)
- ✅ Site-Down Assistant (explicit triggers + quick infra hints)
- ✅ Emergency DDoS Monitor (observe-only triage)
- ✅ Runbooks as code (packs) + rollback + verification gates
- ✅ Policy packs (prod vs staging) for enterprise adoption
- ✅ Automated health check workflow (CI/CD testing every 6 hours)
- ✅ Production monitoring & alerting documentation
- ✅ Emergency stop controls (API + Web UI)
Made with ❤️ for SRE / DevOps teams who want lower MTTR without breaking production.


