
@ayushmi ayushmi commented Aug 25, 2025

This PR introduces a self-contained chaos testing harness and CI workflow for Phase‑1 (single‑node) durability/availability validation.

What’s included

  • Harness under AgentStateTesting/chaos with:
    • Workload: concurrent writers/readers driving the public HTTP API.
    • Watcher: SSE consumer validating monotonic commit sequence and last-value agreement across nemeses.
    • Nemeses: pause/unpause (stall surrogate), crash/recovery, and disk‑full (tmpfs).
  • Compose overlay: size‑limited tmpfs at /data, stable container name, test‑only Debian runtime with tools.
  • CI: .github/workflows/chaos-ci.yml runs on PRs/nightly; prints container logs on failure.
  • Metric assertions: ensure watch_drops_total{reason="overflow"} == 0, watch_clients{proto="sse"} >= 1, and watch_events_total > 0.

Notes

  • No server/runtime code changes.
  • SSE resumes aren’t counted yet (only gRPC resumes are); left as a follow‑up if desired.
  • Disk‑full is induced by filling tmpfs; we then free space and verify write recovery.
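The "free space, then verify write recovery" step could be expressed as a bounded polling loop. The sketch below is illustrative only: `wait_for_write_recovery` and its parameters are hypothetical names, and `try_write` stands in for whatever call the harness uses to attempt a write through the public HTTP API, returning `True` on success.

```python
import time


def wait_for_write_recovery(try_write, timeout_s=30.0, interval_s=0.5,
                            clock=time.monotonic, sleep=time.sleep):
    """Call try_write() until it returns True or timeout_s elapses.

    Returns the number of attempts needed; raises TimeoutError if writes
    never succeed within the deadline. `clock` and `sleep` are injectable
    so the loop can be tested without real delays.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if try_write():
            return attempts
        if clock() >= deadline:
            raise TimeoutError(
                f"writes did not recover after {attempts} attempts"
            )
        sleep(interval_s)
```

A deadline keeps a server that never recovers from hanging the CI job, and the attempt count gives a rough signal of how long recovery took.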

Follow‑ups (separate PRs)

  • Add watch-stream soaks to nightly.
  • Optional: track an SSE resumes metric; clarify delete labeling in watch_events_total.
  • Expand to true network partitions once clustering lands.

Signed-off-by: Ayush Mittal <ayushsmittal@gmail.com>

…l via tmpfs), compose overlay, and CI workflow

Signed-off-by: Ayush Mittal <ayushsmittal@gmail.com>
…k-full

Signed-off-by: Ayush Mittal <ayushsmittal@gmail.com>
…to chaos suite

Signed-off-by: Ayush Mittal <ayushsmittal@gmail.com>