Fix broker reconnection: channel-based restart with persistent watchdog by shawnburke · Pull Request #84 · cortexapps/axon

shawnburke · 2026-03-08T10:28:53Z

No description provided.

Overhaul broker restart logic to fix a ~30 minute reconnection gap observed in "all" mode after registration deletes.

Root cause: when the broker died and the supervisor exhausted retries, all background goroutines (auto-register, idle timeout) exited with the done channel, leaving nothing to trigger recovery.

Changes:

1. Channel-based restart with generation dedup. All restart triggers (WS tunnel death, idle timeout, broker exit) now send a restartRequest{reason, generation} to a single buffered channel. A dedicated consumer goroutine deduplicates by generation: stale requests from a previous broker lifecycle are discarded. This replaces scattered direct Restart() calls with ad-hoc cooldown timers.

2. Persistent watchdog with backoff. The restart consumer retries failed restarts with exponential backoff (5s, 10s, 20s... capped at 5min). This ensures the broker always recovers, even if re-registration temporarily fails.

3. Supervisor panic only on first start. The panic on max retries now only fires during initial startup (fail-fast for misconfiguration). On subsequent restarts, the error propagates to the watchdog for retry instead of crashing the agent.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS

Tests the full reconnection path: force-kills the snyk-broker server container with SIGKILL (non-graceful disconnect), restarts it, and verifies the axon relay reconnects and can pass traffic again. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS

Instead of a separate goroutine for idle timeout detection, the restart consumer now ticks every minute and calls shouldRestart() to check if the broker has been idle too long. If so, it produces a restart request that flows through the same generation-check + retry logic. This eliminates one goroutine per broker instance and centralizes all restart decision-making in a single loop. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
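A rough sketch of folding the idle check into the consumer loop is shown below. Only the name shouldRestart and the idea of a ticking consumer come from the text above. The signature of shouldRestart, the runConsumer helper, its parameters, and the rule that a zero timeout disables the check are all assumptions made for illustration. For brevity the idle branch calls restart directly, whereas the PR describes it producing a request that goes through the generation check and retry logic.

```go
package main

import (
	"fmt"
	"time"
)

// shouldRestart reports whether the broker has been idle for at least
// idleTimeout. Treating a zero timeout as "disabled" is an assumption.
func shouldRestart(idleFor, idleTimeout time.Duration) bool {
	return idleTimeout > 0 && idleFor >= idleTimeout
}

// runConsumer handles explicit restart requests and periodic idle checks
// in a single select loop, so no separate idle-timeout goroutine is needed.
func runConsumer(done <-chan struct{}, reqs <-chan string, tick, idleTimeout time.Duration,
	lastActivity func() time.Time, restart func(reason string)) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case reason := <-reqs:
			restart(reason)
		case now := <-ticker.C:
			if shouldRestart(now.Sub(lastActivity()), idleTimeout) {
				restart("idle timeout")
			}
		}
	}
}

func main() {
	done := make(chan struct{})
	reqs := make(chan string)
	last := time.Now() // no activity after this point in the demo
	restarted := make(chan string, 1)

	// Short durations keep the demo fast; the PR uses a tick measured in seconds.
	go runConsumer(done, reqs, 10*time.Millisecond, 30*time.Millisecond,
		func() time.Time { return last },
		func(reason string) {
			select {
			case restarted <- reason: // record the first restart only
			default:
			}
		})

	fmt.Println("restart reason:", <-restarted)
	close(done)
}
```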

getRandomPort() picked a random port in the 51000-52000 range, which could collide between sequential tests if the OS had not yet released the port. Use net.Listen("tcp", ":0") to get a genuinely free port from the OS instead. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
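Asking the OS for a free port by binding to port 0 might look like the sketch below. The helper name getFreePort and the choice of the loopback address are assumptions, and the PR's actual helper may differ. Note that the port is released before it is returned, so a small window remains in which another process could claim it.

```go
package main

import (
	"fmt"
	"net"
)

// getFreePort binds to port 0 so the OS picks an unused TCP port, then
// closes the listener so the caller can bind that port itself.
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	// For a "tcp" listener the address is a *net.TCPAddr.
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println("free port:", port)
}
```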

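Tick the idle check every 15s instead of every 1m. Because idleness is only evaluated on tick boundaries, the overshoot is bounded by just under one tick interval, so this reduces the worst-case overshoot beyond the configured idle timeout from ~59s to ~14s. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS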

shawnburke · 2026-03-09T08:19:25Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

shawnburke force-pushed the claude/fix-primus-reconnection-z3AUa branch from 24f949f to 93c0457 on March 8, 2026 10:31

claude added 4 commits March 8, 2026 20:17

Tick idle check every 15s instead of 1m

683e3dd


shawnburke merged commit 1c4283a into main Mar 9, 2026
16 checks passed

shawnburke deleted the claude/fix-primus-reconnection-z3AUa branch March 9, 2026 08:44

shawnburke merged 5 commits into main from claude/fix-primus-reconnection-z3AUa
