Fix broker reconnection: channel-based restart with persistent watchdog #84

Merged
shawnburke merged 5 commits into main from claude/fix-primus-reconnection-z3AUa
Mar 9, 2026

@shawnburke
Collaborator

No description provided.

Overhaul broker restart logic to fix a ~30 minute reconnection gap
observed in "all" mode after registration deletes.

Root cause: when the broker died and the supervisor exhausted retries,
all background goroutines (auto-register, idle timeout) exited with
the done channel, leaving nothing to trigger recovery.

Changes:

1. Channel-based restart with generation dedup
   All restart triggers (WS tunnel death, idle timeout, broker exit)
   now send a restartRequest{reason, generation} to a single buffered
   channel. A dedicated consumer goroutine deduplicates by generation:
   stale requests from a previous broker lifecycle are discarded.
   This replaces scattered direct Restart() calls with ad-hoc
   cooldown timers.

2. Persistent watchdog with backoff
   The restart consumer retries failed restarts with exponential
   backoff (5s, 10s, 20s... capped at 5min). This ensures the broker
   always recovers, even if re-registration temporarily fails.

3. Supervisor panic only on first start
   The panic on max retries now only fires during initial startup
   (fail-fast for misconfiguration). On subsequent restarts, the
   error propagates to the watchdog for retry instead of crashing
   the agent.
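
A minimal sketch of the generation dedup described in item 1, for illustration only. The `restartRequest` fields follow the commit message; `consumeRestarts`, its signature, and the rule that a successful restart bumps the generation by one are assumptions, not the code in this PR.

```go
package main

import "fmt"

// restartRequest mirrors the {reason, generation} pair from the commit message.
type restartRequest struct {
	reason     string
	generation int
}

// consumeRestarts is a hypothetical consumer loop. It acts only on requests
// whose generation matches the current broker lifecycle and drops the rest.
// It returns the final generation and the reasons that triggered a restart.
func consumeRestarts(requests <-chan restartRequest, generation int) (int, []string) {
	var handled []string
	for req := range requests {
		if req.generation != generation {
			// Stale: sent by a broker lifecycle that was already replaced.
			continue
		}
		// Assumed here: the restart succeeds and starts a new lifecycle.
		generation++
		handled = append(handled, req.reason)
	}
	return generation, handled
}

func main() {
	// A single buffered channel shared by all restart triggers.
	requests := make(chan restartRequest, 8)

	// Two triggers fire for the same lifecycle; only the first should count.
	requests <- restartRequest{reason: "ws tunnel death", generation: 0}
	requests <- restartRequest{reason: "broker exit", generation: 0}
	// A later trigger from the new lifecycle is honored.
	requests <- restartRequest{reason: "idle timeout", generation: 1}
	close(requests)

	generation, handled := consumeRestarts(requests, 0)
	fmt.Println(generation, handled)
}
```

In this run the duplicate "broker exit" request is discarded because the first request already advanced the generation, so two restarts happen instead of three.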

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
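
The backoff schedule in item 2 (5s, 10s, 20s... capped at 5min) can be sketched as below. The doubling rule and the cap come from the commit message; `backoffDelay`, `retryWithBackoff`, `errNotReady`, and the injected `sleep` function are hypothetical names chosen so the sketch runs without waiting.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	initialBackoff = 5 * time.Second
	maxBackoff     = 5 * time.Minute
)

// errNotReady stands in for any restart failure (illustrative only).
var errNotReady = errors.New("broker not ready")

// backoffDelay returns the wait before retry number attempt (0-based):
// 5s doubled once per prior attempt, never exceeding maxBackoff.
func backoffDelay(attempt int) time.Duration {
	d := initialBackoff
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

// retryWithBackoff calls restart until it succeeds, sleeping between
// failures. It never gives up, which is the "persistent" part of the
// watchdog. It returns how many times restart was called.
func retryWithBackoff(restart func() error, sleep func(time.Duration)) int {
	for attempt := 0; ; attempt++ {
		if err := restart(); err == nil {
			return attempt + 1
		}
		sleep(backoffDelay(attempt))
	}
}

// noSleep lets the example run instantly; real code would pass time.Sleep.
func noSleep(time.Duration) {}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt))
	}

	calls := 0
	n := retryWithBackoff(func() error {
		calls++
		if calls < 3 {
			return errNotReady // e.g. re-registration still failing
		}
		return nil
	}, noSleep)
	fmt.Println("restart succeeded after", n, "calls")
}
```

Under these assumptions the delay reaches the 5-minute cap on the seventh attempt (index 6) and stays there.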
@shawnburke shawnburke force-pushed the claude/fix-primus-reconnection-z3AUa branch from 24f949f to 93c0457 Compare March 8, 2026 10:31
claude added 4 commits March 8, 2026 20:17
Tests the full reconnection path: force-kills the snyk-broker server
container with SIGKILL (non-graceful disconnect), restarts it, and
verifies the axon relay reconnects and can pass traffic again.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Instead of a separate goroutine for idle timeout detection, the restart
consumer now ticks every minute and calls shouldRestart() to check if
the broker has been idle too long. If so, it produces a restart request
that flows through the same generation-check + retry logic.

This eliminates one goroutine per broker instance and centralizes all
restart decision-making in a single loop.
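
A rough sketch of folding the idle check into the consumer loop, as the commit above describes. Only the name `shouldRestart` comes from the commit message; the `broker` struct, its fields, the `consume` function, and the string-typed request channel are assumptions made to keep the example small. Real code would likely feed `ticks` from `time.NewTicker(...).C`; synthetic ticks are used here so the run is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// broker holds just enough state for the idle check (hypothetical fields).
type broker struct {
	lastActivity time.Time
	idleTimeout  time.Duration
}

// shouldRestart reports whether the broker has been idle for at least
// idleTimeout as of now. A zero timeout disables the check.
func (b *broker) shouldRestart(now time.Time) bool {
	return b.idleTimeout > 0 && now.Sub(b.lastActivity) >= b.idleTimeout
}

// consume is a single loop that handles explicit restart requests and, on
// each tick, checks for idleness itself instead of relying on a separate
// goroutine. It returns when requests is closed.
func consume(b *broker, requests <-chan string, ticks <-chan time.Time, handle func(reason string)) {
	for {
		select {
		case reason, ok := <-requests:
			if !ok {
				return
			}
			handle(reason)
		case now := <-ticks:
			if b.shouldRestart(now) {
				handle("idle timeout")
			}
		}
	}
}

func main() {
	start := time.Now()
	b := &broker{lastActivity: start, idleTimeout: 2 * time.Minute}

	requests := make(chan string)
	ticks := make(chan time.Time)
	done := make(chan struct{})

	go func() {
		defer close(done)
		consume(b, requests, ticks, func(reason string) {
			fmt.Println("restart:", reason)
		})
	}()

	ticks <- start.Add(1 * time.Minute) // still within the idle budget
	ticks <- start.Add(2 * time.Minute) // budget exhausted, restart requested
	close(requests)
	<-done
}
```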

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
getRandomPort() was picking a random port in the 51000-52000 range,
which could collide between sequential tests if the OS hadn't released
the port yet. Use net.Listen("tcp", ":0") to get a genuinely free port
from the OS instead.
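
For reference, a small sketch of asking the OS for a free port with the standard `net` package. The helper name `getFreePort` and the choice to bind to 127.0.0.1 are assumptions; the test helper in this PR may look different.

```go
package main

import (
	"fmt"
	"net"
)

// getFreePort asks the kernel for an unused TCP port by listening on port 0,
// then releases the listener and returns the port that was assigned.
func getFreePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := getFreePort()
	if err != nil {
		fmt.Println("could not get a free port:", err)
		return
	}
	fmt.Println("free port:", port)
}
```

Another process can still claim the port between the listener closing and the test binding to it. Where that matters, passing the open listener to the server under test avoids the window entirely.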

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Reduces worst-case overshoot beyond the configured idle timeout from
~59s to ~14s.
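
One way to read these figures, assuming the overshoot comes purely from how often the idle check runs: if the timeout expires shortly after a check, it goes unnoticed until the next one. The sketch below works through that arithmetic. The 15-second interval is an inference from the ~14s figure, not something the commit message states, and `overshoot` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// overshoot returns how long an expired idle timeout goes undetected when
// checks run every tick and the timeout expired sinceLastTick after the
// most recent check.
func overshoot(tick, sinceLastTick time.Duration) time.Duration {
	if sinceLastTick <= 0 || sinceLastTick > tick {
		return 0
	}
	return tick - sinceLastTick
}

func main() {
	// Near-worst case: the timeout expires one second after a check.
	fmt.Println(overshoot(time.Minute, time.Second))
	fmt.Println(overshoot(15*time.Second, time.Second)) // assumed new interval
}
```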

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
@shawnburke
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@shawnburke shawnburke merged commit 1c4283a into main Mar 9, 2026
16 checks passed
@shawnburke shawnburke deleted the claude/fix-primus-reconnection-z3AUa branch March 9, 2026 08:44

2 participants