Add Primus WebSocket tunnel death detection for faster recovery #82

Merged
shawnburke merged 8 commits into main from claude/fix-primus-reconnection-z3AUa
Mar 7, 2026
@shawnburke
Collaborator

No description provided.

claude added 7 commits March 7, 2026 06:40
When the Primus WebSocket connection to the broker server drops (e.g.
after long uptime), Primus falls back to HTTP polling. The polling
requests return 200 but don't properly maintain client registration
with the broker server, causing the instance to lose its registration.

This change detects when Primus has degraded to polling by observing
HTTP requests to /primus/ paths that aren't WebSocket upgrades. When
detected, the auto-register loop forces a broker restart to
re-establish a clean WebSocket connection.
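
The detection described above could be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `isPrimusPollingRequest` is named in a later commit message, but its body here, the `newRequest` helper, and the `broker.example` host are assumptions. The check relies only on the standard `Upgrade: websocket` request header.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isPrimusPollingRequest reports whether r targets a /primus/ path without
// asking for a WebSocket upgrade, i.e. it looks like an HTTP polling request.
// The exact checks are assumed for illustration.
func isPrimusPollingRequest(r *http.Request) bool {
	if !strings.HasPrefix(r.URL.Path, "/primus/") {
		return false
	}
	isUpgrade := strings.EqualFold(r.Header.Get("Upgrade"), "websocket")
	return !isUpgrade
}

// newRequest is a hypothetical helper that builds a GET request, optionally
// carrying the WebSocket upgrade headers.
func newRequest(path string, upgrade bool) *http.Request {
	r, err := http.NewRequest(http.MethodGet, "http://broker.example"+path, nil)
	if err != nil {
		panic(err)
	}
	if upgrade {
		r.Header.Set("Connection", "Upgrade")
		r.Header.Set("Upgrade", "websocket")
	}
	return r
}

func main() {
	fmt.Println(isPrimusPollingRequest(newRequest("/primus/?transport=polling", false))) // true
	fmt.Println(isPrimusPollingRequest(newRequest("/primus/", true)))                    // false
	fmt.Println(isPrimusPollingRequest(newRequest("/api/v1/items", false)))              // false
}
```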

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
…auto-register

- Changed from boolean flag to timestamp tracking (primusPollingFirstSeen)
  so we know *when* polling started, not just *if* it happened
- Added PrimusPollingDuration() to check how long polling has persisted
- Added dedicated goroutine that checks every 10s with a 30s grace period,
  instead of piggybacking on the 5-minute auto-register loop
- The grace period avoids false positives: engine.io always starts with
  polling before upgrading to WebSocket, so brief polling during normal
  reconnection handshakes is expected and should not trigger a restart
- Only active when ReflectsRegistration() is true (registration or all mode)

Net effect: restart happens ~30-40s after WebSocket drops, not up to 5min.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Instead of only detecting the downstream symptom (polling fallback),
now also track the tunnel lifecycle directly:

- Added primusTunnelConnected flag set/cleared in proxyWebSocket when
  a /primus/ WebSocket tunnel is established or closes
- proxyWebSocket now clears polling detection when a fresh WebSocket
  tunnel is established (normal engine.io upgrade succeeded)
- The monitor goroutine uses two-phase detection:
  1. Tunnel death: detects when primusTunnelConnected goes from true→false
  2. Polling grace period: waits 30s for engine.io to re-upgrade to
     WebSocket; if still polling after that, forces a broker restart

This cuts detection time: the tunnel death is noticed within 10s
(check interval), then the 30s grace period gives engine.io a chance
to reconnect on its own before we intervene.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
…debounce

Replace the two-phase polling detection + monitor goroutine approach with a
simple callback from the reflector when the Primus WebSocket tunnel closes.
The relay instance manager wires a 30s cooldown to prevent restart loops
during startup.

This removes ~80 lines of polling detection code (primusPollingFirstSeen,
isPrimusPollingRequest, PrimusPollingDuration, ResetPrimusPollingDetected,
the 10s-interval monitor goroutine) in favor of a direct tunnel-death
callback that triggers an immediate broker restart.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Any WebSocket tunnel through the reflector is broker traffic — there's no
need to check if the path is /primus/. Trigger restart on any tunnel death,
rename fields/methods from Primus-specific to generic WS tunnel naming.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Move tunnel close handling into a defer block for safety. Add a minimum
tunnel duration check (30s) — if a tunnel dies within seconds of opening,
it likely indicates a fundamental issue where restarting won't help.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
The reflector should just fire the callback unconditionally on tunnel
close. The relay instance manager's cooldown already handles the
restart-loop prevention — no need for two overlapping checks.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
@shawnburke force-pushed the claude/fix-primus-reconnection-z3AUa branch from 136d410 to dbf1154 March 7, 2026 06:40
The traffic watermark is used to detect idle brokers. WebSocket tunnel
connections (primus) are infrastructure traffic, not real caller
traffic, so they should not reset the idle timer.

https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
@shawnburke merged commit 4a220b6 into main Mar 7, 2026
19 checks passed
@shawnburke deleted the claude/fix-primus-reconnection-z3AUa branch March 7, 2026 21:00
2 participants