Add Primus WebSocket tunnel death detection for faster recovery #82
Merged
shawnburke merged 8 commits into main on Mar 7, 2026
When the Primus WebSocket connection to the broker server drops (e.g. after long uptime), Primus falls back to HTTP polling. The polling requests return 200 but don't properly maintain client registration with the broker server, causing the instance to lose its registration. This change detects when Primus has degraded to polling by observing HTTP requests to /primus/ paths that aren't WebSocket upgrades. When detected, the auto-register loop forces a broker restart to re-establish a clean WebSocket connection. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
…auto-register
- Changed from boolean flag to timestamp tracking (primusPollingFirstSeen) so we know *when* polling started, not just *if* it happened
- Added PrimusPollingDuration() to check how long polling has persisted
- Added dedicated goroutine that checks every 10s with a 30s grace period, instead of piggybacking on the 5-minute auto-register loop
- The grace period avoids false positives: engine.io always starts with polling before upgrading to WebSocket, so brief polling during normal reconnection handshakes is expected and should not trigger a restart
- Only active when ReflectsRegistration() is true (registration or all mode)
Net effect: restart happens ~30-40s after WebSocket drops, not up to 5min.
https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Instead of only detecting the downstream symptom (polling fallback),
now also track the tunnel lifecycle directly:
- Added primusTunnelConnected flag set/cleared in proxyWebSocket when
a /primus/ WebSocket tunnel is established or closes
- proxyWebSocket now clears polling detection when a fresh WebSocket
tunnel is established (normal engine.io upgrade succeeded)
- The monitor goroutine uses two-phase detection:
  1. Tunnel death: detects when primusTunnelConnected goes from true→false
  2. Polling grace period: waits 30s for engine.io to re-upgrade to
     WebSocket; if still polling after that, forces a broker restart
This cuts detection time: the tunnel death is noticed within 10s
(check interval), then the 30s grace period gives engine.io a chance
to reconnect on its own before we intervene.
https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
…debounce
Replace the two-phase polling detection + monitor goroutine approach with a simple callback from the reflector when the Primus WebSocket tunnel closes. The relay instance manager wires a 30s cooldown to prevent restart loops during startup.
This removes ~80 lines of polling detection code (primusPollingFirstSeen, isPrimusPollingRequest, PrimusPollingDuration, ResetPrimusPollingDetected, the 10s-interval monitor goroutine) in favor of a direct tunnel-death callback that triggers an immediate broker restart.
https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Any WebSocket tunnel through the reflector is broker traffic — there's no need to check if the path is /primus/. Trigger restart on any tunnel death, rename fields/methods from Primus-specific to generic WS tunnel naming. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
Move tunnel close handling into a defer block for safety. Add a minimum tunnel duration check (30s) — if a tunnel dies within seconds of opening, it likely indicates a fundamental issue where restarting won't help. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
The reflector should just fire the callback unconditionally on tunnel close. The relay instance manager's cooldown already handles the restart-loop prevention — no need for two overlapping checks. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS
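On the reflector side, the unconditional callback in a defer block could be as small as the sketch below. The reflector struct, its onTunnelClosed field and the copyStreams parameter are placeholders; the real proxyWebSocket signature is not shown in this PR conversation.

```go
package main

// reflector is a hypothetical, stripped-down stand-in for the real type.
type reflector struct {
	onTunnelClosed func() // set by the relay instance manager; may be nil
}

// proxyWebSocket runs the tunnel and always fires the callback when the
// tunnel ends, whether copyStreams returns normally, returns an error, or
// panics. Deciding whether to restart is left to the callback's owner.
func (r *reflector) proxyWebSocket(copyStreams func() error) error {
	defer func() {
		if r.onTunnelClosed != nil {
			r.onTunnelClosed()
		}
	}()
	return copyStreams()
}
```

Keeping the reflector free of timing logic means only one place, the cooldown in the relay instance manager, decides whether a restart happens.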
136d410 to dbf1154
The traffic watermark is used to detect idle brokers. WebSocket tunnel connections (primus) are infrastructure traffic, not real caller traffic, so they should not reset the idle timer. https://claude.ai/code/session_017w1aQgtC1Khfxo9oAQQYiS