@mdear mdear commented Dec 29, 2025

This commit introduces two major features: WebSocket session resilience for surviving temporary disconnects, and proactive context budget management to prevent context window overflow.

Session Resilience (Backend + Frontend)

  • JWT-based session security with HttpOnly fingerprint cookie binding (RFC 8725)
  • Dual heartbeat mechanism (server 30s ping + client 20s heartbeat)
  • 120-second reconnection grace period with event buffering
  • Automatic token refresh at 90% of JWT lifespan
  • Run state preservation during disconnects
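The grace period and refresh timing listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `core/api/connection_manager.py`: the `BufferedSession` class, its method names, and the `refresh_at` helper are hypothetical. Only the 120-second window and the 90% refresh point come from this description.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

GRACE_PERIOD_S = 120.0   # reconnection grace period from the PR description
REFRESH_FRACTION = 0.9   # refresh at 90% of the JWT lifespan


def refresh_at(issued_at: float, expires_at: float) -> float:
    """Return the time at which a client would refresh its token."""
    return issued_at + REFRESH_FRACTION * (expires_at - issued_at)


@dataclass
class BufferedSession:
    """Hypothetical per-session state that buffers events while disconnected."""
    disconnected_at: Optional[float] = None
    pending: deque = field(default_factory=deque)

    def on_disconnect(self, now: float) -> None:
        self.disconnected_at = now

    def emit(self, event: dict, now: float) -> bool:
        """Return False once the grace period has lapsed, True otherwise."""
        if self.disconnected_at is None:
            return True  # connected: a real implementation would send here
        if now - self.disconnected_at > GRACE_PERIOD_S:
            return False
        self.pending.append(event)
        return True

    def on_reconnect(self, now: float) -> list:
        """Return buffered events to replay, or nothing if the window expired."""
        if self.disconnected_at is None:
            return []
        expired = now - self.disconnected_at > GRACE_PERIOD_S
        self.disconnected_at = None
        replay = [] if expired else list(self.pending)
        self.pending.clear()
        return replay
```

Passing `now` explicitly instead of reading the clock keeps the sketch deterministic and easy to unit test.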

New files:

  • core/api/connection_manager.py - Connection state and event buffering
  • core/api/session_security.py - JWT lifecycle and fingerprint binding
  • frontend/lib/sessionManager.ts - Client-side session management

Context Budget Management

  • Provider-aware token counting (Anthropic, OpenAI, Google APIs)
  • Circuit breaker pattern: warning at 40%, force completion at 55%
  • Budget-aware content selection for work module inheritance
  • Per-worker budget allocation for parallel execution
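As a rough illustration of the circuit breaker thresholds above, the check reduces to comparing a usage ratio against two cut-offs. The function name and return values below are assumptions for this sketch and are not taken from `context_budget_guardian.py`.

```python
WARNING_RATIO = 0.40           # warning threshold from the PR description
FORCE_COMPLETION_RATIO = 0.55  # force-completion threshold from the PR description


def budget_action(tokens_used: int, context_window: int) -> str:
    """Map current context usage to a hypothetical action label."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    ratio = tokens_used / context_window
    if ratio >= FORCE_COMPLETION_RATIO:
        return "force_completion"
    if ratio >= WARNING_RATIO:
        return "warn"
    return "ok"
```

For example, `budget_action(90_000, 200_000)` falls in the warning band, since 45% lies between the two thresholds.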

New files:

  • core/agent_core/llm/token_counter.py - Accurate provider-specific counting
  • core/agent_core/framework/context_budget_guardian.py - Threshold monitoring
  • core/agent_core/utils/content_selection.py - Budget-aware inheritance

Test Coverage

  • 878 backend unit tests (33 new test files)
  • 25 frontend tests for session management
  • Coverage for all new modules

Documentation

  • docs/architecture/session-resilience.md - Full design specification
  • docs/architecture/context-budget-management.md - Budget system design
  • docs/guides/04-debugging.md - Added CLI tools documentation
  • scripts/analyze_session.py - Session analysis utility
  • scripts/commonground.sh - Service manager script

Other Changes

  • Graceful shutdown with connection cleanup
  • Updated pyproject.toml with uv export instructions
  • Anthropic-specific LLM configs for accurate token budgeting
  • Agent profile updates for budget-aware operation

mdear commented Dec 29, 2025

Hi team, here are some stability and resilience fixes I made to support integrating my own MCP server (a proprietary knowledge base for wheelchair seating and mobility, which can quickly overwhelm a model's context without proper controls).

My strengths lie mostly in backend infrastructure, so I kept my frontend changes light: just enough to make the app stable enough for me to evaluate this solution properly.

I introduced unit test infrastructure that captures all current backend behavior.
I only did light unit testing on the frontend and would appreciate review from anyone with more frontend expertise than I have.

Respect! This is my way of showing in a (hopefully) useful way that I support you and what you are trying to do.

Any and all constructive criticism/review/suggestions are welcome.

Myles Dear added 6 commits December 30, 2025 23:18
…wareness

Context Budget System:
- Add context_admission_controller for pre-admission budget enforcement
- Add context_budget_handback for Principal-delegated summarization
- Update thresholds: WARNING 60%, CRITICAL 75%, EXCEEDED 85%
- Implement agent-type-aware forcing (Principal/Associate only)
- Partner agents receive guidance only (no flow-ending tools)
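The revised thresholds and the agent-type rule could be expressed along these lines. This is a sketch under stated assumptions: the enum, the lowercase agent-type strings, and the choice to force completion from CRITICAL upward are illustrative guesses, since the commit message does not say at which level forcing begins.

```python
from enum import Enum


class BudgetLevel(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"
    EXCEEDED = "exceeded"


# Thresholds from the commit message, checked from highest to lowest.
THRESHOLDS = [
    (0.85, BudgetLevel.EXCEEDED),
    (0.75, BudgetLevel.CRITICAL),
    (0.60, BudgetLevel.WARNING),
]

# Only these agent types may be forced to finish; Partner gets guidance only.
FORCIBLE_AGENT_TYPES = {"principal", "associate"}


def classify(ratio: float) -> BudgetLevel:
    """Return the highest budget level whose threshold the ratio has reached."""
    for threshold, level in THRESHOLDS:
        if ratio >= threshold:
            return level
    return BudgetLevel.OK


def response_for(agent_type: str, level: BudgetLevel) -> str:
    """Pick a response; forcing at CRITICAL and above is an assumption."""
    if level is BudgetLevel.OK:
        return "none"
    severe = level in (BudgetLevel.CRITICAL, BudgetLevel.EXCEEDED)
    if severe and agent_type in FORCIBLE_AGENT_TYPES:
        return "force_completion"
    return "guidance"
```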

Orphan Detection:
- Add detect_orphaned_tool_interactions() to turn_manager
- Add finalize_orphaned_tool_interactions() for recovery
- Add detect_dispatch_anomalies() to dispatcher_node
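One way to think about orphan detection is as a set difference between requested and answered tool calls. The sketch below assumes an OpenAI-style message shape (`tool_calls` with an `id`, tool results with `tool_call_id`). That shape and the function name are assumptions, not the actual data model used by `turn_manager`.

```python
def find_orphaned_tool_calls(messages: list) -> list:
    """Return ids of tool calls that never received a tool result."""
    requested = []    # call ids in the order they were issued
    answered = set()  # call ids that have a matching tool result
    for message in messages:
        role = message.get("role")
        if role == "assistant":
            for call in message.get("tool_calls") or []:
                requested.append(call["id"])
        elif role == "tool":
            answered.add(message["tool_call_id"])
    return [call_id for call_id in requested if call_id not in answered]
```

A recovery step could then attach a synthetic "interrupted" result to each returned id so the conversation stays well-formed for the next model call.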

Session Analysis (analyze_session.py):
- Add --mode handoff/thrashing/errors analysis modes
- Fix analyze_work_modules() to aggregate ALL context_archive entries
- Add dispatch_count tracking for thrashing detection
- Improve error detection to avoid false positives
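Thrashing detection from dispatch counts can be sketched as a simple tally. The `work_module_id` field and the default limit of two dispatches are hypothetical; the real heuristics in `analyze_session.py` are not shown in this PR.

```python
from collections import Counter


def find_thrashing(dispatches: list, max_dispatches: int = 2) -> dict:
    """Return work modules dispatched more often than max_dispatches."""
    counts = Counter(d["work_module_id"] for d in dispatches)
    return {module: n for module, n in counts.items() if n > max_dispatches}
```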

Bug Fixes:
- Fix DuckDBRAGStore unawaited coroutine warning (lazy init)
- Rename test_jina_* to check_jina_* to avoid pytest auto-discovery
- Remove unused pythonjsonlogger import (deprecation warning)
- Fix sessionManager to always create fresh session_id for WS

Frontend:
- Increase node fallback dimensions for better visual fit
- Fix sessionManager reconnection flow

Docs:
- Update context-budget-management.md with implementation status

Tests: 934 passed, 1 skipped

Flow visualization improvements:
- Add dynamic minZoom that adapts to card count (see all cards at min zoom)
- Fix maxZoom at 1.5x for readable card text regardless of card count
- Align scroll wheel zoom speed between minimap and canvas (~9 clicks)
- Add translateExtent to constrain panning within node bounds
- Add status-based MiniMap colors (blue=running, green=success, red=error)

Scroll and layout fixes:
- Fix page-level scrolling by adding overflow:hidden to html/body/SidebarProvider
- Fix auto-scroll on page load (scrollIntoView block:'nearest')
- Add overscroll-contain to ChatHistory to prevent scroll chaining

Swim lane layout (flow-utils.ts):
- Rewrite layout algorithm for fixed-width swim lanes per agent
- Increase node fallback dimensions for better readability
- Add minimum dimension enforcement in getNodeSize()

Files changed:
- FlowView.tsx: zoom config, MiniMap styling, ReactFlowProvider wrapper
- ChatLayout.tsx: overflow-hidden on panels
- Workspace.tsx: overflow-hidden on container
- ChatHistory.tsx: overscroll-contain
- flow-utils.ts: swim lane algorithm
- globals.css: html/body overflow hidden
- layout.tsx: SidebarProvider height constraints
- r/page.tsx: scrollIntoView fix

Associates that output JSON deliverables without calling `finish_flow`
would have their work lost, as the system only triggers deliverable
extraction when `finish_flow` is invoked. Live session analysis
revealed this caused re-dispatching.

The `generate_message_summary` instructional prompt told agents "DO NOT
call any tools" after outputting JSON, but `finish_flow` IS required to
trigger `_extract_deliverables_from_messages()` and capture the work.

- Updated instructional prompt to explicitly describe the 3-response
  sequence: generate_message_summary → JSON output → finish_flow
- Added critical warning about deliverable capture requirement

- Added "Finish Protocol" section documenting the completion sequence
- Updated self-reflection to detect JSON-without-finish_flow state
- Fixed observation text to match actual trigger conditions

- Added "CRITICAL CHECK" for JSON deliverable detection
- Updated instructions to guide agents through finish protocol
- Fixed incomplete sentence ("MUST synthesis" → proper guidance)

- Updated Deliver step to mention `finish_flow` requirement

- Analyzed production runs confirming the JSON → finish_flow
  sequence across all completed work modules
- All unit tests pass
- No regressions expected; changes are corrective/additive

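The JSON-without-`finish_flow` state described in this commit can be checked roughly as follows. The message shape (`role`, `content`, `tool_calls` entries with a `name`) and all function names are assumptions made for illustration. Only the `finish_flow` tool name comes from the commit message.

```python
import json


def is_json_deliverable(content) -> bool:
    """Treat content that parses as a JSON object as a deliverable."""
    try:
        return isinstance(json.loads(content or ""), dict)
    except (ValueError, TypeError):
        return False


def calls_finish_flow(message: dict) -> bool:
    """Check whether an assistant message invokes the finish_flow tool."""
    calls = message.get("tool_calls") or []
    return any(call.get("name") == "finish_flow" for call in calls)


def json_without_finish_flow(messages: list) -> bool:
    """Return True if a JSON deliverable was output but never finalized."""
    pending = False
    for message in messages:
        if message.get("role") != "assistant":
            continue
        if calls_finish_flow(message):
            pending = False  # deliverable extraction would be triggered here
        elif is_json_deliverable(message.get("content")):
            pending = True
    return pending
```

A self-reflection step could use a check like this to prompt the agent to call `finish_flow` before its work is lost.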
Flow visualization now groups disconnected subgraphs into time-sorted
epochs, ensuring timestamps always flow top-to-bottom (swimlane style).

Changes:
- Detect epochs via flood-fill of disconnected turn subgraphs
- Sort epochs by earliest timestamp for chronological ordering
- Add epoch separator nodes between epochs with proper labels
- Add "Epoch 1" header when multiple epochs exist
- Create edges connecting separators to adjacent epoch roots/leaves
- Filter Partner and user_turn before epoch detection
- Add epoch_separator nodeType to frontend FlowView component
- Update FlowViewModel documentation in API reference
- Add comprehensive unit tests for epoch detection logic
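The epoch detection steps above amount to finding connected components and ordering them by time. A minimal sketch, assuming nodes are identified by string ids with numeric timestamps and that edges are treated as undirected. The function name and input shapes are hypothetical and do not reflect the FlowViewModel code.

```python
from collections import defaultdict


def detect_epochs(timestamps: dict, edges: list) -> list:
    """Group nodes into connected components, sorted by earliest timestamp."""
    adjacency = defaultdict(set)
    for a, b in edges:
        # Skip edges touching nodes that were filtered out beforehand.
        if a in timestamps and b in timestamps:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    epochs = []
    for start in timestamps:
        if start in seen:
            continue
        # Flood-fill one disconnected subgraph with an explicit stack.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        component.sort(key=lambda n: timestamps[n])
        epochs.append(component)

    # Each component is sorted, so its first node carries the earliest time.
    epochs.sort(key=lambda comp: timestamps[comp[0]])
    return epochs
```

Separator nodes would then be inserted between consecutive entries of the returned list.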

Fixes issue where cards from re-dispatched work modules appeared
out of chronological order in the flow visualization.

- Detect disconnected subgraphs as epochs, sort by timestamp
- Add epoch separator nodes with edges to adjacent epochs
- Show "Epoch N" headers only when multiple epochs exist
- Filter Partner/user_turn before epoch detection
- Update frontend to render epoch_separator nodeType
- Update API docs for FlowViewModel epoch fields