feat: Add session resilience and context budget management #9
base: main
This commit introduces two major features: WebSocket session resilience for surviving temporary disconnects, and proactive context budget management to prevent context window overflow.

## Session Resilience (Backend + Frontend)

- JWT-based session security with HttpOnly fingerprint cookie binding (RFC 8725)
- Dual heartbeat mechanism (server 30s ping + client 20s heartbeat)
- 120-second reconnection grace period with event buffering
- Automatic token refresh at 90% of JWT lifespan
- Run state preservation during disconnects

New files:

- core/api/connection_manager.py - Connection state and event buffering
- core/api/session_security.py - JWT lifecycle and fingerprint binding
- frontend/lib/sessionManager.ts - Client-side session management

## Context Budget Management

- Provider-aware token counting (Anthropic, OpenAI, Google APIs)
- Circuit breaker pattern: warning at 40%, force completion at 55%
- Budget-aware content selection for work module inheritance
- Per-worker budget allocation for parallel execution

New files:

- core/agent_core/llm/token_counter.py - Accurate provider-specific counting
- core/agent_core/framework/context_budget_guardian.py - Threshold monitoring
- core/agent_core/utils/content_selection.py - Budget-aware inheritance

## Test Coverage

- 878 backend unit tests (33 new test files)
- 25 frontend tests for session management
- Coverage for all new modules

## Documentation

- docs/architecture/session-resilience.md - Full design specification
- docs/architecture/context-budget-management.md - Budget system design
- docs/guides/04-debugging.md - Added CLI tools documentation
- scripts/analyze_session.py - Session analysis utility
- scripts/commonground.sh - Service manager script

## Other Changes

- Graceful shutdown with connection cleanup
- Updated pyproject.toml with uv export instructions
- Anthropic-specific LLM configs for accurate token budgeting
- Agent profile updates for budget-aware operation
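For readers who want a feel for the reconnection grace period with event buffering, here is a minimal sketch. It is not the code in `core/api/connection_manager.py`: the class name `BufferedSession`, its fields, the `max_buffered` cap, and the injectable `clock` are all assumptions made for illustration. Only the 120-second grace period comes from the description above.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, List, Optional

GRACE_PERIOD_SECONDS = 120.0  # value stated in the PR description


@dataclass
class BufferedSession:
    """Hypothetical per-session state (illustrative, not the PR's class)."""

    clock: Callable[[], float] = time.monotonic  # injectable for testing
    max_buffered: int = 1000  # assumed cap; the PR does not state one
    connected: bool = True
    disconnected_at: Optional[float] = None
    buffer: Deque[Any] = field(default_factory=deque)
    delivered: List[Any] = field(default_factory=list)  # stands in for the socket

    def send(self, event: Any) -> None:
        """Deliver immediately when connected, otherwise buffer the event."""
        if self.connected:
            self.delivered.append(event)
            return
        if len(self.buffer) >= self.max_buffered:
            self.buffer.popleft()  # drop the oldest event when the cap is hit
        self.buffer.append(event)

    def disconnect(self) -> None:
        """Mark the session as disconnected and start the grace period."""
        self.connected = False
        self.disconnected_at = self.clock()

    def expired(self) -> bool:
        """True once a disconnected session has outlived the grace period."""
        if self.connected or self.disconnected_at is None:
            return False
        return self.clock() - self.disconnected_at > GRACE_PERIOD_SECONDS

    def reconnect(self) -> bool:
        """Replay buffered events in order if still within the grace period."""
        if self.expired():
            self.buffer.clear()
            return False
        self.connected = True
        self.disconnected_at = None
        while self.buffer:
            self.delivered.append(self.buffer.popleft())
        return True
```

A client that reconnects within the window receives the missed events in their original order; one that returns too late gets `False` and would need to start a fresh session.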
Hi team, here are some stability and resilience fixes that I made to support integration of my own MCP server (a proprietary knowledge base for wheelchair seating/mobility, which can quickly overwhelm a model's context without proper controls). My strengths lie mostly in backend infrastructure, so I kept my frontend changes light: just enough stability to properly evaluate this solution. I introduced unit test infrastructure that captures all current backend behavior. Respect! This is my way of showing, in a (hopefully) useful way, that I support you and what you are trying to do. Any and all constructive criticism, review, and suggestions are welcome.
…wareness

Context Budget System:
- Add context_admission_controller for pre-admission budget enforcement
- Add context_budget_handback for Principal-delegated summarization
- Update thresholds: WARNING 60%, CRITICAL 75%, EXCEEDED 85%
- Implement agent-type-aware forcing (Principal/Associate only)
- Partner agents receive guidance only (no flow-ending tools)

Orphan Detection:
- Add detect_orphaned_tool_interactions() to turn_manager
- Add finalize_orphaned_tool_interactions() for recovery
- Add detect_dispatch_anomalies() to dispatcher_node

Session Analysis (analyze_session.py):
- Add --mode handoff/thrashing/errors analysis modes
- Fix analyze_work_modules() to aggregate ALL context_archive entries
- Add dispatch_count tracking for thrashing detection
- Improve error detection to avoid false positives

Bug Fixes:
- Fix DuckDBRAGStore unawaited coroutine warning (lazy init)
- Rename test_jina_* to check_jina_* to avoid pytest auto-discovery
- Remove unused pythonjsonlogger import (deprecation warning)
- Fix sessionManager to always create fresh session_id for WS

Frontend:
- Increase node fallback dimensions for better visual fit
- Fix sessionManager reconnection flow

Docs:
- Update context-budget-management.md with implementation status

Tests: 934 passed, 1 skipped
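The threshold and agent-type rules in this commit can be sketched roughly as follows. This is an illustration, not the guardian's real code: `BudgetStatus`, `classify`, `should_force_completion`, and the lowercase agent-type strings are hypothetical names. The percentages and the Principal/Associate restriction come from the commit message; the level at which forcing begins (`force_at`) is an assumption, since the message does not say.

```python
from enum import IntEnum


class BudgetStatus(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2
    EXCEEDED = 3


# Percent-of-context-window thresholds from the commit message,
# ordered from highest to lowest so the first match wins.
THRESHOLDS = (
    (85, BudgetStatus.EXCEEDED),
    (75, BudgetStatus.CRITICAL),
    (60, BudgetStatus.WARNING),
)

# Per the commit message, only these agent types may be forced to finish;
# Partner agents receive guidance only.
FORCEABLE_AGENT_TYPES = frozenset({"principal", "associate"})


def classify(used_tokens: int, context_window: int) -> BudgetStatus:
    """Map token usage to a budget status using integer arithmetic."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    for percent, status in THRESHOLDS:
        if used_tokens * 100 >= percent * context_window:
            return status
    return BudgetStatus.OK


def should_force_completion(
    status: BudgetStatus,
    agent_type: str,
    force_at: BudgetStatus = BudgetStatus.CRITICAL,  # assumed, not from the PR
) -> bool:
    """Return True only for forceable agent types at or above `force_at`."""
    return agent_type.lower() in FORCEABLE_AGENT_TYPES and status >= force_at
```

Comparing `used_tokens * 100` against `percent * context_window` keeps the boundary checks exact and avoids floating-point rounding at the thresholds.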
Flow visualization improvements:
- Add dynamic minZoom that adapts to card count (see all cards at min zoom)
- Fix maxZoom at 1.5x for readable card text regardless of card count
- Align scroll wheel zoom speed between minimap and canvas (~9 clicks)
- Add translateExtent to constrain panning within node bounds
- Add status-based MiniMap colors (blue=running, green=success, red=error)

Scroll and layout fixes:
- Fix page-level scrolling by adding overflow:hidden to html/body/SidebarProvider
- Fix auto-scroll on page load (scrollIntoView block:'nearest')
- Add overscroll-contain to ChatHistory to prevent scroll chaining

Swim lane layout (flow-utils.ts):
- Rewrite layout algorithm for fixed-width swim lanes per agent
- Increase node fallback dimensions for better readability
- Add minimum dimension enforcement in getNodeSize()

Files changed:
- FlowView.tsx: zoom config, MiniMap styling, ReactFlowProvider wrapper
- ChatLayout.tsx: overflow-hidden on panels
- Workspace.tsx: overflow-hidden on container
- ChatHistory.tsx: overscroll-contain
- flow-utils.ts: swim lane algorithm
- globals.css: html/body overflow hidden
- layout.tsx: SidebarProvider height constraints
- r/page.tsx: scrollIntoView fix
Associates that output JSON deliverables without calling `finish_flow`
would have their work lost, as the system only triggers deliverable
extraction when `finish_flow` is invoked. Live session analysis
revealed this caused re-dispatching.
The `generate_message_summary` instructional prompt told agents "DO NOT
call any tools" after outputting JSON, but `finish_flow` IS required to
trigger `_extract_deliverables_from_messages()` and capture the work.
- Updated instructional prompt to explicitly describe the 3-response
sequence: generate_message_summary → JSON output → finish_flow
- Added critical warning about deliverable capture requirement
- Added "Finish Protocol" section documenting the completion sequence
- Updated self-reflection to detect JSON-without-finish_flow state
- Fixed observation text to match actual trigger conditions
- Added "CRITICAL CHECK" for JSON deliverable detection
- Updated instructions to guide agents through finish protocol
- Fixed incomplete sentence ("MUST synthesis" → proper guidance)
- Updated Deliver step to mention `finish_flow` requirement
- Analyzed production runs confirming the JSON → finish_flow
sequence across all completed work modules
- All unit tests pass
- No regressions expected - changes are corrective/additive
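As a rough illustration of the "JSON without `finish_flow`" state described above, the check could look something like this. The message shape (dicts with `role`, `content`, and a `tool_calls` list of `{"name": ...}` entries) and the helper names are assumptions for the sketch, not the project's actual data model or self-reflection code.

```python
import json
from typing import Any, Dict, List


def has_json_deliverable(content: Any) -> bool:
    """True if the content parses as a JSON object or array."""
    if not isinstance(content, str):
        return False
    try:
        parsed = json.loads(content.strip())
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, (dict, list))


def needs_finish_flow(messages: List[Dict[str, Any]]) -> bool:
    """Detect a JSON deliverable that was never followed by finish_flow.

    Walks assistant messages in order. A JSON output sets the pending flag;
    a later finish_flow tool call clears it, since that call is what
    triggers deliverable extraction.
    """
    pending = False
    for message in messages:
        if message.get("role") != "assistant":
            continue
        tool_names = [call.get("name") for call in message.get("tool_calls", [])]
        if "finish_flow" in tool_names:
            pending = False
        elif has_json_deliverable(message.get("content")):
            pending = True
    return pending
```

A `True` result corresponds to the case where the work would be lost and the module re-dispatched; the expected sequence (generate_message_summary, then JSON output, then `finish_flow`) returns `False`.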
Flow visualization now groups disconnected subgraphs into time-sorted epochs, ensuring timestamps always flow top-to-bottom (swimlane style).

Changes:
- Detect epochs via flood-fill of disconnected turn subgraphs
- Sort epochs by earliest timestamp for chronological ordering
- Add epoch separator nodes between epochs with proper labels
- Add "Epoch 1" header when multiple epochs exist
- Create edges connecting separators to adjacent epoch roots/leaves
- Filter Partner and user_turn before epoch detection
- Add epoch_separator nodeType to frontend FlowView component
- Update FlowViewModel documentation in API reference
- Add comprehensive unit tests for epoch detection logic

Fixes issue where cards from re-dispatched work modules appeared out of chronological order in the flow visualization.
- Detect disconnected subgraphs as epochs, sort by timestamp
- Add epoch separator nodes with edges to adjacent epochs
- Show "Epoch N" headers only when multiple epochs exist
- Filter Partner/user_turn before epoch detection
- Update frontend to render epoch_separator nodeType
- Update API docs for FlowViewModel epoch fields
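The epoch detection described in these commits (flood-fill over disconnected turn subgraphs, then sort by earliest timestamp) might be sketched as below. The function name `detect_epochs` and its inputs (a mapping of turn id to timestamp, plus a list of edges) are assumptions for illustration; the real implementation also filters Partner and user_turn nodes and emits separator nodes, which this sketch leaves out.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple


def detect_epochs(
    turns: Dict[str, float],
    edges: List[Tuple[str, str]],
) -> List[List[str]]:
    """Group turns into connected components and order them by time.

    `turns` maps a turn id to its timestamp; `edges` links related turns.
    Edges are treated as undirected, and edges that mention unknown turns
    are ignored.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        if source in turns and target in turns:
            adjacency[source].add(target)
            adjacency[target].add(source)

    seen: Set[str] = set()
    epochs: List[List[str]] = []
    for start in turns:
        if start in seen:
            continue
        # Breadth-first flood-fill from an unvisited turn.
        component: List[str] = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        # Order turns inside the epoch by timestamp, with id as a tiebreaker.
        component.sort(key=lambda turn_id: (turns[turn_id], turn_id))
        epochs.append(component)

    # Order epochs by their earliest turn so time flows top-to-bottom.
    epochs.sort(key=lambda epoch: (turns[epoch[0]], epoch[0]))
    return epochs
```

With more than one epoch in the result, a caller could then insert a separator between consecutive entries and label them "Epoch 1", "Epoch 2", and so on.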