Improve long-running stream timeout UX and restart auto-recovery #20

Closed
jhste102lab wants to merge 1 commit into PleasePrompto:main from jhste102lab:fix/stream-timeout-restart-auto-resume

@jhste102lab

Summary

This PR hardens long-running stream behavior and restart recovery to prevent user-visible work loss.

It introduces timeout warning stages, bounded timeout extension, startup lifecycle notifications, and automatic best-effort recovery of interrupted foreground and named-session work.

Problem

Users running long tasks saw streams terminate at the timeout with no advance warning and little context for recovering the work. After a service restart or reboot, interrupted work often had to be re-prompted manually, causing avoidable friction and repeated effort.

Root Cause

Timeout handling, startup recovery, and ingress exception boundaries were implemented as separate concerns without an end-to-end reliability contract.

Specifically:

  • the timeout path had no staged warnings or bounded-extension semantics,
  • the startup path did not replay all classes of interrupted work,
  • named sessions lacked the persisted prompt context needed for a safe resume,
  • message ingress could leak transient Telegram API/network errors upward.

Scope of Changes

1) ductor_bot/cli/executor.py

  • Added dynamic timeout cap via DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS.
  • Added stream warning events at T-60s and T-10s:
    • timeout_warn_60
    • timeout_warn_10
  • Added bounded timeout extension event:
    • timeout_extended
  • Improved timeout completion messaging so users get actionable guidance instead of ambiguous failure signals.
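
The staged warnings and bounded extension described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the `TimeoutController` class, its `poll` and `try_extend` methods, and the `extend_by` default are hypothetical names and values. Only the event names and the idea of a hard cap come from the description.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Seconds-before-deadline mapped to the event names listed in this PR.
WARN_OFFSETS = {60: "timeout_warn_60", 10: "timeout_warn_10"}


@dataclass
class TimeoutController:
    """Hypothetical controller; not the PR's actual implementation."""

    timeout: float              # current deadline, in seconds since stream start
    max_timeout: float          # hard cap, e.g. from DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS
    extend_by: float = 120.0    # assumed extension step
    _fired: Set[Tuple[float, int]] = field(default_factory=set, init=False, repr=False)

    def poll(self, elapsed: float) -> List[str]:
        """Return warning events that became due at `elapsed` seconds."""
        events = []
        remaining = self.timeout - elapsed
        for offset in sorted(WARN_OFFSETS, reverse=True):
            # Key on the deadline so warnings fire again after an extension.
            key = (self.timeout, offset)
            if remaining <= offset and key not in self._fired:
                self._fired.add(key)
                events.append(WARN_OFFSETS[offset])
        return events

    def try_extend(self) -> Optional[str]:
        """Extend the deadline without exceeding the cap; None if already at the cap."""
        new_timeout = min(self.timeout + self.extend_by, self.max_timeout)
        if new_timeout <= self.timeout:
            return None
        self.timeout = new_timeout
        return "timeout_extended"
```

Because `try_extend` clamps to `max_timeout`, repeated extensions converge on the cap and then stop, which is what keeps execution bounded.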

2) ductor_bot/bot/message_dispatch.py

  • Added user-facing message labels for:
    • timeout_warn_60
    • timeout_warn_10
    • timeout_extended
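
One simple way to attach labels to these events is a lookup table, sketched below. The label wording and the `label_for_event` helper are hypothetical; this PR description names the events but does not quote the actual message text.

```python
from typing import Dict, Optional

# Hypothetical label text; the PR names the events but not the wording.
EVENT_LABELS: Dict[str, str] = {
    "timeout_warn_60": "About 60 seconds left before this stream times out.",
    "timeout_warn_10": "About 10 seconds left before this stream times out.",
    "timeout_extended": "Timeout extended; the task is still running.",
}


def label_for_event(event: str) -> Optional[str]:
    """Return the user-facing label, or None for events that have no label."""
    return EVENT_LABELS.get(event)
```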

3) ductor_bot/bot/app.py

  • Added startup lifecycle detection and broadcast:
    • service started
    • service restarted
    • reboot detected
  • Added persistent foreground in-flight turn tracking and startup replay.
  • Added startup auto-resume for named sessions interrupted before restart.
  • Added defensive exception handling in _on_message for TelegramAPIError and generic unexpected exceptions.
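
To illustrate how the three lifecycle states could be told apart, here is a minimal sketch that compares a persisted boot identifier with the current one. The `classify_startup` function, the JSON state file, and the use of a boot ID are assumptions made for illustration; the PR description does not specify the detection mechanism.

```python
import json
from pathlib import Path


def classify_startup(state_file: Path, current_boot_id: str) -> str:
    """Classify this startup and persist the boot ID for the next run.

    Hypothetical helper. On Linux, one possible source for `current_boot_id`
    is the contents of /proc/sys/kernel/random/boot_id.
    """
    previous = None
    if state_file.exists():
        try:
            data = json.loads(state_file.read_text())
            if isinstance(data, dict):
                previous = data.get("boot_id")
        except (json.JSONDecodeError, OSError):
            previous = None  # unreadable state is treated like a first start

    state_file.write_text(json.dumps({"boot_id": current_boot_id}))

    if previous is None:
        return "service started"
    if previous != current_boot_id:
        return "reboot detected"
    return "service restarted"
```

The returned string can then be broadcast to users before any replay of in-flight work begins.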

4) ductor_bot/session/named.py

  • Added last_prompt persistence for named sessions.
  • Added recovered-running session bookkeeping across startup.
  • Added helper APIs:
    • mark_running(...)
    • pop_recovered_running(...)
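
A rough sketch of how `last_prompt` persistence and the two helpers might fit together is shown below. The `NamedSessionStore` class, the JSON file layout, the `mark_idle` method, and the argument lists of `mark_running` and `pop_recovered_running` are all assumptions; the PR description elides the real signatures.

```python
import json
from pathlib import Path
from typing import Dict


class NamedSessionStore:
    """Hypothetical store; method arguments are assumed, not taken from the PR."""

    def __init__(self, path: Path) -> None:
        self._path = path
        self._sessions: Dict[str, dict] = {}
        self._recovered: Dict[str, str] = {}
        if path.exists():
            self._sessions = json.loads(path.read_text())
            # Anything still marked "running" on disk was interrupted by the restart.
            for name, session in self._sessions.items():
                if session.get("status") == "running":
                    self._recovered[name] = session.get("last_prompt", "")
                    session["status"] = "interrupted"
            self._save()

    def _save(self) -> None:
        self._path.write_text(json.dumps(self._sessions))

    def mark_running(self, name: str, prompt: str) -> None:
        """Record that `name` is running and persist its latest prompt."""
        self._sessions[name] = {"status": "running", "last_prompt": prompt}
        self._save()

    def mark_idle(self, name: str) -> None:
        """Record that `name` finished normally (assumed helper)."""
        if name in self._sessions:
            self._sessions[name]["status"] = "idle"
            self._save()

    def pop_recovered_running(self) -> Dict[str, str]:
        """Return interrupted sessions and their prompts once, then clear them."""
        recovered, self._recovered = self._recovered, {}
        return recovered
```

Rewriting the on-disk status to `interrupted` at load time means a session is offered for resume at most once, which is one way to avoid replay loops across repeated restarts.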

5) ductor_bot/orchestrator/core.py

  • Updated named-session flow to persist latest prompt via mark_running(...).
  • Added recovery accessor plumbing via pop_recovered_named_sessions(...).

Behavior Changes (User-visible)

  • Streams now show staged warnings before the hard timeout.
  • Long-running tasks may continue via a bounded, config-driven extension.
  • On startup, users receive an explicit lifecycle status (start/restart/reboot).
  • Interrupted work is resumed automatically when recovery context is available.

Risk / Compatibility

  • Timeout extension is bounded and configurable; it does not create unbounded execution.
  • Recovery is best-effort with safety limits to avoid replay loops.
  • Existing workflows remain compatible; this is reliability hardening, not a workflow rewrite.

Validation

  • python3 -m py_compile passed for all modified files.
  • ✅ Runtime logs confirmed timeout warnings/extensions and startup recovery pathways.
  • ⚠️ Full pytest suite was not executed in this environment (pytest not installed).

Operational Notes

  • To enable or adjust the maximum extension cap, set:
    • DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS=<seconds>
  • Keep the cap finite and aligned with operational limits.
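
As an example of keeping the cap finite, a reader of this variable could reject missing, malformed, non-positive, or infinite values and fall back to a default. The `dynamic_timeout_cap` helper and the 3600-second default below are hypothetical; only the variable name comes from this PR.

```python
import math
import os
from typing import Mapping, Optional

ENV_NAME = "DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS"
DEFAULT_CAP_SECONDS = 3600.0  # assumed fallback, not a value from the PR


def dynamic_timeout_cap(
    env: Optional[Mapping[str, str]] = None,
    default: float = DEFAULT_CAP_SECONDS,
) -> float:
    """Read the extension cap, falling back to `default` for unusable values."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject inf/nan and non-positive values so the cap always stays finite.
    if not math.isfinite(value) or value <= 0:
        return default
    return value
```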

Checklist

  • Root cause identified and documented
  • User-visible reliability gaps addressed
  • Startup recovery implemented for foreground + named sessions
  • Timeout UX improved with warnings and extension signal
  • Defensive ingress exception handling added
  • Validation run in target runtime

Closes #19

@PleasePrompto
Owner

Hey @jhste102lab — thank you for the thorough analysis and the well-structured PR! Your work on timeout resilience, startup recovery, and auto-resume really nailed the core problems.

We couldn't merge this directly due to conflicts with parallel changes on main, but the ideas and approach have been adapted and integrated into v0.11.0. Specifically:

  • Timeout controller with staged warnings (T-60s, T-10s) and activity-based extension
  • Startup recovery with boot ID tracking, inflight persistence, and automatic resume
  • Named session recovery with last_prompt persistence

You're credited in the release notes as a contributor. Closing this PR since the changes are live — the related issue #19 is resolved as well.

Thanks again for pushing this forward! 🙌

Successfully merging this pull request may close these issues.

Long-running stream timeout and restart/reboot can strand active work (missing warnings + auto-recovery)

2 participants