Improve long-running stream timeout UX and restart auto-recovery by jhste102lab · Pull Request #20 · PleasePrompto/ductor

jhste102lab · 2026-03-02T03:32:40Z

Summary

This PR hardens long-running stream behavior and restart recovery to prevent user-visible work loss.

It introduces timeout warning stages, bounded timeout extension, startup lifecycle notifications, and automatic best-effort recovery of interrupted foreground and named-session work.

Problem

Users running long tasks saw streams terminate near the timeout with little context for recovery. After a service restart or reboot, interrupted work often had to be re-prompted manually, causing avoidable friction and repeated effort.

Root Cause

Timeout handling, startup recovery, and ingress exception boundaries were implemented as separate concerns without an end-to-end reliability contract.

Specifically:

- The timeout path had no staged warnings or bounded extension semantics.
- The startup path did not replay all classes of interrupted work.
- Named sessions lacked the persisted prompt context needed for a safe resume.
- Message ingress could leak transient Telegram API and network errors upward.

Scope of Changes

1) `ductor_bot/cli/executor.py`

- Added dynamic timeout cap via `DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS`.
- Added stream warning events at T-60s and T-10s:
  - `timeout_warn_60`
  - `timeout_warn_10`
- Added bounded timeout extension event:
  - `timeout_extended`
- Improved timeout completion messaging so users get actionable guidance instead of ambiguous failure signals.
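
As a rough illustration of how staged warnings and a bounded extension can fit together, here is a minimal sketch. It is not the code from `executor.py`: the `TimeoutController` class, its parameters, the activity-based extension rule, and the final `timeout` event name are assumptions made for the example. Only the three event names listed above come from this PR.

```python
# Warning stages, expressed as (seconds before deadline, event name).
WARN_STAGES = ((60, "timeout_warn_60"), (10, "timeout_warn_10"))


class TimeoutController:
    """Tracks one stream's deadline. Hypothetical helper, not the PR's class."""

    def __init__(self, base_timeout, max_timeout, extend_by=60.0):
        self.deadline = float(base_timeout)
        # Hard cap, e.g. taken from DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS.
        self.max_timeout = float(max_timeout)
        self.extend_by = float(extend_by)
        self._fired = set()

    def tick(self, elapsed, active):
        """Return the events that become due at `elapsed` seconds."""
        events = []
        remaining = self.deadline - elapsed
        for threshold, name in WARN_STAGES:
            # Each warning fires once per deadline, while time is still left.
            if 0 < remaining <= threshold and name not in self._fired:
                self._fired.add(name)
                events.append(name)
        if remaining <= 0:
            if active and self.deadline < self.max_timeout:
                # Extend, but never past the configured cap.
                self.deadline = min(self.deadline + self.extend_by, self.max_timeout)
                self._fired.clear()  # warnings may fire again for the new deadline
                events.append("timeout_extended")
            else:
                events.append("timeout")  # hypothetical name for the hard stop
        return events
```

With a base timeout of 120 s and a cap of 180 s, a stream that is still active at 120 s is extended once to 180 s and then stopped, so execution stays bounded.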

2) `ductor_bot/bot/message_dispatch.py`

Added user-facing message labels for:
- `timeout_warn_60`
- `timeout_warn_10`
- `timeout_extended`

3) `ductor_bot/bot/app.py`

- Added startup lifecycle detection and broadcast:
  - service started
  - service restarted
  - reboot detected
- Added persistent foreground in-flight turn tracking and startup replay.
- Added startup auto-resume for named sessions interrupted before restart.
- Added defensive exception handling in `_on_message` for `TelegramAPIError` and generic unexpected exceptions.
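
The defensive ingress boundary can be pictured roughly as below. This is a sketch under assumptions, not the body of `_on_message`: the `safe_on_message` wrapper and its return values are invented for illustration, and `TelegramAPIError` is a local stand-in class so the example runs without the Telegram client library installed.

```python
import asyncio
import logging

logger = logging.getLogger("ingress")


class TelegramAPIError(Exception):
    """Stand-in for the Telegram client library's API error type."""


async def safe_on_message(handler, message):
    """Run `handler` and keep transient failures from propagating upward."""
    try:
        await handler(message)
        return "ok"
    except TelegramAPIError as exc:
        # Expected, transient class of failure: log it and move on.
        logger.warning("Telegram API error while handling message: %s", exc)
        return "telegram_error"
    except Exception:
        # Anything else is unexpected: keep the traceback, keep the bot alive.
        # asyncio.CancelledError derives from BaseException (Python 3.8+),
        # so task cancellation still passes through this handler untouched.
        logger.exception("Unexpected error while handling message")
        return "unexpected_error"
```

Catching the specific error type first keeps expected API failures at warning level, while the generic branch records a full traceback for anything unanticipated.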

4) `ductor_bot/session/named.py`

- Added `last_prompt` persistence for named sessions.
- Added recovered-running session bookkeeping across startup.
- Added helper APIs:
  - `mark_running(...)`
  - `pop_recovered_running(...)`
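
To make the bookkeeping concrete, here is a small sketch of how prompt persistence and recovery could work together. The real signatures of `mark_running(...)` and `pop_recovered_running(...)` are not shown in this PR description, so the parameters, the `NamedSessionStore` class, the `mark_idle` helper, and the JSON file layout below are all assumptions.

```python
import json
from pathlib import Path


class NamedSessionStore:
    """Hypothetical JSON-backed registry of named sessions."""

    def __init__(self, path: Path):
        self.path = path
        self.sessions = {}
        self._recovered = []
        if path.exists():
            self.sessions = json.loads(path.read_text())
            # A session still marked "running" on disk was cut off by the restart.
            for name, entry in self.sessions.items():
                if entry.get("status") == "running":
                    entry["status"] = "interrupted"
                    self._recovered.append(name)
            self._save()

    def _save(self):
        self.path.write_text(json.dumps(self.sessions))

    def mark_running(self, name, prompt):
        # Persist the latest prompt so the work can be resumed after a restart.
        self.sessions[name] = {"status": "running", "last_prompt": prompt}
        self._save()

    def mark_idle(self, name):
        self.sessions[name]["status"] = "idle"
        self._save()

    def pop_recovered_running(self):
        """Return (name, last_prompt) pairs once, then forget them."""
        recovered = [(n, self.sessions[n]["last_prompt"]) for n in self._recovered]
        self._recovered = []
        return recovered
```

Returning the recovered list only once is one simple way to keep a startup routine from resuming the same session twice.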

5) `ductor_bot/orchestrator/core.py`

- Updated named-session flow to persist latest prompt via `mark_running(...)`.
- Added recovery accessor plumbing via `pop_recovered_named_sessions(...)`.

Behavior Changes (User-visible)

- The stream now shows warning signals before the hard timeout.
- Long-running tasks may continue via bounded extension (config-driven).
- On startup, users receive explicit lifecycle status (start/restart/reboot).
- Interrupted work is resumed automatically when recovery context is available.
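
One way to tell the three startup states apart is to compare a persisted boot identifier with the current one. The sketch below assumes that approach; the `classify_startup` function and the state-file format are hypothetical and are not taken from `app.py`.

```python
import json
from pathlib import Path


def classify_startup(state_file: Path, current_boot_id: str) -> str:
    """Classify this startup and record the current boot ID for next time."""
    previous = None
    if state_file.exists():
        try:
            previous = json.loads(state_file.read_text()).get("boot_id")
        except (json.JSONDecodeError, AttributeError):
            previous = None  # unreadable state is treated like a first start
    state_file.write_text(json.dumps({"boot_id": current_boot_id}))

    if previous is None:
        return "service started"    # no earlier run recorded
    if previous != current_boot_id:
        return "reboot detected"    # the machine booted since the last run
    return "service restarted"      # same boot, new process
```

On Linux the current boot ID can be read from `/proc/sys/kernel/random/boot_id`; other platforms would need a different source.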

Risk / Compatibility

- Timeout extension is bounded and configurable; it does not create unbounded execution.
- Recovery is best-effort with safety limits to avoid replay loops.
- Existing workflows remain compatible; this is reliability hardening, not a workflow rewrite.
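
A replay safety limit could look something like the following sketch. The per-turn `attempts` counter, the `MAX_REPLAY_ATTEMPTS` value, and the record layout are assumptions chosen for illustration; the PR text does not spell out the actual limits.

```python
MAX_REPLAY_ATTEMPTS = 2  # hypothetical limit


def plan_replay(inflight, max_attempts=MAX_REPLAY_ATTEMPTS):
    """Split persisted in-flight turns into those to replay and those to drop."""
    replay, dropped = [], []
    for turn in inflight:
        attempts = turn.get("attempts", 0)
        if not turn.get("prompt") or attempts >= max_attempts:
            # No recovery context, or replayed too often already: give up.
            dropped.append(turn)
        else:
            # Count the attempt before replaying, so a crash during replay
            # still moves the turn closer to the limit.
            replay.append({**turn, "attempts": attempts + 1})
    return replay, dropped
```

Because the counter is incremented before the replay happens, a turn that keeps crashing the service is eventually dropped rather than replayed on every startup.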

Validation

✅ python3 -m py_compile passed for all modified files.
✅ Runtime logs confirmed timeout warnings/extensions and startup recovery pathways.
⚠️ Full pytest suite was not executed in this environment (pytest not installed).

Operational Notes

To enable or adjust the maximum extension cap, set:
- `DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS=<seconds>`

Keep the cap finite and aligned with operational limits.
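
For illustration, the cap could be read and sanity-checked as in the sketch below. The `read_timeout_cap` helper, the 1800 s default, and the 7200 s hard limit are assumptions; only the environment variable name comes from this PR.

```python
import math
import os


def read_timeout_cap(env=os.environ, default=1800.0, hard_limit=7200.0):
    """Return a finite, positive cap in seconds (defaults are hypothetical)."""
    raw = env.get("DUCTOR_DYNAMIC_TIMEOUT_MAX_SECONDS")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # not a number: fall back instead of crashing
    if not math.isfinite(value) or value <= 0:
        return default  # rejects "inf", "nan", zero and negatives
    return min(value, hard_limit)
```

Rejecting non-finite values matters here because `float("inf")` parses successfully and would otherwise disable the bound.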

Checklist

- Root cause identified and documented
- User-visible reliability gaps addressed
- Startup recovery implemented for foreground + named sessions
- Timeout UX improved with warnings and extension signal
- Defensive ingress exception handling added
- Validation run in target runtime

Closes #19

PleasePrompto · 2026-03-02T16:05:25Z

Hey @jhste102lab — thank you for the thorough analysis and the well-structured PR! Your work on timeout resilience, startup recovery, and auto-resume really nailed the core problems.

We couldn't merge this directly due to conflicts with parallel changes on main, but the ideas and approach have been adapted and integrated into v0.11.0. Specifically:

- Timeout controller with staged warnings (T-60s, T-10s) and activity-based extension
- Startup recovery with boot ID tracking, inflight persistence, and automatic resume
- Named session recovery with `last_prompt` persistence

You're credited in the release notes as a contributor. Closing this PR since the changes are live — the related issue #19 is resolved as well.

Thanks again for pushing this forward! 🙌

Improve timeout resilience, restart visibility, and auto-resume recovery

f522030

PleasePrompto closed this Mar 2, 2026

PleasePrompto mentioned this pull request Mar 2, 2026

Long-running stream timeout and restart/reboot can strand active work (missing warnings + auto-recovery) #19

Closed

7 tasks

jhste102lab wants to merge 1 commit into PleasePrompto:main from jhste102lab:fix/stream-timeout-restart-auto-resume
