---
phase: 03-lifecycle-management
plan: 02
type: execute
wave: 2
depends_on: ["03-01"]
files_modified:
  - telegram/bot.py
autonomous: true

must_haves:
  truths:
    - "Session suspends automatically after idle timeout (subprocess terminated, status set to suspended)"
    - "User message to suspended session resumes it with --continue and shows 'Resuming session...' status"
    - "Resume failure sends error to user and does not auto-create fresh session"
    - "Race between timeout-fire and user-message is prevented by asyncio.Lock"
    - "Bot startup kills orphaned subprocess PIDs and sets all sessions to suspended"
    - "Bot shutdown terminates all subprocesses gracefully (SIGTERM + 5s timeout + SIGKILL)"
    - "/timeout sets per-session idle timeout (1-120 range)"
    - "/sessions lists all sessions with status indicator, persona, and last active time"
  artifacts:
    - path: "telegram/bot.py"
      provides: "Suspend/resume wiring, idle timers, /timeout, /sessions, startup cleanup, graceful shutdown"
      contains: "idle_timers"
  key_links:
    - from: "telegram/bot.py"
      to: "telegram/idle_timer.py"
      via: "import and instantiate SessionIdleTimer per session"
      pattern: "from idle_timer import SessionIdleTimer"
    - from: "telegram/bot.py on_complete callback"
      to: "idle_timer.reset()"
      via: "Timer starts after Claude finishes processing"
      pattern: "idle_timers.*reset"
    - from: "telegram/bot.py handle_message"
      to: "resume logic"
      via: "Detect suspended session, spawn with --continue, send status"
      pattern: "Resuming session"
    - from: "telegram/bot.py suspend_session"
      to: "ClaudeSubprocess.terminate()"
      via: "Idle timer fires, terminates subprocess"
      pattern: "await.*terminate"
---

Wire suspend/resume lifecycle, idle timers, new commands, and cleanup into the bot.

Purpose: This is the core integration plan that makes sessions automatically suspend after idle timeout, resume transparently on user message, and provides /timeout + /sessions commands. Also adds startup orphan cleanup and graceful shutdown signal handling.
Output: Updated `bot.py` with full lifecycle management

@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md

@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-lifecycle-management/03-CONTEXT.md
@.planning/phases/03-lifecycle-management/03-RESEARCH.md
@.planning/phases/03-lifecycle-management/03-01-SUMMARY.md
@telegram/bot.py
@telegram/idle_timer.py
@telegram/session_manager.py
@telegram/claude_subprocess.py

Task 1: Suspend/resume wiring with race locks, startup cleanup, and graceful shutdown

telegram/bot.py

This is the core lifecycle wiring in bot.py. Make these changes:

**New imports and globals:**
- `import signal, os` (for shutdown handlers and PID checks)
- `from idle_timer import SessionIdleTimer`
- Add global dict: `idle_timers: dict[str, SessionIdleTimer] = {}`
- Add global dict: `subprocess_locks: dict[str, asyncio.Lock] = {}` (one lock per session, prevents races between timeout-fire and user-message)

**Helper: get_subprocess_lock(session_name)**
- Returns existing lock or creates new one for session. Pattern: `subprocess_locks.setdefault(session_name, asyncio.Lock())`

**Suspend function: `async def suspend_session(session_name: str)`**
- This is the idle timer's on_timeout callback.
- Acquire the session's subprocess lock.
- Check if subprocess exists and is_alive. If not alive, just update metadata and return.
- Check `subprocesses[session_name].is_busy` -- if busy, DON'T suspend (Claude is mid-processing). Instead, reset the idle timer to try again later. Log this. Return.
- Store the subprocess PID for logging.
- Call `await subprocesses[session_name].terminate()` (existing method with SIGTERM + timeout + SIGKILL).
- Remove from `subprocesses` dict.
- Flush and remove batcher if exists: `if session_name in batchers: await batchers[session_name].flush_immediately(); del batchers[session_name]`
- Update session metadata: `session_manager.update_session(session_name, status='suspended', pid=None)`
- Cancel and remove idle timer: `if session_name in idle_timers: idle_timers[session_name].cancel(); del idle_timers[session_name]`
- Log: `logger.info(f"Session '{session_name}' suspended after idle timeout")`
- DECISION (from CONTEXT.md): Silent suspension -- do NOT send any Telegram message.

**Modify make_callbacks() -- add on_complete idle timer integration:**
- The `on_complete` callback already exists. Wrap it: after existing logic (stop typing), add idle timer reset:
  ```python
  # Reset idle timer (only start counting AFTER Claude finishes)
  if session_name in idle_timers:
      idle_timers[session_name].reset()
  ```
- This ensures timer only starts when Claude is truly idle, never during processing.

**Modify handle_message() -- add resume logic:**
- After checking for active session, BEFORE the subprocess check, add:
  ```python
  # Acquire lock to prevent race with suspend_session
  lock = get_subprocess_lock(active_session)
  async with lock:
      ...  # existing subprocess get-or-create and message send go here
  ```
  Wrap the subprocess get-or-create and message send in this lock.
- Inside the lock, when subprocess is not alive:
  1. Check if session has `.claude/` dir (has history). If yes, this is a resume.
  2. If resuming: send status message to user: `"Resuming session..."` (include idle duration if >1 min from metadata last_active). Example: `"Resuming session (idle for 15 min)..."`
  3. Spawn subprocess normally (the existing ClaudeSubprocess constructor + start() already handles --continue when .claude/ exists).
  4. Store PID in metadata: `session_manager.update_session(active_session, status='active', last_active=now_iso, pid=subprocesses[active_session].pid)`
- After sending message (outside lock), create the idle timer for the session if it does not exist yet:
  ```python
  timeout_secs = session_manager.get_session_timeout(active_session)
  if active_session not in idle_timers:
      idle_timers[active_session] = SessionIdleTimer(active_session, timeout_secs, on_timeout=suspend_session)
  # Don't reset here -- timer resets in on_complete when Claude finishes
  ```
- IMPORTANT: Also reset the idle timer when user sends a message (user activity should reset timer too, per CONTEXT.md):
  ```python
  if active_session in idle_timers:
      idle_timers[active_session].reset()
  ```
  Put this BEFORE sending to subprocess (so timer is reset even if message queues).

**Similarly update handle_photo() and handle_document():**
- Add the same lock acquisition, resume detection, and idle timer reset as handle_message().
- Keep the existing photo/document save and notification logic.

**Modify new_session() -- initialize idle timer after creation:**
- After subprocess creation, add:
  ```python
  timeout_secs = session_manager.get_session_timeout(name)
  idle_timers[name] = SessionIdleTimer(name, timeout_secs, on_timeout=suspend_session)
  ```
- Store PID in metadata: after subprocess is created/started, `session_manager.update_session(name, pid=subprocesses[name].pid)` (only after start()). Note: The existing code creates ClaudeSubprocess but does NOT call start() -- start happens lazily on first send_message. So PID tracking happens in handle_message when subprocess auto-starts.

**Modify switch_session_cmd():**
- Per CONTEXT.md LOCKED decision: switching sessions leaves previous subprocess running (it suspends on its own timer). Do NOT cancel old session's idle timer.
- When auto-spawning subprocess for new session, set up idle timer as above.

**Modify archive_session_cmd():**
- Cancel idle timer if exists: `if name in idle_timers: idle_timers[name].cancel(); del idle_timers[name]`
- Remove subprocess lock if exists: `subprocess_locks.pop(name, None)`

**Modify model_cmd():**
- After terminating subprocess for model change, cancel idle timer: `if active_session in idle_timers: idle_timers[active_session].cancel(); del idle_timers[active_session]`

**Startup cleanup function: `async def cleanup_orphaned_subprocesses()`**
- Called once at bot startup (before polling starts).
- Iterate all sessions via `session_manager.list_sessions()`.
- For each session with a non-None `pid`:
  1. Check if PID process exists: `os.kill(pid, 0)` wrapped in try/except ProcessLookupError.
  2. If process exists, verify it's a claude process: read `/proc/{pid}/cmdline`, check if "claude" is in it. If not claude, skip killing.
  3. If it IS a claude process: `os.kill(pid, signal.SIGTERM)`, sleep 2s, then try `os.kill(pid, signal.SIGKILL)` (catch ProcessLookupError if already dead).
  4. Update metadata: `session_manager.update_session(session['name'], pid=None, status='suspended')`
- For sessions with status != 'suspended' and no pid, also set status to 'suspended'.
- Log summary: "Cleaned up N orphaned subprocesses"

**Graceful shutdown:**
- python-telegram-bot's `Application.run_polling()` handles signal installation internally. Instead of overriding signal handlers (which conflicts with the library), use the `post_shutdown` callback:
  ```python
  async def post_shutdown(application):
      """Clean up subprocesses and timers on bot shutdown."""
      logger.info("Bot shutting down, cleaning up...")
      # Cancel all idle timers
      for name, timer in idle_timers.items():
          timer.cancel()
      # Terminate all subprocesses
      for name, proc in list(subprocesses.items()):
          if proc.is_alive:
              logger.info(f"Terminating subprocess for '{name}'")
              await proc.terminate()
      logger.info("Cleanup complete")
  ```
- Register in main(): `app.post_shutdown = post_shutdown`
- Also add a `post_init` callback for startup cleanup:
  ```python
  async def post_init(application):
      """Run startup cleanup."""
      await cleanup_orphaned_subprocesses()
  ```
  Register: `app = Application.builder().token(TOKEN).post_init(post_init).build()`

**Update help text:**
- Add `/timeout <minutes>` and `/sessions` to the help_command text under "Claude Sessions" section.

`python3 -c "import bot"` from telegram/ directory should not error (syntax check). Look for: idle_timers dict, subprocess_locks dict, suspend_session function, cleanup_orphaned_subprocesses function, post_shutdown callback.

- suspend_session() terminates subprocess on idle timeout, updates metadata to suspended, silent (no Telegram notification)
- handle_message() detects suspended session, sends "Resuming session..." status, spawns with --continue
- Race lock prevents concurrent suspend + resume on same session
- Startup cleanup kills orphaned PIDs verified via /proc/cmdline
- Graceful shutdown terminates all subprocesses and cancels all timers
- handle_photo/handle_document also support resume from suspended state

Task 2: /timeout and /sessions commands

telegram/bot.py

Add two new command handlers to bot.py:

**/timeout command: `async def timeout_cmd(update, context)`**
- Auth check (same pattern as other commands).
- If no active session: reply "No active session. Use /new to start one."
- If no args: show current timeout.
  ```python
  timeout_secs = session_manager.get_session_timeout(active_session)
  minutes = timeout_secs // 60
  await update.message.reply_text(f"Idle timeout: {minutes} minutes\n\nUsage: /timeout <minutes> (1-120)")
  ```
- If args: parse first arg as int.
  - Validate range 1-120. If out of range: `"Timeout must be between 1 and 120 minutes"`
  - If not a valid int: `"Invalid number. Usage: /timeout <minutes>"`
  - Convert to seconds: `timeout_seconds = minutes * 60`
  - Update session metadata: `session_manager.update_session(active_session, idle_timeout=timeout_seconds)`
  - If idle timer exists for this session, update its timeout_seconds attribute and reset: `idle_timers[active_session].timeout_seconds = timeout_seconds; idle_timers[active_session].reset()`
  - Reply: `f"Idle timeout set to {minutes} minutes for session '{active_session}'."`

**/sessions command: `async def sessions_cmd(update, context)`**
- Auth check.
- Get all sessions: `session_manager.list_sessions()` (already sorted by last_active desc).
- If empty: reply "No sessions. Use /new to create one."
- Build formatted list. For each session:
  - Status indicator: active subprocess running -> "LIVE", status == "active" (in metadata) -> "ACTIVE", status == "suspended" -> "IDLE", else -> status
  - Decide "LIVE" from the real subprocess state rather than from metadata: `name in subprocesses and subprocesses[name].is_alive` -> "LIVE"
  - Format last_active as relative time (e.g., "2m ago", "1h ago", "3d ago") using a small helper function:
    ```python
    from datetime import datetime, timezone

    def format_relative_time(iso_str):
        dt = datetime.fromisoformat(iso_str)
        if dt.tzinfo is None:
            # Assumption: naive timestamps in metadata are UTC
            dt = dt.replace(tzinfo=timezone.utc)
        delta = datetime.now(timezone.utc) - dt
        secs = delta.total_seconds()
        if secs < 60:
            return "just now"
        if secs < 3600:
            return f"{int(secs/60)}m ago"
        if secs < 86400:
            return f"{int(secs/3600)}h ago"
        return f"{int(secs/86400)}d ago"
    ```
  - Mark current active session with arrow prefix.
  - Format line: `"{marker}{status_emoji} {name} ({persona}) - {relative_time}"`
  - Status emojis: LIVE -> green circle, IDLE/suspended -> white circle
- Join lines, reply with parse_mode='Markdown'. Use backticks around session names for monospace.

**Register handlers in main():**
- `app.add_handler(CommandHandler("timeout", timeout_cmd))` -- after the model handler
- `app.add_handler(CommandHandler("sessions", sessions_cmd))` -- after the session handler

**Update help text in help_command():**
- Under "Claude Sessions" section, add:
  - `/sessions` - List all sessions with status
  - `/timeout <minutes>` - Set idle timeout (1-120)

`python3 -c "import bot; print('OK')"` succeeds. Grep for "timeout_cmd" and "sessions_cmd" in bot.py to confirm both exist. Grep for "CommandHandler.*timeout" and "CommandHandler.*sessions" to confirm registration.

- /timeout shows current timeout when called without args, sets timeout (1-120 min range) when called with arg
- /sessions lists all sessions sorted by last active, showing live/idle status, persona, relative time
- Both commands registered as handlers in main()
- Help text updated with new commands

1. `cd ~/homelab/telegram && python3 -c "import bot; print('All OK')"` -- no import errors
2. Grep for key integration points:
   - `grep -n "suspend_session" telegram/bot.py` -- suspend function exists
   - `grep -n "idle_timers" telegram/bot.py` -- idle timer dict used
   - `grep -n "subprocess_locks" telegram/bot.py` -- race locks exist
   - `grep -n "cleanup_orphaned" telegram/bot.py` -- startup cleanup exists
   - `grep -n "post_shutdown" telegram/bot.py` -- graceful shutdown exists
   - `grep -n "Resuming session" telegram/bot.py` -- resume status message exists
   - `grep -n "timeout_cmd\|sessions_cmd" telegram/bot.py` -- new commands exist
3. Restart bot service: `systemctl --user restart telegram-bot.service && sleep 2 && systemctl --user status telegram-bot.service` -- should show active

- Session auto-suspends after idle timeout (subprocess terminated, metadata status=suspended, no Telegram notification)
- Message to suspended session shows "Resuming session..." then Claude responds with full history
- If resume fails, error message sent (no auto-fresh-start)
- asyncio.Lock prevents race between timeout-fire and incoming message
- Bot startup kills orphaned subprocess PIDs (verified via /proc/cmdline)
- Bot shutdown terminates all subprocesses gracefully
- /timeout sets per-session idle timeout (1-120 range), shows current value without args
- /sessions lists all sessions with LIVE/IDLE status, persona, and relative last-active time
- Help text includes new commands
- Bot service restarts cleanly

After completion, create `.planning/phases/03-lifecycle-management/03-02-SUMMARY.md`
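
Reference sketch for the Task 1 race lock: the plan relies on one `asyncio.Lock` per session so that an idle-timeout suspend and an incoming user message never interleave. The following is a minimal, self-contained illustration of that pattern, not the bot's actual code. The `state` and `events` containers are hypothetical stand-ins for session metadata and logging, and the `asyncio.sleep` calls stand in for `terminate()` and for spawning with `--continue`.

```python
import asyncio

# Hypothetical stand-ins for the bot's globals; names mirror the plan.
subprocess_locks: dict[str, asyncio.Lock] = {}
state = {"demo": "active"}  # session name -> status (assumed shape)
events: list[str] = []      # records the order in which each path runs


def get_subprocess_lock(session_name: str) -> asyncio.Lock:
    """Return the session's lock, creating it on first use."""
    return subprocess_locks.setdefault(session_name, asyncio.Lock())


async def suspend_session(session_name: str) -> None:
    """Idle-timeout path: terminate and mark suspended while holding the lock."""
    async with get_subprocess_lock(session_name):
        events.append("suspend:start")
        await asyncio.sleep(0.01)  # stands in for awaiting terminate()
        state[session_name] = "suspended"
        events.append("suspend:end")


async def handle_message(session_name: str) -> None:
    """User-message path: resume if suspended, then send, under the same lock."""
    async with get_subprocess_lock(session_name):
        events.append("message:start")
        if state[session_name] == "suspended":
            await asyncio.sleep(0.01)  # stands in for spawning with --continue
            state[session_name] = "active"
        events.append("message:end")


async def main() -> None:
    # Fire both paths at once; the lock forces one to finish before the other starts.
    await asyncio.gather(suspend_session("demo"), handle_message("demo"))


asyncio.run(main())
print(events, state)
```

Because both paths take the same per-session lock, the message handler only ever sees a fully suspended or fully active session, never a subprocess that is halfway through termination.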