homelab/.planning/phases/03-lifecycle-management/03-02-PLAN.md
Mikkel Georgsen 88cd339a54 docs(03): create phase plan for lifecycle management
Phase 03: Lifecycle Management
- 2 plans in 2 waves
- Plan 01 (wave 1): Idle timer module + session metadata + PID tracking
- Plan 02 (wave 2): Suspend/resume wiring, /timeout, /sessions, startup cleanup, graceful shutdown
- Ready for execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 23:20:04 +00:00

---
phase: 03-lifecycle-management
plan: 02
type: execute
wave: 2
depends_on: ["03-01"]
files_modified: [telegram/bot.py]
autonomous: true
must_haves:
  truths:
    - "Session suspends automatically after idle timeout (subprocess terminated, status set to suspended)"
    - "User message to suspended session resumes it with --continue and shows 'Resuming session...' status"
    - "Resume failure sends error to user and does not auto-create fresh session"
    - "Race between timeout-fire and user-message is prevented by asyncio.Lock"
    - "Bot startup kills orphaned subprocess PIDs and sets all sessions to suspended"
    - "Bot shutdown terminates all subprocesses gracefully (SIGTERM + 5s timeout + SIGKILL)"
    - "/timeout <minutes> sets per-session idle timeout (1-120 range)"
    - "/sessions lists all sessions with status indicator, persona, and last active time"
  artifacts:
    - path: telegram/bot.py
      provides: "Suspend/resume wiring, idle timers, /timeout, /sessions, startup cleanup, graceful shutdown"
      contains: "idle_timers"
  key_links:
    - from: telegram/bot.py
      to: telegram/idle_timer.py
      via: "import and instantiate SessionIdleTimer per session"
      pattern: "from idle_timer import SessionIdleTimer"
    - from: telegram/bot.py on_complete callback
      to: idle_timer.reset()
      via: "Timer starts after Claude finishes processing"
      pattern: "idle_timers.*reset"
    - from: telegram/bot.py handle_message
      to: resume logic
      via: "Detect suspended session, spawn with --continue, send status"
      pattern: "Resuming session"
    - from: telegram/bot.py suspend_session
      to: ClaudeSubprocess.terminate()
      via: "Idle timer fires, terminates subprocess"
      pattern: "await.*terminate"
---

Wire suspend/resume lifecycle, idle timers, new commands, and cleanup into the bot.

Purpose: This is the core integration plan. It makes sessions suspend automatically after the idle timeout, resume transparently on the next user message, and adds the /timeout and /sessions commands. It also adds startup orphan cleanup and graceful shutdown handling.

Output: Updated bot.py with full lifecycle management

<execution_context> @/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md @/home/mikkel/.claude/get-shit-done/templates/summary.md </execution_context>

@.planning/PROJECT.md @.planning/ROADMAP.md @.planning/STATE.md @.planning/phases/03-lifecycle-management/03-CONTEXT.md @.planning/phases/03-lifecycle-management/03-RESEARCH.md @.planning/phases/03-lifecycle-management/03-01-SUMMARY.md @telegram/bot.py @telegram/idle_timer.py @telegram/session_manager.py @telegram/claude_subprocess.py

Task 1: Suspend/resume wiring with race locks, startup cleanup, and graceful shutdown

Files: telegram/bot.py

This is the core lifecycle wiring in bot.py. Make these changes:

New imports and globals:

  • import signal, os (for shutdown handlers and PID checks)
  • from idle_timer import SessionIdleTimer
  • Add global dict: idle_timers: dict[str, SessionIdleTimer] = {}
  • Add global dict: subprocess_locks: dict[str, asyncio.Lock] = {} (one lock per session, prevents races between timeout-fire and user-message)

Helper: get_subprocess_lock(session_name)

  • Returns existing lock or creates new one for session. Pattern: subprocess_locks.setdefault(session_name, asyncio.Lock())
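
The helper above could look roughly like the sketch below. Note that the setdefault pattern builds a throwaway asyncio.Lock() on every call; the explicit lookup here avoids that, but either form behaves the same. The module-level dict mirrors the subprocess_locks global described above.

```python
import asyncio

# One lock per session; mirrors the subprocess_locks global in this plan.
subprocess_locks: dict[str, asyncio.Lock] = {}


def get_subprocess_lock(session_name: str) -> asyncio.Lock:
    """Return the session's lock, creating it on first use."""
    lock = subprocess_locks.get(session_name)
    if lock is None:
        lock = asyncio.Lock()
        subprocess_locks[session_name] = lock
    return lock
```

Because the bot runs on a single event loop and this function never awaits, two callers cannot interleave between the lookup and the insert.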

Suspend function: async def suspend_session(session_name: str)

  • This is the idle timer's on_timeout callback.
  • Acquire the session's subprocess lock.
  • Check if subprocess exists and is_alive. If not alive, just update metadata and return.
  • Check subprocesses[session_name].is_busy -- if busy, DON'T suspend (Claude is mid-processing). Instead, reset the idle timer to try again later. Log this. Return.
  • Store the subprocess PID for logging.
  • Call await subprocesses[session_name].terminate() (existing method with SIGTERM + timeout + SIGKILL).
  • Remove from subprocesses dict.
  • Flush and remove batcher if exists: if session_name in batchers: await batchers[session_name].flush_immediately(); del batchers[session_name]
  • Update session metadata: session_manager.update_session(session_name, status='suspended', pid=None)
  • Cancel and remove idle timer: if session_name in idle_timers: idle_timers[session_name].cancel(); del idle_timers[session_name]
  • Log: logger.info(f"Session '{session_name}' suspended after idle timeout")
  • DECISION (from CONTEXT.md): Silent suspension -- do NOT send any Telegram message.
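
A minimal sketch of the suspend flow listed above, written against stand-ins so it runs on its own. FakeSubprocess, FakeTimer and the metadata dict are hypothetical substitutes for ClaudeSubprocess, SessionIdleTimer and session_manager; the batcher flush step is left out for brevity.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# Hypothetical stand-ins for the bot's globals.
subprocesses: dict = {}
idle_timers: dict = {}
subprocess_locks: dict = {}
metadata: dict = {}  # stand-in for session_manager's stored metadata


class FakeSubprocess:
    """Stand-in for ClaudeSubprocess exposing only what suspend needs."""

    def __init__(self, pid: int, is_busy: bool = False):
        self.pid = pid
        self.is_alive = True
        self.is_busy = is_busy

    async def terminate(self) -> None:
        self.is_alive = False


class FakeTimer:
    """Stand-in for SessionIdleTimer that records calls."""

    def __init__(self):
        self.resets = 0
        self.cancelled = False

    def reset(self) -> None:
        self.resets += 1

    def cancel(self) -> None:
        self.cancelled = True


async def suspend_session(session_name: str) -> None:
    lock = subprocess_locks.setdefault(session_name, asyncio.Lock())
    async with lock:
        proc = subprocesses.get(session_name)
        if proc is None or not proc.is_alive:
            # Nothing to terminate; just record the suspended state.
            metadata[session_name] = {"status": "suspended", "pid": None}
            return
        if proc.is_busy:
            # Claude is mid-processing: try again after another idle period.
            logger.info("Session '%s' busy, deferring suspend", session_name)
            if session_name in idle_timers:
                idle_timers[session_name].reset()
            return
        pid = proc.pid
        await proc.terminate()
        del subprocesses[session_name]
        metadata[session_name] = {"status": "suspended", "pid": None}
        timer = idle_timers.pop(session_name, None)
        if timer is not None:
            timer.cancel()
        # Silent suspension: log only, no Telegram message.
        logger.info("Session '%s' (pid %s) suspended after idle timeout",
                    session_name, pid)
```

Holding the lock for the whole function means a resume in handle_message either sees the live subprocess or sees it fully removed, never a half-terminated one.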

Modify make_callbacks() -- add on_complete idle timer integration:

  • The on_complete callback already exists. Wrap it: after existing logic (stop typing), add idle timer reset:
    # Reset idle timer (only start counting AFTER Claude finishes)
    if session_name in idle_timers:
        idle_timers[session_name].reset()
    
  • This ensures timer only starts when Claude is truly idle, never during processing.
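
To make the "starts counting only on reset()" behaviour concrete, here is a hypothetical sketch of a timer with the same surface this plan relies on (constructor arguments, reset(), cancel(), timeout_seconds). The real SessionIdleTimer comes from Plan 01 and may be implemented differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SessionIdleTimer:
    """Hypothetical sketch; not the Plan 01 implementation."""

    def __init__(self, session_name: str, timeout_seconds: float,
                 on_timeout: Callable[[str], Awaitable[None]]):
        self.session_name = session_name
        self.timeout_seconds = timeout_seconds
        self.on_timeout = on_timeout
        self._task: Optional[asyncio.Task] = None  # no countdown until reset()

    def reset(self) -> None:
        """(Re)start the countdown from zero."""
        self.cancel()
        self._task = asyncio.get_running_loop().create_task(self._run())

    def cancel(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self.timeout_seconds)
        # Detach before calling back, so a cancel() issued from inside the
        # callback does not cancel the task that is running it.
        self._task = None
        await self.on_timeout(self.session_name)
```

With this shape, constructing the timer in handle_message does nothing by itself; the countdown begins when on_complete calls reset().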

Modify handle_message() -- add resume logic:

  • After checking for active session, BEFORE the subprocess check, add:
    # Acquire lock to prevent race with suspend_session
    lock = get_subprocess_lock(active_session)
    async with lock:
    
    Wrap the subprocess get-or-create and message send in this lock.
  • Inside the lock, when subprocess is not alive:
    1. Check if session has .claude/ dir (has history). If yes, this is a resume.
    2. If resuming: send status message to user: "Resuming session..." (include idle duration if >1 min from metadata last_active). Example: "Resuming session (idle for 15 min)..."
    3. Spawn subprocess normally (the existing ClaudeSubprocess constructor + start() already handles --continue when .claude/ exists).
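    4. Store PID in metadata: session_manager.update_session(active_session, status='active', last_active=now_iso, pid=subprocesses[active_session].pid)
    5. If the spawn fails, send the error to the user and stop. Do NOT fall back to creating a fresh session.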
  • After sending message (outside lock), create the idle timer for the session if it does not exist yet:
    timeout_secs = session_manager.get_session_timeout(active_session)
    if active_session not in idle_timers:
        idle_timers[active_session] = SessionIdleTimer(active_session, timeout_secs, on_timeout=suspend_session)
    # Don't reset here -- timer resets in on_complete when Claude finishes
    
  • IMPORTANT: Also reset the idle timer when user sends a message (user activity should reset timer too, per CONTEXT.md):
    if active_session in idle_timers:
        idle_timers[active_session].reset()
    
    Put this BEFORE sending to subprocess (so timer is reset even if message queues).
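
The resume status text from step 2 can be built with a small pure function, sketched below. The function name and the assumption that last_active is an ISO 8601 string (naive values treated as UTC) are illustrative, not taken from the existing code.

```python
from datetime import datetime, timezone
from typing import Optional


def resume_status_text(last_active_iso: Optional[str],
                       now: Optional[datetime] = None) -> str:
    """Build the status line sent before a suspended session is respawned."""
    now = now or datetime.now(timezone.utc)
    if not last_active_iso:
        return "Resuming session..."
    last_active = datetime.fromisoformat(last_active_iso)
    if last_active.tzinfo is None:
        # Assumption: naive timestamps in metadata are UTC.
        last_active = last_active.replace(tzinfo=timezone.utc)
    idle_seconds = (now - last_active).total_seconds()
    if idle_seconds > 60:
        # Only mention the duration when the session idled for over a minute.
        return f"Resuming session (idle for {int(idle_seconds // 60)} min)..."
    return "Resuming session..."
```

Keeping this separate from handle_message lets handle_photo() and handle_document() reuse the same wording.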

Similarly update handle_photo() and handle_document():

  • Add the same lock acquisition, resume detection, and idle timer reset as handle_message().
  • Keep the existing photo/document save and notification logic.

Modify new_session() -- initialize idle timer after creation:

  • After subprocess creation, add:
    timeout_secs = session_manager.get_session_timeout(name)
    idle_timers[name] = SessionIdleTimer(name, timeout_secs, on_timeout=suspend_session)
    
  • Do not store the PID here. The existing code creates ClaudeSubprocess but does NOT call start(); the process starts lazily on the first send_message, so no PID exists yet. PID tracking happens in handle_message, which calls session_manager.update_session(name, pid=subprocesses[name].pid) once the subprocess has started.

Modify switch_session_cmd():

  • Per CONTEXT.md LOCKED decision: switching sessions leaves previous subprocess running (it suspends on its own timer). Do NOT cancel old session's idle timer.
  • When auto-spawning subprocess for new session, set up idle timer as above.

Modify archive_session_cmd():

  • Cancel idle timer if exists: if name in idle_timers: idle_timers[name].cancel(); del idle_timers[name]
  • Remove subprocess lock if exists: subprocess_locks.pop(name, None)

Modify model_cmd():

  • After terminating subprocess for model change, cancel idle timer: if active_session in idle_timers: idle_timers[active_session].cancel(); del idle_timers[active_session]

Startup cleanup function: async def cleanup_orphaned_subprocesses()

  • Called once at bot startup (before polling starts).
  • Iterate all sessions via session_manager.list_sessions().
  • For each session with a non-None pid:
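    1. Check if PID process exists: os.kill(pid, 0) wrapped in try/except ProcessLookupError. A PermissionError means the process exists but belongs to another user; treat it as not ours and skip killing.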
    2. If process exists, verify it's a claude process: read /proc/{pid}/cmdline, check if "claude" is in it. If not claude, skip killing.
    3. If it IS a claude process: os.kill(pid, signal.SIGTERM), await asyncio.sleep(2) (not time.sleep, which would block the event loop), then try os.kill(pid, signal.SIGKILL) (catch ProcessLookupError if already dead).
    4. Update metadata: session_manager.update_session(session['name'], pid=None, status='suspended')
  • For sessions with status != 'suspended' and no pid, also set status to 'suspended'.
  • Log summary: "Cleaned up N orphaned subprocesses"
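
The per-PID checks above can be sketched as three small helpers. The function names are hypothetical; the /proc/{pid}/cmdline read is Linux-specific, and the file separates arguments with NUL bytes.

```python
import asyncio
import os
import signal


def pid_exists(pid: int) -> bool:
    """Signal 0 probes for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but belongs to another user
    return True


def cmdline_is_claude(raw: bytes) -> bool:
    """True if any NUL-separated argument contains 'claude'."""
    return any(b"claude" in arg for arg in raw.split(b"\0") if arg)


def read_cmdline(pid: int) -> bytes:
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            return fh.read()
    except OSError:
        return b""  # process vanished or /proc is unavailable


async def kill_if_claude(pid: int, grace_seconds: float = 2.0) -> bool:
    """Return True if a claude process was signalled."""
    if not pid_exists(pid) or not cmdline_is_claude(read_cmdline(pid)):
        return False
    try:
        os.kill(pid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        return False
    await asyncio.sleep(grace_seconds)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # exited on its own after SIGTERM
    return True
```

The cmdline check guards against PID reuse: a stale PID in metadata may now belong to an unrelated process, which must not be killed.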

Graceful shutdown:

  • python-telegram-bot's Application.run_polling() handles signal installation internally. Instead of overriding signal handlers (which conflicts with the library), use the post_shutdown callback:
    async def post_shutdown(application):
        """Clean up subprocesses and timers on bot shutdown."""
        logger.info("Bot shutting down, cleaning up...")
    
        # Cancel all idle timers
        for name, timer in idle_timers.items():
            timer.cancel()
    
        # Terminate all subprocesses
        for name, proc in list(subprocesses.items()):
            if proc.is_alive:
                logger.info(f"Terminating subprocess for '{name}'")
                await proc.terminate()
    
        logger.info("Cleanup complete")
    
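  • Register in main(): app.post_shutdown = post_shutdown (or pass it to the builder with .post_shutdown(post_shutdown), in the same chain that registers post_init)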
  • Also add a post_init callback for startup cleanup:
    async def post_init(application):
        """Run startup cleanup."""
        await cleanup_orphaned_subprocesses()
    
    Register: app = Application.builder().token(TOKEN).post_init(post_init).build()
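
post_shutdown relies on each subprocess's terminate() doing SIGTERM, a 5s wait, then SIGKILL. ClaudeSubprocess.terminate() is not shown in this plan, so the following is only a sketch of that pattern on a plain asyncio subprocess; terminate_gracefully and demo are hypothetical names.

```python
import asyncio
import sys


async def terminate_gracefully(proc: asyncio.subprocess.Process,
                               timeout: float = 5.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL."""
    if proc.returncode is not None:
        return proc.returncode  # already exited
    proc.terminate()  # SIGTERM on POSIX
    try:
        return await asyncio.wait_for(proc.wait(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()  # SIGKILL on POSIX
        return await proc.wait()


async def demo() -> int:
    # A long-sleeping child stands in for a Claude subprocess.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(30)"
    )
    return await terminate_gracefully(proc)
```

On POSIX a process killed by a signal reports a negative return code equal to the signal number, which is a quick way to confirm in logs whether SIGTERM was enough.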

Update help text:

  • Add /timeout <minutes> and /sessions to the help_command text under "Claude Sessions" section.

Verify: python3 -c "import bot" from telegram/ directory should not error (syntax check). Look for: idle_timers dict, subprocess_locks dict, suspend_session function, cleanup_orphaned_subprocesses function, post_shutdown callback.

Done when:

  • suspend_session() terminates subprocess on idle timeout, updates metadata to suspended, silent (no Telegram notification)
  • handle_message() detects suspended session, sends "Resuming session..." status, spawns with --continue
  • Race lock prevents concurrent suspend + resume on same session
  • Startup cleanup kills orphaned PIDs verified via /proc/cmdline
  • Graceful shutdown terminates all subprocesses and cancels all timers
  • handle_photo/handle_document also support resume from suspended state

Task 2: /timeout and /sessions commands

Files: telegram/bot.py

Add two new command handlers to bot.py:

/timeout command: async def timeout_cmd(update, context)

  • Auth check (same pattern as other commands).
  • If no active session: reply "No active session. Use /new to start one."
  • If no args: show current timeout.
    timeout_secs = session_manager.get_session_timeout(active_session)
    minutes = timeout_secs // 60
    await update.message.reply_text(f"Idle timeout: {minutes} minutes\n\nUsage: /timeout <minutes> (1-120)")
    
  • If args: parse first arg as int.
    • Validate range 1-120. If out of range: "Timeout must be between 1 and 120 minutes"
    • If not a valid int: "Invalid number. Usage: /timeout <minutes>"
    • Convert to seconds: timeout_seconds = minutes * 60
    • Update session metadata: session_manager.update_session(active_session, idle_timeout=timeout_seconds)
    • If idle timer exists for this session, update its timeout_seconds attribute and reset: idle_timers[active_session].timeout_seconds = timeout_seconds; idle_timers[active_session].reset()
    • Reply: f"Idle timeout set to {minutes} minutes for session '{active_session}'."
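
The argument handling above can be isolated in a pure function, which keeps the handler short and makes the range check easy to test. This is a sketch; parse_timeout_arg is a hypothetical name, and args stands for the list python-telegram-bot exposes as context.args.

```python
from typing import List, Optional, Tuple

MIN_MINUTES = 1
MAX_MINUTES = 120


def parse_timeout_arg(args: List[str]) -> Tuple[Optional[int], Optional[str]]:
    """Return (timeout_seconds, None) on success or (None, error_text)."""
    try:
        minutes = int(args[0])
    except (IndexError, ValueError):
        return None, "Invalid number. Usage: /timeout <minutes>"
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        return None, "Timeout must be between 1 and 120 minutes"
    return minutes * 60, None
```

The handler would reply with the error text when it is not None, and otherwise store the returned seconds in metadata and on the live timer.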

/sessions command: async def sessions_cmd(update, context)

  • Auth check.
  • Get all sessions: session_manager.list_sessions() (already sorted by last_active desc).
  • If empty: reply "No sessions. Use /new to create one."
  • Build formatted list. For each session:
    • Status indicator: check the real subprocess state first. If name in subprocesses and subprocesses[name].is_alive -> "LIVE". Otherwise fall back to metadata: status == "active" -> "ACTIVE", status == "suspended" -> "IDLE", else -> the raw status value.
    • Format last_active as relative time (e.g., "2m ago", "1h ago", "3d ago") using a small helper function:
      def format_relative_time(iso_str):
          # Requires: from datetime import datetime, timezone
          dt = datetime.fromisoformat(iso_str)
          if dt.tzinfo is None:
              dt = dt.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
          delta = datetime.now(timezone.utc) - dt
          secs = delta.total_seconds()
          if secs < 60: return "just now"
          if secs < 3600: return f"{int(secs/60)}m ago"
          if secs < 86400: return f"{int(secs/3600)}h ago"
          return f"{int(secs/86400)}d ago"
      
    • Mark current active session with arrow prefix.
    • Format line: "{marker}{status_emoji} {name} ({persona}) - {relative_time}"
    • Status emojis: LIVE -> green circle, IDLE/suspended -> white circle
  • Join lines, reply with parse_mode='Markdown'. Use backticks around session names for monospace.
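
Putting the formatting rules above together, one line of the /sessions output could be built as in this sketch. The dict keys (name, persona, last_active), the arrow marker, and passing now explicitly are assumptions made for illustration and testability.

```python
from datetime import datetime, timezone


def format_relative_time(iso_str: str, now: datetime) -> str:
    dt = datetime.fromisoformat(iso_str)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    secs = (now - dt).total_seconds()
    if secs < 60:
        return "just now"
    if secs < 3600:
        return f"{int(secs // 60)}m ago"
    if secs < 86400:
        return f"{int(secs // 3600)}h ago"
    return f"{int(secs // 86400)}d ago"


def format_session_line(session: dict, is_live: bool, is_current: bool,
                        now: datetime) -> str:
    marker = "→ " if is_current else "  "  # arrow marks the active session
    emoji = "🟢" if is_live else "⚪"       # LIVE vs IDLE/suspended
    rel = format_relative_time(session["last_active"], now)
    # Backticks render the session name in monospace under Markdown.
    return f"{marker}{emoji} `{session['name']}` ({session['persona']}) - {rel}"
```

For example, a live current session named homelab with persona default that was last active two minutes ago renders as "→ 🟢 `homelab` (default) - 2m ago".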

Register handlers in main():

  • app.add_handler(CommandHandler("timeout", timeout_cmd)) -- after the model handler
  • app.add_handler(CommandHandler("sessions", sessions_cmd)) -- after the session handler

Update help text in help_command():

  • Under "Claude Sessions" section, add:
    • /sessions - List all sessions with status
    • /timeout <minutes> - Set idle timeout (1-120)

Verify: python3 -c "import bot; print('OK')" succeeds. Grep for "timeout_cmd" and "sessions_cmd" in bot.py to confirm both exist. Grep for "CommandHandler.*timeout" and "CommandHandler.*sessions" to confirm registration.

Done when:

  • /timeout shows current timeout when called without args, sets timeout (1-120 min range) when called with arg
  • /sessions lists all sessions sorted by last active, showing live/idle status, persona, relative time
  • Both commands registered as handlers in main()
  • Help text updated with new commands

Verification:

1. `cd ~/homelab/telegram && python3 -c "import bot; print('All OK')"` -- no import errors
2. Grep for key integration points:
   - `grep -n "suspend_session" telegram/bot.py` -- suspend function exists
   - `grep -n "idle_timers" telegram/bot.py` -- idle timer dict used
   - `grep -n "subprocess_locks" telegram/bot.py` -- race locks exist
   - `grep -n "cleanup_orphaned" telegram/bot.py` -- startup cleanup exists
   - `grep -n "post_shutdown" telegram/bot.py` -- graceful shutdown exists
   - `grep -n "Resuming session" telegram/bot.py` -- resume status message exists
   - `grep -n "timeout_cmd\|sessions_cmd" telegram/bot.py` -- new commands exist
3. Restart bot service: `systemctl --user restart telegram-bot.service && sleep 2 && systemctl --user status telegram-bot.service` -- should show active

<success_criteria>

  • Session auto-suspends after idle timeout (subprocess terminated, metadata status=suspended, no Telegram notification)
  • Message to suspended session shows "Resuming session..." then Claude responds with full history
  • If resume fails, error message sent (no auto-fresh-start)
  • asyncio.Lock prevents race between timeout-fire and incoming message
  • Bot startup kills orphaned subprocess PIDs (verified via /proc/cmdline)
  • Bot shutdown terminates all subprocesses gracefully
  • /timeout sets per-session idle timeout (1-120 range), shows current value without args
  • /sessions lists all sessions with LIVE/IDLE status, persona, and relative last-active time
  • Help text includes new commands
  • Bot service restarts cleanly

</success_criteria>

After completion, create `.planning/phases/03-lifecycle-management/03-02-SUMMARY.md`