docs(03): research phase domain
Phase 03: lifecycle-management
- Process lifecycle patterns (suspend/resume)
- Asyncio idle timeout detection
- Graceful shutdown strategies
- SIGTERM vs SIGSTOP tradeoffs
- Claude Code --continue for resumption

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent 134124f04e
commit 8f7b67a91b
1 changed file with 951 additions and 0 deletions
.planning/phases/03-lifecycle-management/03-RESEARCH.md (normal file, 951 lines)
@@ -0,0 +1,951 @@

# Phase 3: Lifecycle Management - Research

**Researched:** 2026-02-04
**Domain:** Process lifecycle (suspend/resume), asyncio idle timeout detection, graceful shutdown patterns, Claude Code `--continue` flag
**Confidence:** HIGH

## Summary

Phase 3 implements automatic session suspension after a configurable idle timeout and transparent resumption with full conversation history. The core technical challenges are: (1) detecting true idle state (no user messages AND no Claude activity), (2) choosing between SIGSTOP/SIGCONT (pause in place) and SIGTERM + `--continue` (terminate and restart), and (3) graceful cleanup on bot restart to prevent leftover (orphaned or zombie) processes.

Research confirms that asyncio provides robust timeout primitives (`asyncio.Event`, `asyncio.wait_for`, `asyncio.create_task`) for per-session idle timers. Claude Code's `--continue` flag already handles session resumption from `.claude/` state in the session directory, so no separate `--resume` flag is needed when persistent subprocesses always run in one directory. The critical decision is the suspension method: SIGSTOP/SIGCONT avoids spawn overhead but keeps memory allocated, while SIGTERM + restart releases memory at the cost of respawning the process.

Key findings:

1. Idle detection requires tracking both user message time AND Claude completion time to avoid suspending mid-processing.
2. SIGSTOP/SIGCONT keeps process memory allocated but saves ~1s of restart overhead.
3. SIGTERM + `--continue` is safer for long idle periods (releases memory, prevents stale state).
4. Graceful shutdown requires signal handlers that cancel idle timer tasks and terminate subprocesses with a timeout and a SIGKILL fallback.

**Primary recommendation:** Use the SIGTERM + restart approach for suspension. Track the last activity timestamp per session. After the idle timeout, terminate the subprocess gracefully (SIGTERM with 5s timeout, SIGKILL fallback). On the next user message, spawn a fresh subprocess with `--continue` to restore context. This balances memory efficiency (released during idle) with a reasonable restart cost (~1s). Store the timeout value in session metadata for per-session configuration.

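The idle rule from the key findings (never suspend mid-processing, and measure idleness from the later of the last user message and the last completion) can be sketched as a pure function. This is a minimal sketch: `SessionActivity`, `is_idle`, and the field names are hypothetical and do not come from the bot's code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class SessionActivity:
    """Hypothetical per-session activity record (names are illustrative)."""
    last_user_message: datetime
    last_completion: datetime
    is_busy: bool  # True while Claude is processing a turn


def is_idle(activity: SessionActivity, timeout_seconds: float, now: datetime) -> bool:
    """Return True only if the session is not busy and both timestamps are stale."""
    if activity.is_busy:
        return False  # never suspend mid-processing
    last_activity = max(activity.last_user_message, activity.last_completion)
    return (now - last_activity) >= timedelta(seconds=timeout_seconds)
```

Passing `now` in explicitly keeps the check deterministic and easy to unit-test; a caller would typically pass `datetime.now(timezone.utc)`.
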
## Standard Stack

The established libraries/tools for this domain:

### Core

| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| asyncio | stdlib (3.12+) | Timeout detection, task scheduling, signal handling | Native async primitives for idle timers, event-based cancellation |
| Claude Code CLI | 2.1.31+ | Session resumption via --continue | Built-in session state persistence to `.claude/` directory |
| signal (stdlib) | stdlib | SIGTERM/SIGKILL for graceful shutdown | Standard Unix signal handling for process termination |

### Supporting

| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| datetime (stdlib) | stdlib | Last activity timestamps | Track idle periods per session |
| json (stdlib) | stdlib | Session metadata updates | Store timeout configuration per session |

### Alternatives Considered

| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| SIGTERM + restart | SIGSTOP/SIGCONT | Pause keeps memory but saves 1s restart; terminate releases memory but costs CPU |
| Per-session timers | Global timeout for all sessions | Per-session allows custom timeouts (long for task sessions, short for chat) |
| asyncio.Event cancellation | Thread-based timers | asyncio integrates cleanly with subprocess management, threads add complexity |

**Installation:**

```bash
# All components are stdlib or already installed
python3 --version  # 3.12+ required for modern asyncio
claude --version   # 2.1.31 (already installed)
```

## Architecture Patterns

### Recommended Lifecycle State Machine

```
Session States:
├── Created (no subprocess) → User message → Active
├── Active (subprocess running, processing) → Completion → Idle
├── Idle (subprocess running, waiting) → Timeout → Suspended
├── Suspended (no subprocess) → User message → Active (restart)
└── Any state → Bot restart → Suspended (cleanup)

Idle Timer:
- Starts: After Claude completion event (subprocess.on_complete)
- Resets: On user message OR Claude starts processing
- Fires: After idle_timeout seconds of inactivity
- Action: Terminate subprocess (SIGTERM, 5s timeout, SIGKILL fallback)
```

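The state machine above can be sketched as a small transition table. The enum, the event names, and `next_state` are hypothetical illustrations; the Idle → Active transition on a user message is not drawn explicitly in the diagram and is assumed here from the idle timer notes.

```python
from enum import Enum


class SessionState(Enum):
    CREATED = "created"
    ACTIVE = "active"
    IDLE = "idle"
    SUSPENDED = "suspended"


# (current state, event) -> next state; event names are illustrative
TRANSITIONS = {
    (SessionState.CREATED, "user_message"): SessionState.ACTIVE,
    (SessionState.ACTIVE, "completion"): SessionState.IDLE,
    (SessionState.IDLE, "user_message"): SessionState.ACTIVE,  # assumed
    (SessionState.IDLE, "timeout"): SessionState.SUSPENDED,
    (SessionState.SUSPENDED, "user_message"): SessionState.ACTIVE,
}


def next_state(state: SessionState, event: str) -> SessionState:
    """Apply one event; a bot restart suspends the session from any state."""
    if event == "bot_restart":
        return SessionState.SUSPENDED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"invalid transition: {state.name} on {event!r}") from None
```

Rejecting unknown transitions (for example a timeout while Active) makes the "never suspend during processing" rule an explicit error instead of a silent state change.
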
### Pattern 1: Per-Session Idle Timer with asyncio

**What:** Track the last activity timestamp, spawn a background task that waits for the timeout, cancel it on activity
**When to use:** Start after each message completion; stop when a new message arrives
**Example:**

```python
# Source: https://docs.python.org/3/library/asyncio-task.html
import asyncio
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


class SessionIdleTimer:
    """Manages idle timeout for a session."""

    def __init__(
        self,
        session_name: str,
        timeout_seconds: int,
        on_timeout: Callable[[str], Awaitable[None]],
    ):
        self.session_name = session_name
        self.timeout_seconds = timeout_seconds
        self.on_timeout = on_timeout
        self._timer_task: Optional[asyncio.Task] = None
        self._last_activity = datetime.now(timezone.utc)

    def reset(self):
        """Reset idle timer on activity (must run inside the event loop)."""
        self._last_activity = datetime.now(timezone.utc)

        # Cancel existing timer
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()

        # Start new timer
        self._timer_task = asyncio.create_task(self._wait_for_timeout())

    async def _wait_for_timeout(self):
        """Wait for timeout duration, then fire callback."""
        try:
            await asyncio.sleep(self.timeout_seconds)

            # Timeout reached - fire callback
            await self.on_timeout(self.session_name)
        except asyncio.CancelledError:
            # Timer was reset by activity
            pass

    def cancel(self):
        """Cancel idle timer on session shutdown."""
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()


# Usage in bot (get_session_timeout and suspend_session are defined elsewhere)
idle_timers: dict[str, SessionIdleTimer] = {}


async def on_message_complete(session_name: str):
    """Called when Claude finishes processing."""
    # Start idle timer after completion
    if session_name not in idle_timers:
        timeout = get_session_timeout(session_name)  # From metadata
        idle_timers[session_name] = SessionIdleTimer(
            session_name,
            timeout,
            on_timeout=suspend_session,
        )

    idle_timers[session_name].reset()


async def on_user_message(session_name: str, message: str):
    """Called when user sends message."""
    # Stop the countdown while Claude is processing; on_message_complete()
    # restarts it once the turn finishes (see Pitfall 2)
    if session_name in idle_timers:
        idle_timers[session_name].cancel()

    # Send to Claude...
```

### Pattern 2: Graceful Subprocess Termination

**What:** Send SIGTERM, wait for clean exit with timeout, SIGKILL if needed
**When to use:** Suspending session, bot shutdown, session archival
**Example:**

```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/
import asyncio
import logging

logger = logging.getLogger(__name__)


async def terminate_subprocess_gracefully(
    process: asyncio.subprocess.Process,
    timeout: int = 5
) -> None:
    """
    Terminate subprocess with graceful shutdown.

    1. Close stdin to signal end of input
    2. Send SIGTERM for graceful shutdown
    3. Wait up to timeout seconds
    4. SIGKILL if still running
    5. Always reap process to prevent zombie
    """
    if not process or process.returncode is not None:
        return  # Already terminated

    try:
        # Close stdin to signal no more input
        if process.stdin:
            process.stdin.close()
            await process.stdin.wait_closed()

        # Send SIGTERM for graceful shutdown
        process.terminate()

        # Wait for clean exit
        try:
            await asyncio.wait_for(process.wait(), timeout=timeout)
            logger.info(f"Process {process.pid} terminated gracefully")
        except asyncio.TimeoutError:
            # Timeout - force kill
            logger.warning(f"Process {process.pid} did not terminate, sending SIGKILL")
            process.kill()
            await process.wait()  # CRITICAL: Always reap to prevent zombie
            logger.info(f"Process {process.pid} killed")

    except Exception as e:
        logger.error(f"Error terminating process: {e}")
        # Force kill as last resort
        try:
            process.kill()
            await process.wait()
        except ProcessLookupError:
            pass  # Process already exited
```

### Pattern 3: Session Resume with --continue

**What:** Spawn subprocess with `--continue` flag to restore conversation from `.claude/` state
**When to use:** First message after suspension, bot restart resuming active session
**Example:**

```python
# Source: https://code.claude.com/docs/en/cli-reference
async def resume_session(session_name: str) -> ClaudeSubprocess:
    """
    Resume suspended session by spawning subprocess with --continue.

    Claude Code automatically loads conversation history from .claude/
    directory in session folder.
    """
    session_dir = get_session_dir(session_name)
    persona = load_persona_for_session(session_name)

    # Check if .claude directory exists (has prior conversation)
    has_history = (session_dir / ".claude").exists()

    cmd = [
        'claude',
        '-p',
        '--input-format', 'stream-json',
        '--output-format', 'stream-json',
        '--verbose',
        '--dangerously-skip-permissions',
    ]

    # Add --continue if session has history
    if has_history:
        cmd.append('--continue')
        logger.info(f"Resuming session '{session_name}' with --continue")
    else:
        logger.info(f"Starting fresh session '{session_name}'")

    # Add persona settings (model, system prompt, etc)
    if persona:
        settings = persona.get('settings', {})
        if 'model' in settings:
            cmd.extend(['--model', settings['model']])
        if 'system_prompt' in persona:
            cmd.extend(['--append-system-prompt', persona['system_prompt']])

    # Spawn subprocess
    # NOTE: `cmd` is not passed below; this sketch assumes ClaudeSubprocess
    # assembles an equivalent command line (including --continue) itself.
    subprocess = ClaudeSubprocess(
        session_dir=session_dir,
        persona=persona,
        on_output=...,
        on_error=...,
        on_complete=lambda: on_message_complete(session_name),
        on_status=...,
        on_tool_use=...,
    )
    await subprocess.start()

    return subprocess
```

### Pattern 4: Bot Shutdown with Subprocess Cleanup

**What:** Signal handler to cancel all idle timers and terminate all subprocesses on SIGTERM/SIGINT
**When to use:** Bot stop, systemctl stop, Ctrl+C
**Example:**

```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/ +
#         https://github.com/wbenny/python-graceful-shutdown
import signal
import asyncio


async def shutdown(sig: signal.Signals, loop: asyncio.AbstractEventLoop):
    """
    Graceful shutdown handler for bot.

    1. Log signal received
    2. Cancel all idle timers
    3. Terminate all subprocesses gracefully
    4. Cancel all outstanding tasks
    5. Stop event loop
    """
    logger.info(f"Received exit signal {sig.name}")

    # Cancel all idle timers
    for timer in idle_timers.values():
        timer.cancel()

    # Terminate all active subprocesses
    termination_tasks = []
    for session_name, subprocess in subprocesses.items():
        if subprocess.is_alive:
            logger.info(f"Terminating subprocess for session '{session_name}'")
            termination_tasks.append(
                terminate_subprocess_gracefully(subprocess._process, timeout=5)
            )

    # Wait for all terminations to complete
    if termination_tasks:
        await asyncio.gather(*termination_tasks, return_exceptions=True)

    # Cancel all other tasks
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()

    # Wait for cancellation, ignore exceptions
    await asyncio.gather(*tasks, return_exceptions=True)

    # Stop the loop
    loop.stop()


# Install signal handlers on startup
def main():
    app = Application.builder().token(TOKEN).build()

    # Add signal handlers
    # NOTE: python-telegram-bot's run_polling() registers its own handlers for
    # its stop_signals by default; verify these handlers are not overridden.
    loop = asyncio.get_event_loop()
    signals = (signal.SIGTERM, signal.SIGINT)
    for sig in signals:
        loop.add_signal_handler(
            sig,
            lambda s=sig: asyncio.create_task(shutdown(s, loop))
        )

    # Start bot
    app.run_polling()
```

### Pattern 5: Session Metadata for Timeout Configuration

**What:** Store idle_timeout in session metadata, allow per-session customization via /timeout command
**When to use:** Session creation, /timeout command handler
**Example:**

```python
# Session metadata structure (stored as JSON; shown here as a Python dict)
session_metadata = {
    "name": "task-session",
    "created": "2026-02-04T12:00:00+00:00",
    "last_active": "2026-02-04T12:30:00+00:00",
    "persona": "default",
    "pid": None,
    "status": "suspended",
    "idle_timeout": 600,  # seconds (10 minutes)
}


# /timeout command handler
async def timeout_cmd(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """Set idle timeout for active session."""
    active = session_manager.get_active_session()
    if not active:
        await update.message.reply_text("No active session")
        return

    if not context.args:
        # Show current timeout
        metadata = session_manager.get_session(active)
        timeout = metadata.get('idle_timeout', 600)
        await update.message.reply_text(
            f"Current idle timeout: {timeout // 60} minutes\n\n"
            f"Usage: /timeout <minutes>"
        )
        return

    # Parse timeout value
    try:
        minutes = int(context.args[0])
        if minutes < 1 or minutes > 120:
            await update.message.reply_text("Timeout must be between 1 and 120 minutes")
            return

        timeout_seconds = minutes * 60
    except ValueError:
        await update.message.reply_text("Invalid number. Usage: /timeout <minutes>")
        return

    # Update session metadata
    session_manager.update_session(active, idle_timeout=timeout_seconds)

    # Restart idle timer with new timeout
    if active in idle_timers:
        idle_timers[active].timeout_seconds = timeout_seconds
        idle_timers[active].reset()

    await update.message.reply_text(f"Idle timeout set to {minutes} minutes")
```

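The validation step of the `/timeout` handler can be pulled out into a pure function so it can be tested without a Telegram update object. This is a sketch only: `parse_timeout_minutes` and the two constants are hypothetical names, and the 1 to 120 minute bounds are taken from the handler above.

```python
MIN_MINUTES = 1
MAX_MINUTES = 120


def parse_timeout_minutes(raw: str) -> int:
    """Convert a /timeout argument in minutes to seconds, or raise ValueError."""
    minutes = int(raw)  # raises ValueError for non-numeric input
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        raise ValueError(
            f"Timeout must be between {MIN_MINUTES} and {MAX_MINUTES} minutes"
        )
    return minutes * 60
```

The handler would then catch `ValueError` once and reply with the error text, instead of mixing parsing and replies.
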
### Pattern 6: /sessions Command with Status Display

**What:** List all sessions with name, status, persona, last active time, sorted by activity
**When to use:** User wants to see session overview
**Example:**

```python
from datetime import datetime


async def sessions_cmd(update: Update, context: ContextTypes.DEFAULT_TYPE):
    """List all sessions sorted by last activity."""
    sessions = session_manager.list_sessions()

    if not sessions:
        await update.message.reply_text("No sessions found. Use /new <name> to create one.")
        return

    active_session = session_manager.get_active_session()

    # Build formatted list
    lines = ["*Sessions:*\n"]
    for session in sessions:  # Already sorted by last_active
        name = session['name']
        status = session['status']
        persona = session.get('persona', 'default')
        last_active = session.get('last_active', 'unknown')

        # Format timestamp
        try:
            dt = datetime.fromisoformat(last_active)
            time_str = dt.strftime('%Y-%m-%d %H:%M')
        except (TypeError, ValueError):
            time_str = 'unknown'

        # Mark active session
        marker = "→ " if name == active_session else "  "

        # Status emoji
        emoji = "🟢" if status == "active" else "🔵" if status == "idle" else "⚪"

        lines.append(
            f"{marker}{emoji} `{name}` ({persona})\n"
            f"   {time_str}"
        )

    await update.message.reply_text("\n".join(lines), parse_mode='Markdown')
```

### Anti-Patterns to Avoid

- **Suspending during processing:** Never suspend while `subprocess.is_busy` is True; in-progress work will be lost
- **Not touching the timer on user message:** If the idle timer is only restarted on completion, it can fire while a newly arrived message is being handled
- **Orphaned processes on bot crash:** Without signal handlers, the subprocess outlives the bot and keeps running as an orphan
- **SIGSTOP without resource consideration:** Paused processes hold memory, file handles, network sockets, which is unsafe for long idle periods
- **Shared idle timer for all sessions:** Different sessions have different needs (task vs chat), per-session timeout is more flexible

## Don't Hand-Roll

Problems that look simple but have existing solutions:

| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Idle timeout detection | Manual timestamp checks in loop | Cancellable `asyncio.create_task()` + `asyncio.sleep()` | Task cancellation is cleaner, no polling overhead |
| Graceful shutdown | Just process.terminate() | SIGTERM + timeout + SIGKILL pattern | Prevents zombie processes, handles hung processes |
| Per-object timers | Single global timeout thread | asyncio.create_task per session | Native async integration, automatic cleanup |
| Resume conversation | Manual state serialization | Claude Code --continue flag | Built-in, tested, handles all edge cases |

**Key insight:** Process lifecycle management has subtle races (subprocess dies mid-shutdown, signal arrives during cleanup, timer fires after cancellation). Using battle-tested patterns (signal handlers, timeout with fallback, task cancellation) prevents these races. Don't reinvent async subprocess management.

## Common Pitfalls

### Pitfall 1: Race Between Timer Fire and User Message

**What goes wrong:** The idle timer fires and starts terminating the subprocess, a user message arrives during termination, a new subprocess spawns while the old one is still dying, and two subprocesses run at once
**Why it happens:** Timer callback and message handler run concurrently. No synchronization between timer firing and subprocess state change.
**How to avoid:** Use asyncio.Lock around subprocess state transitions (terminate, spawn). Timer callback acquires lock before terminating, message handler acquires lock before spawning.
**Warning signs:** Duplicate responses, sessions becoming unresponsive, "subprocess already running" errors

```python
# WRONG - No synchronization
async def on_timeout(session_name):
    await terminate_subprocess(session_name)

async def on_message(session_name, message):
    subprocess = await spawn_subprocess(session_name)
    await subprocess.send_message(message)

# RIGHT - Lock around transitions
import asyncio
from collections import defaultdict

# One lock per session, created on first access
subprocess_locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def on_timeout(session_name):
    async with subprocess_locks[session_name]:
        await terminate_subprocess(session_name)

async def on_message(session_name, message):
    async with subprocess_locks[session_name]:
        if not subprocess_exists(session_name):
            await spawn_subprocess(session_name)
        # Assumes spawn_subprocess() registers the instance in `subprocesses`
        subprocess = subprocesses[session_name]
        await subprocess.send_message(message)
```

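The effect of the per-session lock can be checked with a small self-contained simulation. This is a sketch under stated assumptions: `fake_terminate`, `fake_message`, `running`, and `events` are stand-ins invented for the demo, and `asyncio.sleep` replaces the real SIGTERM-and-wait sequence.

```python
import asyncio
from collections import defaultdict

# One lock per session, created on first use
locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
running: set[str] = set()   # sessions with a live (fake) subprocess
events: list[str] = []      # ordered log of state transitions


async def fake_terminate(name: str) -> None:
    """Stand-in for the idle-timeout callback."""
    async with locks[name]:
        if name in running:
            events.append("terminate:start")
            await asyncio.sleep(0.05)  # stands in for SIGTERM + wait()
            running.discard(name)
            events.append("terminate:done")


async def fake_message(name: str) -> None:
    """Stand-in for the user-message handler."""
    async with locks[name]:
        if name not in running:
            events.append("spawn")
            running.add(name)
        events.append("send")


async def demo() -> list[str]:
    running.add("s1")
    timer = asyncio.create_task(fake_terminate("s1"))
    await asyncio.sleep(0.01)   # message arrives mid-termination
    await fake_message("s1")
    await timer
    return events
```

Running `asyncio.run(demo())` records the spawn only after the termination has finished, so the old and new subprocess never overlap. Without the lock, the message handler would see the session as still running and send to a process that is about to exit.
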
### Pitfall 2: Terminating Subprocess During Tool Execution

**What goes wrong:** Claude is running a long tool (git clone, npm install), idle timer fires, subprocess terminated mid-operation, corrupted state
**Why it happens:** Idle timer only checks elapsed time since last message, doesn't check if subprocess is actively executing tools.
**How to avoid:** Track subprocess busy state (`is_busy` flag set during processing). Only start idle timer after `on_complete` callback fires (subprocess is truly idle).
**Warning signs:** Corrupted git repos, partial file writes, timeout errors from tools

```python
# WRONG - Timer starts immediately after message send
await subprocess.send_message(message)
idle_timers[session_name].reset()  # Bad: Claude still processing

# RIGHT - Timer starts after completion
await subprocess.send_message(message)
# ... subprocess processes, calls tools, emits result event ...
# on_complete callback fires
async def on_complete():
    idle_timers[session_name].reset()  # Good: Claude is truly idle
```

### Pitfall 3: Not Canceling Idle Timer on Session Switch

**What goes wrong:** After switching from session A to session B, session A's timer fires 5 minutes later and terminates session A's subprocess, even if the user has switched back to A in the meantime
**Why it happens:** Session switch doesn't cancel the old session's timer, so it continues running independently
**How to avoid:** When switching sessions, don't cancel the old timer; let it run. The old subprocess suspends on its own timer, which allows multiple concurrent sessions with independent lifetimes. Activity in a session restarts only that session's countdown, so switching back to A and using it keeps A alive.
**Warning signs:** Sessions suspend unexpectedly after switching away and back

```python
# CORRECT - Don't cancel old timer on switch
async def switch_session(new_session_name):
    old_session = get_active_session()

    # Don't touch old session's timer - let it suspend naturally
    # if old_session in idle_timers:
    #     idle_timers[old_session].cancel()  # NO

    set_active_session(new_session_name)

    # Start new session's timer if needed
    if new_session_name not in idle_timers:
        # Create timer for new session
        pass
```

### Pitfall 4: Subprocess Outlives Bot on Crash

**What goes wrong:** Bot crashes or is killed with SIGKILL, signal handlers never run, subprocesses become orphans, eat memory/CPU
**Why it happens:** SIGKILL can't be caught (by design), no cleanup code runs
**How to avoid:** Orphans after a SIGKILL can't be prevented entirely, but the impact can be minimized: (1) Store PID in session metadata, check on bot restart, (2) Use systemd with KillMode=control-group to kill all child processes, (3) Bot startup cleanup: scan for orphaned PIDs from metadata
**Warning signs:** Multiple claude processes running after bot restart, memory usage grows over time

```python
import asyncio
import os
import signal


# Startup cleanup - kill orphaned subprocesses
async def cleanup_orphaned_subprocesses():
    """Kill any subprocesses that outlived previous bot run."""
    sessions = session_manager.list_sessions()

    for session in sessions:
        pid = session.get('pid')
        if pid:
            # Check if process still exists
            try:
                os.kill(pid, 0)  # Signal 0 = check existence
                # Process exists - kill it
                logger.warning(f"Killing orphaned subprocess: PID {pid}")
                os.kill(pid, signal.SIGTERM)
                await asyncio.sleep(2)
                try:
                    os.kill(pid, signal.SIGKILL)
                except ProcessLookupError:
                    pass  # Already dead
            except ProcessLookupError:
                pass  # Already dead

            # Clear PID from metadata
            session_manager.update_session(session['name'], pid=None, status='suspended')
```

### Pitfall 5: Storing Stale PIDs in Metadata

**What goes wrong:** Session metadata shows pid=12345, but that subprocess has already terminated. On bot restart, the cleanup tries to kill PID 12345, which the OS may have reused for a different process.
**Why it happens:** Subprocess crashes or is manually killed, metadata not updated
**How to avoid:** Clear PID from metadata when subprocess terminates (exit code detected). Before killing PID from metadata, verify it's a claude process (check /proc/{pid}/cmdline on Linux).
**Warning signs:** Bot kills wrong processes on restart, random crashes

```python
import asyncio
import os
import signal


# Safe PID cleanup with verification
async def kill_subprocess_by_pid(pid: int):
    """Kill subprocess with PID verification."""
    try:
        # Verify it's a claude process (Linux-specific)
        cmdline_path = f"/proc/{pid}/cmdline"
        if os.path.exists(cmdline_path):
            with open(cmdline_path) as f:
                # Arguments are NUL-separated; make them readable
                cmdline = f.read().replace("\0", " ")
            if 'claude' not in cmdline:
                logger.warning(f"PID {pid} is not a claude process: {cmdline}")
                return  # Don't kill

        # Kill the process
        os.kill(pid, signal.SIGTERM)
        await asyncio.sleep(2)
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
    except ProcessLookupError:
        pass  # Already dead
    except Exception as e:
        logger.error(f"Error killing PID {pid}: {e}")
```

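A substring test such as `'claude' in cmdline` also matches unrelated processes whose arguments merely contain that text. A slightly stricter check, sketched below, parses the NUL-separated `/proc/<pid>/cmdline` contents and compares basenames of the first two arguments (interpreter and script). The function name and the two-argument heuristic are assumptions for illustration, and the result is still a heuristic rather than proof of process identity.

```python
import os


def cmdline_has_executable(cmdline: bytes, name: str) -> bool:
    """Return True if the program or its script argument is named `name`.

    `cmdline` is the raw content of /proc/<pid>/cmdline, where arguments
    are separated by NUL bytes.
    """
    args = [a.decode(errors="replace") for a in cmdline.split(b"\0") if a]
    # Check argv[0] and argv[1] to cover both a direct binary ("claude ...")
    # and an interpreter launching a script ("node /path/to/claude ...").
    return any(os.path.basename(arg) == name for arg in args[:2])
```

A caller would read the file in binary mode (`open(path, "rb")`) and skip the kill when the function returns `False`.
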
## Code Examples

Verified patterns from official sources:

### Complete Idle Timer Implementation

```python
# Source: https://docs.python.org/3/library/asyncio-task.html
import asyncio
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional


class SessionIdleTimer:
    """
    Per-session idle timeout manager.

    Tracks last activity, spawns background task to fire after timeout.
    Cancels and restarts timer on activity (reset).
    """

    def __init__(
        self,
        session_name: str,
        timeout_seconds: int,
        on_timeout: Callable[[str], Awaitable[None]]
    ):
        """
        Args:
            session_name: Session identifier
            timeout_seconds: Idle seconds before firing
            on_timeout: Async callback(session_name) to invoke on timeout
        """
        self.session_name = session_name
        self.timeout_seconds = timeout_seconds
        self.on_timeout = on_timeout
        self._timer_task: Optional[asyncio.Task] = None
        self._last_activity = datetime.now(timezone.utc)

    def reset(self):
        """Reset timer on activity (user message or completion)."""
        self._last_activity = datetime.now(timezone.utc)

        # Cancel existing timer
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()

        # Start fresh timer (requires a running event loop)
        self._timer_task = asyncio.create_task(self._wait_for_timeout())

    async def _wait_for_timeout(self):
        """Background task that waits for timeout duration."""
        try:
            await asyncio.sleep(self.timeout_seconds)

            # Timeout reached - invoke callback
            await self.on_timeout(self.session_name)
        except asyncio.CancelledError:
            # Timer was reset by activity
            pass

    def cancel(self):
        """Cancel timer on session shutdown."""
        if self._timer_task and not self._timer_task.done():
            self._timer_task.cancel()

    @property
    def seconds_since_activity(self) -> float:
        """Get seconds elapsed since last activity."""
        delta = datetime.now(timezone.utc) - self._last_activity
        return delta.total_seconds()
```

### Graceful Subprocess Termination with Timeout

```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/
import asyncio
import logging

logger = logging.getLogger(__name__)


async def terminate_subprocess_gracefully(
    process: asyncio.subprocess.Process,
    timeout: int = 5
) -> None:
    """
    Terminate subprocess with graceful shutdown sequence.

    1. Close stdin (signal end of input)
    2. Send SIGTERM (request graceful shutdown)
    3. Wait up to timeout seconds
    4. Send SIGKILL if still running (force kill)
    5. Always reap process (prevent zombie)

    Args:
        process: asyncio subprocess to terminate
        timeout: Seconds to wait before SIGKILL
    """
    if not process or process.returncode is not None:
        logger.debug("Process already terminated")
        return

    pid = process.pid
    logger.info(f"Terminating subprocess PID {pid}")

    try:
        # Close stdin to signal no more input
        if process.stdin and not process.stdin.is_closing():
            process.stdin.close()
            try:
                await process.stdin.wait_closed()
            except (BrokenPipeError, ConnectionResetError):
                pass  # Child already closed its end of the pipe

        # Send SIGTERM for graceful exit
        process.terminate()

        # Wait for clean exit with timeout
        try:
            await asyncio.wait_for(process.wait(), timeout=timeout)
            logger.info(f"Process {pid} terminated gracefully")
        except asyncio.TimeoutError:
            # Timeout - force kill
            logger.warning(f"Process {pid} did not exit within {timeout}s, sending SIGKILL")
            process.kill()
            await process.wait()  # CRITICAL: Reap to prevent zombie
            logger.info(f"Process {pid} killed")

    except Exception as e:
        logger.error(f"Error terminating process {pid}: {e}")
        # Last resort force kill
        try:
            process.kill()
            await process.wait()
        except ProcessLookupError:
            pass  # Process already exited
```

### Bot Shutdown Signal Handler

```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/ +
#         https://github.com/wbenny/python-graceful-shutdown
import signal
import asyncio
import logging

from telegram.ext import Application

logger = logging.getLogger(__name__)

# TOKEN, idle_timers, subprocesses and terminate_subprocess_gracefully are
# assumed to be defined at module level (see the earlier examples).


async def shutdown_handler(
    sig: signal.Signals,
    loop: asyncio.AbstractEventLoop,
    idle_timers: dict,
    subprocesses: dict
):
    """
    Graceful shutdown handler for bot.

    Invoked on SIGTERM/SIGINT to clean up before exit.

    Steps:
    1. Log signal received
    2. Cancel all idle timers
    3. Terminate all subprocesses with timeout
    4. Cancel all other asyncio tasks
    5. Stop event loop

    Args:
        sig: Signal that triggered shutdown
        loop: Event loop to stop
        idle_timers: Dict of SessionIdleTimer objects
        subprocesses: Dict of ClaudeSubprocess objects
    """
    # Step 1: Log signal received
    logger.info(f"Received exit signal {sig.name}, initiating graceful shutdown")

    # Step 2: Cancel all idle timers
    logger.info("Canceling idle timers...")
    for session_name, timer in idle_timers.items():
        timer.cancel()

    # Step 3: Terminate all active subprocesses
    logger.info("Terminating subprocesses...")
    termination_tasks = []
    for session_name, subprocess in subprocesses.items():
        if subprocess.is_alive:
            logger.info(f"Terminating subprocess for '{session_name}'")
            termination_tasks.append(
                terminate_subprocess_gracefully(subprocess._process, timeout=5)
            )

    # Wait for all terminations (with exceptions handled)
    if termination_tasks:
        await asyncio.gather(*termination_tasks, return_exceptions=True)

    # Step 4: Cancel all other asyncio tasks
    logger.info("Canceling remaining tasks...")
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()

    # Wait for cancellations, ignore exceptions
    await asyncio.gather(*tasks, return_exceptions=True)

    # Step 5: Stop event loop
    logger.info("Stopping event loop")
    loop.stop()


# Install signal handlers in main()
def main():
    """Bot entry point with signal handler installation."""
    app = Application.builder().token(TOKEN).build()

    # Get event loop
    loop = asyncio.get_event_loop()

    # Install signal handlers for graceful shutdown.
    # `s=sig` binds the current signal; a plain closure would see only the last one.
    signals_to_handle = (signal.SIGTERM, signal.SIGINT)
    for sig in signals_to_handle:
        loop.add_signal_handler(
            sig,
            lambda s=sig: asyncio.create_task(
                shutdown_handler(s, loop, idle_timers, subprocesses)
            )
        )

    logger.info("Signal handlers installed")

    # Start bot. run_polling() registers its own stop-signal handlers by default,
    # which would replace the ones above; stop_signals=None disables that.
    app.run_polling(stop_signals=None)
```

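The signal-handler wiring can be tried in isolation, without a Telegram bot, by having the process deliver SIGTERM to itself. The sketch below is illustrative only: `run_until_signal` is a hypothetical name, and it assumes a Unix event loop running in the main thread, which is where `loop.add_signal_handler` is supported.

```python
import asyncio
import os
import signal


async def run_until_signal() -> str:
    """Block until SIGTERM or SIGINT arrives, then return the signal's name."""
    loop = asyncio.get_running_loop()
    stop_requested = asyncio.Event()
    received = []

    def on_signal(sig: signal.Signals) -> None:
        # Runs inside the event loop, so touching asyncio objects here is safe.
        received.append(sig.name)
        stop_requested.set()

    handled = (signal.SIGTERM, signal.SIGINT)
    for sig in handled:
        loop.add_signal_handler(sig, on_signal, sig)

    # Demo only: send SIGTERM to this process shortly after start-up.
    loop.call_later(0.1, os.kill, os.getpid(), signal.SIGTERM)

    await stop_requested.wait()
    # Real cleanup (cancel idle timers, terminate subprocesses) would go here.

    for sig in handled:
        loop.remove_signal_handler(sig)
    return received[0]


if __name__ == "__main__":
    print(asyncio.run(run_until_signal()))
```

Setting an `asyncio.Event` from the handler, instead of spawning the cleanup task directly, keeps the shutdown sequence in one coroutine; either style works with `add_signal_handler`.
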
### Session Resume with Status Message

```python
# Source: https://code.claude.com/docs/en/cli-reference
from datetime import datetime, timezone

# session_manager, subprocesses, idle_timers, ClaudeSubprocess, SessionIdleTimer,
# make_callbacks, load_persona_for_session, on_completion and suspend_session
# are assumed to come from the surrounding bot module.


async def resume_suspended_session(
    bot,
    chat_id: int,
    session_name: str,
    message: str
) -> None:
    """
    Resume suspended session and send message.

    Sends brief status message to user, spawns subprocess with --continue,
    sends user's message to Claude.

    Args:
        bot: Telegram bot instance
        chat_id: Telegram chat ID
        session_name: Session to resume
        message: User message to send after resume
    """
    metadata = session_manager.get_session(session_name)

    # Calculate idle duration.
    # Assumes last_active was stored as a timezone-aware ISO 8601 string;
    # a naive timestamp would make the subtraction below raise TypeError.
    last_active = datetime.fromisoformat(metadata['last_active'])
    now = datetime.now(timezone.utc)
    idle_minutes = (now - last_active).total_seconds() / 60

    # Send status message
    if idle_minutes > 1:
        status_text = f"Resuming session (idle for {int(idle_minutes)} min)..."
    else:
        status_text = "Resuming session..."

    await bot.send_message(chat_id=chat_id, text=status_text)

    # Spawn subprocess with --continue
    session_dir = session_manager.get_session_dir(session_name)
    persona = load_persona_for_session(session_name)

    callbacks = make_callbacks(bot, chat_id, session_name)

    subprocess = ClaudeSubprocess(
        session_dir=session_dir,
        persona=persona,
        on_output=callbacks['on_output'],
        on_error=callbacks['on_error'],
        on_complete=lambda: on_completion(session_name),
        on_status=callbacks['on_status'],
        on_tool_use=callbacks['on_tool_use'],
    )

    await subprocess.start()
    subprocesses[session_name] = subprocess

    # Update metadata
    session_manager.update_session(
        session_name,
        status='active',
        last_active=now.isoformat(),
        pid=subprocess._process.pid
    )

    # Send user's message
    await subprocess.send_message(message)

    # Start idle timer
    timeout = metadata.get('idle_timeout', 600)
    idle_timers[session_name] = SessionIdleTimer(
        session_name,
        timeout,
        on_timeout=suspend_session
    )
```

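The idle-duration calculation is the one part of the resume path that can fail on data alone: subtracting a naive `datetime` from an aware one raises `TypeError`. A small sketch of a defensive version follows. The helper names are hypothetical, and treating naive timestamps as UTC is an assumption about how the metadata was written, not something the session format guarantees.

```python
from datetime import datetime, timezone


def idle_minutes(last_active_iso: str, now: datetime) -> float:
    """Minutes between a stored ISO 8601 timestamp and `now` (timezone-aware)."""
    last_active = datetime.fromisoformat(last_active_iso)
    if last_active.tzinfo is None:
        # Assumption: timestamps without an offset were written in UTC.
        last_active = last_active.replace(tzinfo=timezone.utc)
    return (now - last_active).total_seconds() / 60


def resume_status_text(minutes: float) -> str:
    """Brief status line; mentions the idle duration only above one minute."""
    if minutes > 1:
        return f"Resuming session (idle for {int(minutes)} min)..."
    return "Resuming session..."


if __name__ == "__main__":
    now = datetime(2026, 2, 4, 12, 15, tzinfo=timezone.utc)
    print(resume_status_text(idle_minutes("2026-02-04T12:00:00+00:00", now)))
```

Passing `now` in as a parameter, instead of calling `datetime.now()` inside the helper, makes the calculation easy to unit test.
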
## State of the Art

| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Manual timestamp polling | asyncio.Event + asyncio.sleep() | asyncio maturity (2020+) | Cleaner cancellation, no polling overhead |
| SIGKILL only | SIGTERM + timeout + SIGKILL fallback | Best practice evolution (2018+) | Prevents zombie processes, allows cleanup |
| Global timeout thread | Per-object asyncio tasks | Modern asyncio patterns (2022+) | Per-session configuration, native async integration |
| Manual state files | Claude Code --continue with .claude/ | Claude Code 2.0+ (2024) | Built-in, tested, handles edge cases |
| SIGSTOP/SIGCONT | SIGTERM + restart | Resource efficiency awareness (ongoing) | Releases memory during idle, safer for long periods |

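The last row of the table contrasts two ways to suspend a process. The sketch below shows both on a throwaway child process. It is an illustration under assumptions: the function names are hypothetical, it uses the synchronous `subprocess` module for brevity (`asyncio.subprocess.Process` offers the same `send_signal`, `terminate`, and `kill` methods), and it requires a POSIX system.

```python
import signal
import subprocess
import sys


def suspend_in_place(proc: subprocess.Popen) -> None:
    """Pause the process with SIGSTOP; its memory stays allocated."""
    proc.send_signal(signal.SIGSTOP)


def resume_in_place(proc: subprocess.Popen) -> None:
    """Continue a process previously paused with SIGSTOP."""
    proc.send_signal(signal.SIGCONT)


def suspend_by_termination(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Terminate the process so its memory is released; returns the exit code."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()  # reap after the forced kill


def demo() -> tuple:
    # Stand-in child process; a real session would be the Claude Code subprocess.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    suspend_in_place(proc)
    alive_while_stopped = proc.poll() is None  # stopped is not the same as exited
    resume_in_place(proc)
    return alive_while_stopped, suspend_by_termination(proc)


if __name__ == "__main__":
    print(demo())
```

A stopped process cannot act on SIGTERM until it is continued, which is one more reason a shutdown path should not rely on SIGTERM alone for sessions paused with SIGSTOP.
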
**Deprecated/outdated:**
- **Thread-based timers for async code:** Mixing threading with asyncio adds complexity; use `asyncio.create_task` instead
- **Blocking time.sleep() in async context:** It stalls the whole event loop; use `asyncio.sleep()` instead
- **Not reaping terminated subprocesses:** Always call `process.wait()` after terminating or killing to prevent zombies

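As an illustration of the task-based alternative to thread timers, here is a minimal per-session idle timer built on `asyncio.create_task` and `asyncio.sleep`. It is a sketch, not the `SessionIdleTimer` referenced elsewhere in this document: the `IdleTimer` class, its `reset` method, and the callback signature are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class IdleTimer:
    """Hypothetical per-session timer: awaits on_timeout(name) after `timeout` idle seconds."""

    def __init__(self, name: str, timeout: float,
                 on_timeout: Callable[[str], Awaitable[None]]) -> None:
        self.name = name
        self.timeout = timeout
        self.on_timeout = on_timeout
        self._task: Optional[asyncio.Task] = None

    def reset(self) -> None:
        """Restart the countdown; call on every user message or Claude completion."""
        self.cancel()
        self._task = asyncio.create_task(self._run())

    def cancel(self) -> None:
        """Stop the countdown without firing the callback."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self.timeout)  # non-blocking and cancellable
        await self.on_timeout(self.name)


async def demo() -> list:
    fired = []

    async def on_timeout(name: str) -> None:
        fired.append(name)

    timer = IdleTimer("work", 0.3, on_timeout)
    timer.reset()
    await asyncio.sleep(0.15)
    timer.reset()              # activity: the countdown starts over
    await asyncio.sleep(0.2)
    early = list(fired)        # 0.35s since start, but only 0.2s since the reset
    await asyncio.sleep(0.3)
    return [early, fired]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Cancelling the task raises `CancelledError` inside `asyncio.sleep`, so no flag checks or polling are needed to stop a countdown.
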
## Open Questions

Things that couldn't be fully resolved:

1. **Optimal default idle timeout**
   - What we know: Common ranges are 5-15 minutes for chat bots, longer for task automation
   - What's unclear: Where is the sweet spot between memory usage and restart friction?
   - Recommendation: Start with a 10-minute default. Allow per-session override via /timeout. Monitor actual usage patterns and adjust.

2. **SIGSTOP/SIGCONT vs SIGTERM tradeoff**
   - What we know: SIGSTOP keeps memory allocated but saves the restart cost (~1s); SIGTERM releases memory but costs CPU on restart
   - What's unclear: At what idle duration do the memory savings outweigh the restart cost?
   - Recommendation: Use the SIGTERM approach. Releasing memory matters more than a 1s restart cost, because Claude processes can grow large (100-500MB) with long conversations. SIGSTOP is only beneficial for idle periods under 5 minutes.

3. **Resume status message verbosity**
   - What we know: User decision says "brief status message on resume"
   - What's unclear: Should it show idle duration? Session name? Model?
   - Recommendation: Show idle duration if >1 minute ("Resuming session (idle for 15 min)..."). Don't show the session name (the user knows which session they messaged). Keep it brief.

4. **Multi-session concurrent subprocess limit**
   - What we know: Multiple sessions can have live subprocesses simultaneously
   - What's unclear: Should there be a cap? What if a user has 20 sessions all active?
   - Recommendation: No hard cap initially. Each subprocess uses ~100-500MB, so on an 8GB system 10-20 concurrent sessions is reasonable only while most stay well below the upper bound (16 sessions at 500MB would already consume 8GB). Add a warning in /sessions if >10 are active. Add a global concurrent limit (e.g., 15) in Phase 4 if needed.

5. **Session switch behavior for previous subprocess**
   - What we know: User decision says "switching leaves previous subprocess running"
   - What's unclear: Should switching reset the previous session's idle timer?
   - Recommendation: Don't reset on switch. The previous session's timer continues from its last activity. If it had been idle for 8 minutes when you switched away, it will suspend in 2 more minutes (with the 10-minute default). This is intuitive: switching doesn't "touch" the old session.

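Two of the recommendations above reduce to simple arithmetic, sketched here so the numbers can be checked. Both helper names are hypothetical, and the 500MB default is just the upper end of the per-process estimate quoted above, not a measured value.

```python
def seconds_until_suspend(idle_timeout_s: float, idle_for_s: float) -> float:
    """Time left before a session suspends, given how long it has already been idle."""
    return max(0.0, idle_timeout_s - idle_for_s)


def worst_case_memory_mb(active_sessions: int, per_process_mb: int = 500) -> int:
    """Upper-bound memory estimate if every live subprocess hits `per_process_mb`."""
    return active_sessions * per_process_mb


if __name__ == "__main__":
    # Question 5: idle for 8 of 10 minutes when the user switches away.
    print(seconds_until_suspend(600, 480))
    # Question 4: 16 sessions at the 500MB upper bound.
    print(worst_case_memory_mb(16))
```

Because switching sessions does not change the idle duration, the first helper needs no notion of which session is currently selected.
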
## Sources

### Primary (HIGH confidence)
- [Coroutines and Tasks - Python 3.14.3 Documentation](https://docs.python.org/3/library/asyncio-task.html) - Official asyncio timeout and task management
- [CLI reference - Claude Code Docs](https://code.claude.com/docs/en/cli-reference) - Official Claude Code --continue flag documentation
- [Graceful Shutdowns with asyncio - roguelynn](https://roguelynn.com/words/asyncio-graceful-shutdowns/) - Signal handlers and shutdown orchestration
- [python-graceful-shutdown - GitHub](https://github.com/wbenny/python-graceful-shutdown) - Complete example of shutdown patterns
- [Stopping and Resuming Processes with SIGSTOP and SIGCONT - TheLinuxCode](https://thelinuxcode.com/stop-process-using-sigstop-signal-linux/) - SIGSTOP/SIGCONT behavior and resource tradeoffs

### Secondary (MEDIUM confidence)
- [Session Management - Claude API Docs](https://platform.claude.com/docs/en/agent-sdk/sessions) - Session persistence patterns
- [SIGTERM, SIGKILL & SIGSTOP Signals - Medium](https://medium.com/@4techusage/sigterm-sigkill-sigstop-signals-63cb919431e8) - Signal comparison
- [A Complete Guide to Timeouts in Python - Better Stack](https://betterstack.com/community/guides/scaling-python/python-timeouts/) - Timeout mechanisms in Python

### Tertiary (LOW confidence)
- WebSearch results on asyncio subprocess management and idle detection patterns - Multiple sources, cross-referenced

## Metadata

**Confidence breakdown:**
- Standard stack: HIGH - All stdlib components, Claude Code CLI verified
- Architecture: HIGH - Patterns based on official asyncio docs and battle-tested libraries
- Pitfalls: MEDIUM-HIGH - Common races and edge cases documented, some based on general async patterns rather than lifecycle-specific sources

**Research date:** 2026-02-04
**Valid until:** 2026-03-04 (30 days - asyncio stdlib is stable, Claude Code --continue is established)