homelab/.planning/research/PITFALLS.md
Mikkel Georgsen 1648a986bc docs: complete project research

Key findings:
- Stack: Python 3.12+ with python-telegram-bot 22.6, asyncio subprocess management
- Architecture: Path-based session routing with state machine lifecycle management
- Critical pitfall: Asyncio PIPE deadlock requires concurrent stdout/stderr draining

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:37:24 +00:00


Pitfalls Research

Domain: Telegram bot + long-running CLI subprocess management
Researched: 2026-02-04
Confidence: HIGH

Critical Pitfalls

Pitfall 1: Asyncio Subprocess PIPE Deadlock

What goes wrong: Using asyncio.create_subprocess_exec with stdout=PIPE and stderr=PIPE causes the subprocess to hang indefinitely when output buffers fill. The parent process awaits proc.wait() while the child blocks writing to the full pipe buffer, creating a classic deadlock. This is especially critical with Claude Code CLI which produces continuous streaming output.

Why it happens: OS pipe buffers are finite (typically 64KB on Linux). When the child process generates more output than the buffer can hold, it blocks on write(). If the parent isn't actively draining the pipe via proc.stdout.read(), the pipe fills and both processes wait forever - child waits for buffer space, parent waits for process exit.

How to avoid:

  • Use asyncio.create_task() to drain stdout/stderr concurrently while waiting for process
  • Or use proc.communicate() which handles draining automatically
  • Or redirect to files instead: stdout=open('log.txt', 'w') to bypass pipe limits
  • Never call proc.wait() when using PIPE without concurrent reading

Warning signs:

  • Bot hangs on specific commands that produce verbose output
  • Process remains in "S" state (sleeping) indefinitely
  • strace shows both processes blocked on read/write syscalls
  • Works with short output, hangs with verbose Claude responses

Phase to address: Phase 1: Core Subprocess Management - implement proper async draining patterns before any Claude integration.


Pitfall 2: Telegram API Rate Limit Cascade Failures

What goes wrong: When Claude Code generates output faster than Telegram allows sending (roughly 30 messages/second bot-wide, 20 messages/minute to the same group), messages queue up. Without backpressure handling, the bot triggers 429 Too Many Requests errors, gets rate-limited for increasing durations, and eventually the entire message queue fails. Users see partial responses or total silence.

Why it happens: Claude's streaming responses don't know or care about Telegram's rate limits. A single Claude interaction can produce hundreds of lines of output. Naive implementations send each chunk immediately, overwhelming Telegram's API and triggering automatic rate limiting that cascades to ALL bot operations, not just Claude responses.

How to avoid:

  • Implement message batching: accumulate output for 1-2 seconds before sending
  • Use telegram.ext.Application's built-in rate limiter (v20.x+)
  • Add exponential backoff with asyncio.sleep() on 429 errors
  • Track messages/second and throttle proactively before hitting limits
  • Consider chunking very long output and offering "download full log" instead
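
The batching approach can be sketched with a small accumulator (`BatchingSender` and the injected `send` callable are our names; wiring it to python-telegram-bot's `send_message` is left to the handler):

```python
import asyncio

class BatchingSender:
    """Accumulate output chunks and flush at most once per interval, so a
    fast producer (Claude) cannot outpace Telegram's rate limits.
    `send` is any async callable, e.g. a per-chat send_message wrapper."""

    def __init__(self, send, interval: float = 1.5):
        self._send = send
        self._interval = interval
        self._buffer: list[str] = []
        self._lock = asyncio.Lock()

    async def feed(self, chunk: str) -> None:
        async with self._lock:
            self._buffer.append(chunk)

    async def run(self) -> None:
        """Flush loop; cancel this task when the Claude process exits."""
        while True:
            await asyncio.sleep(self._interval)
            await self.flush()

    async def flush(self) -> None:
        async with self._lock:
            if not self._buffer:
                return
            text, self._buffer = "\n".join(self._buffer), []
        await self._send(text)  # one API call per interval, not per line
```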

Warning signs:

  • HTTP 429 errors in logs
  • Messages arrive in bursts after long delays
  • Bot becomes unresponsive to ALL commands during Claude sessions
  • Telegram sends "FloodWait" exceptions with increasing wait times

Phase to address: Phase 2: Telegram Integration - must be solved before exposing Claude streaming output to users.


Pitfall 3: Zombie Process Accumulation

What goes wrong: When the bot crashes, restarts, or kills processes improperly, Claude Code subprocesses are left behind: orphans that keep running (reparented to PID 1) and zombies that have exited but were never reaped. On a 4GB LXC container, a few orphaned Claude processes can exhaust memory. After days or weeks, dozens pile up.

Why it happens: Python's asyncio doesn't automatically clean up child processes on exception or when event loop closes. Calling proc.kill() without await proc.wait() leaves process in zombie state. systemd restarts don't adopt orphaned children. The Telegram bot's event loop may close while subprocesses are mid-execution.

How to avoid:

  • Always await proc.wait() after termination signals
  • Use try/finally to ensure cleanup even on exceptions
  • Configure systemd KillMode=control-group to kill entire process tree on restart
  • Implement graceful shutdown handler that waits for all subprocesses
  • Use process tracking: maintain dict of active PIDs, verify cleanup on startup

Warning signs:

  • ps aux | grep claude shows processes with different PPIDs or PPID=1
  • Memory usage creeps up over days without corresponding active sessions
  • Process count increases but active users count doesn't
  • defunct or <zombie> processes in process table

Phase to address: Phase 1: Core Subprocess Management - proper lifecycle management must be foundational.


Pitfall 4: Session State Corruption via Race Conditions

What goes wrong: When a user sends multiple Telegram messages rapidly while Claude is processing, concurrent writes to the session state file corrupt data. Session JSON becomes malformed, context is lost, Claude forgets conversation history mid-interaction. In worst case, file locking fails and two processes write simultaneously, producing invalid JSON that crashes the bot.

Why it happens: Telegram's async handlers run concurrently. Message 1 starts Claude subprocess, Message 2 arrives before Message 1 finishes, both try to update sessions/{user_id}.json. Python's file I/O isn't atomic - one write can partially overwrite another. json.dump() + f.write() is not atomic across asyncio tasks.

How to avoid:

  • Use asyncio.Lock per user: user_locks[user_id] ensures serial access to session state
  • Or use filelock library for cross-process file locking
  • Implement atomic writes: write to temp file, then os.rename() (atomic on POSIX)
  • Queue user messages: new message while Claude active goes to pending queue, processed after current finishes
  • Detect corruption: catch json.JSONDecodeError on read, backup corrupted file, start fresh session

Warning signs:

  • json.JSONDecodeError in logs
  • Users report "bot forgot our conversation"
  • Sporadic failures only when users type quickly
  • Session files contain partial/mixed JSON from multiple writes
  • File size is unexpectedly small (truncation during write)

Phase to address: Phase 3: Session Management - after basic subprocess handling works, before multi-user testing.


Pitfall 5: Claude Code CLI --resume Footgun

What goes wrong: Using --resume flag naively to continue sessions seems ideal, but leads to state divergence. The CLI's internal state (transcript, tool outputs, context window) drifts from what the bot thinks happened. Bot displays response A to user, but Claude's transcript shows response B due to regeneration during resume. Messages appear out of order or duplicated.

Why it happens: --resume replays the transcript from disk and may regenerate responses if conditions changed (model version updated, non-deterministic sampling). The bot's session state stores "what we showed the user", but Claude's resumed state reflects "what actually happened in the transcript". These diverge over time, especially with tool use where results may differ on replay.

How to avoid:

  • Avoid --resume entirely: start fresh subprocess per interaction, pass conversation history via stdin
  • Or implement "resume detection": compare Claude's first message after resume with expected cached response, warn on mismatch
  • Or treat --resume as read-only: use it to show transcript to user, but always start fresh for new input
  • Store transcript path in session state, verify hash/checksum before resume to detect corruption

Warning signs:

  • Users see repeated messages they already received
  • Bot shows different response than what Claude transcript contains
  • Tool use executes twice with different results
  • Resume succeeds but conversation context is wrong

Phase to address: Phase 4: Resume/Persistence - only after basic interaction flow is solid, requires deep understanding of transcript format.


Pitfall 6: Idle Timeout Race Condition

What goes wrong: Implementing "kill Claude after N minutes idle" creates a race: user sends message at T+599s, timeout fires at T+600s, both try to access the subprocess. Timeout calls proc.kill() while message handler calls proc.stdin.write(). Result: BrokenPipeError, message lost, user sees error instead of Claude response. In worse case, timeout cleanup runs mid-response, truncating output.

Why it happens: asyncio.wait_for() and timeout tasks don't coordinate with message arrival. The timeout coroutine has no knowledge that a new message just started processing. Both coroutines operate on shared subprocess state without synchronization. Telegram's async handlers run immediately on message arrival, possibly overlapping with timeout logic.

How to avoid:

  • Cancel timeout task BEFORE starting message processing: timeout_task.cancel() in message handler
  • Use asyncio.Lock to prevent timeout cleanup during active message handling
  • Implement "last activity" timestamp: timeout checks timestamp and skips cleanup if recent
  • Set timeout generously (10min+) to reduce race window
  • Log timeout decisions: "Killing process for user X due to idle since Y" helps debug races
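
The cancel-before-processing pattern from the first bullet, sketched as a small per-user timer (the class name is ours):

```python
import asyncio

class IdleTimer:
    """Per-user idle timer. reset() cancels the pending timeout BEFORE
    new message processing starts, closing the race window."""

    def __init__(self, timeout: float, on_idle):
        self._timeout = timeout
        self._on_idle = on_idle  # async cleanup callback
        self._task: asyncio.Task | None = None

    def reset(self) -> None:
        """Call at the top of every message handler: cancel, then re-arm."""
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.ensure_future(self._fire())

    def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _fire(self) -> None:
        await asyncio.sleep(self._timeout)
        await self._on_idle()
```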

Warning signs:

  • Intermittent BrokenPipeError or ValueError: I/O operation on closed file
  • Happens more often exactly at timeout threshold (e.g., always near 5min mark)
  • Users report "bot randomly stops responding" mid-conversation
  • Logs show process killed, then immediately new message arrives
  • Error rate correlates with idle timeout duration

Phase to address: Phase 5: Idle Management - only add after core interaction loop is bulletproof, requires careful async coordination.


Pitfall 7: Cost Runaway from Failed Haiku Handoff

What goes wrong: The plan is to use Haiku for light tasks, escalate to Opus for complex reasoning. But if escalation logic fails (Haiku doesn't recognize complexity, or handoff mechanism breaks), every request goes to Opus. A user asks 100 simple questions ("what's the weather?") and you burn through $25 in token costs instead of $1. Monthly bill explodes from $50 to $500.

Why it happens: Model routing is fragile: Haiku's job is to decide "do I need Opus?" but it may be too dumb to know when it's too dumb. Complexity heuristics (token count, tool use, keywords) have false negatives. Bugs in handoff code (wrong model parameter, API error) cause fallback to default model (often the expensive one). No budget enforcement means runaway costs go unnoticed until the bill arrives.

How to avoid:

  • Implement per-user daily/monthly cost caps: track tokens used, reject requests over limit
  • Log every model decision: "User X, message Y: using Haiku because Z" for audit trail
  • Monitor cost metrics in real-time: alert if hourly spend exceeds threshold
  • Start with Haiku-only, add Opus escalation LATER once metrics show handoff works
  • Use prompt engineering: system prompt tells Haiku "If you're uncertain, say 'I need help' instead of trying"
  • Test escalation logic extensively with edge cases before production
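
A minimal per-user daily cap, per the first bullet (names and the per-million-token prices are illustrative assumptions; check Anthropic's current pricing):

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative per-1M-token prices -- NOT real pricing; look up current rates.
PRICE_PER_MTOK = {"haiku": 1.0, "opus": 15.0}

@dataclass
class BudgetGuard:
    """Per-user daily spend cap: charge after each interaction,
    check allowed() before starting the next one."""
    daily_cap_usd: float = 1.0
    _spend: dict = field(default_factory=dict)  # (user_id, day) -> usd

    def charge(self, user_id: int, model: str, tokens: int) -> None:
        key = (user_id, date.today())
        cost = tokens / 1_000_000 * PRICE_PER_MTOK[model]
        self._spend[key] = self._spend.get(key, 0.0) + cost

    def allowed(self, user_id: int) -> bool:
        return self._spend.get((user_id, date.today()), 0.0) < self.daily_cap_usd
```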

Warning signs:

  • Anthropic usage dashboard shows 90%+ Opus when expecting 80%+ Haiku
  • Daily spend consistently above projected average
  • Logs show no/few Haiku->Opus escalation events (suggests routing broken)
  • Users report slow responses (Opus is slower) when they expected fast replies
  • Cost-per-interaction metric increases over time without feature changes

Phase to address: Phase 6: Cost Optimization - start Haiku-only in Phase 2, defer Opus handoff until usage patterns are understood.


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Using subprocess.run() instead of asyncio subprocess | Simpler code, no async complexity | Blocks the event loop; bot unresponsive during Claude calls; Telegram timeouts | Never - breaks the async bot entirely |
| Storing session state in memory only (no persistence) | Fast, no file I/O, no corruption risk | Sessions lost on restart; can't implement --resume; no audit trail | MVP only - add persistence by Phase 3 |
| Single global Claude subprocess for all users | Simple: one process to manage, no spawn overhead | Security nightmare (cross-user context leak); single point of failure; no isolation | Never - violates basic security |
| No cost tracking, assuming Haiku is cheap enough | Faster development, less code | Budget surprises; no visibility into usage patterns; can't optimize | Early testing only - add tracking by Phase 2 GA |
| Sending full stdout line-by-line to Telegram | Simple: for line in stdout, looks responsive | Rate limiting; message spam; user annoyance; API costs | Never - batch messages or stream differently |
| Killing processes with SIGKILL instead of graceful shutdown | Reliable: the process always dies immediately | No cleanup; zombie risk; corrupted state; interrupted tool operations | Emergency fallback only - use SIGTERM first |

Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| Claude Code CLI | Assuming stdout contains only assistant messages | Parse the JSON-lines protocol: distinguish message types (assistant, tool, control) and filter accordingly |
| Claude Code CLI | Using interactive mode (no --stdin) | Always use the --stdin flag for programmatic control; never rely on terminal interaction |
| Telegram (python-telegram-bot) | Calling blocking functions in async handlers | Use asyncio.to_thread() for sync code, or use the async subprocess APIs |
| Telegram API | Assuming message sends succeed | Handle telegram.error.RetryAfter (rate limit) and NetworkError (connectivity); retry with exponential backoff |
| systemd service | Relying on Type=simple with asyncio | Use Type=exec or Type=notify so systemd knows when the service is ready, preventing premature "active" status |
| File system (inbox, sessions) | Concurrent read/write without locking | Use the filelock library or asyncio.Lock for critical sections; ensure atomic writes |
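
The JSON-lines advice in the first row can be sketched as a tolerant parser (the `type` values shown are assumptions about the CLI's stream format; verify against your CLI version's documentation):

```python
import json

def parse_stream_line(raw: bytes):
    """Parse one JSON-lines record from the CLI's streaming output.
    Field names here are assumed -- check your CLI version's protocol."""
    line = raw.decode("utf-8", errors="replace").strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Not protocol output (e.g. a stray diagnostic line): surface as-is.
        return {"type": "raw", "text": line}

def is_user_visible(event) -> bool:
    """Forward only assistant-type events to Telegram; drop control/tool noise."""
    return event is not None and event.get("type") in ("assistant", "raw")
```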

Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| One subprocess per message (spawn overhead) | High CPU during bursts; slow response times | Reuse the subprocess across messages in the same session; spawn once per interaction thread | >10 messages/minute per user |
| Loading the full transcript on every message | Latency grows with conversation length | Paginate the transcript; load only recent context plus a summary | >100 messages per session (~50KB transcript) |
| Synchronous file writes to session state | Bot lag spikes during saves; Telegram timeouts | Use async file I/O (aiofiles) or offload to a background task | >5 concurrent users writing state |
| Unbounded message queue per user | Memory grows without limit if Claude is slow | Cap the queue (e.g., 10 pending messages); reject new messages when full | User sends >20 messages while waiting |
| Regex-parsing Claude output line-by-line | CPU spikes with verbose responses | Parse once per message chunk, not per line; use the JSON protocol when possible | Claude outputs >1000 lines |
| Keeping all session objects in memory | Works fine... until OOM | LRU cache with a max size; evict inactive sessions after timeout | >50 concurrent sessions on 4GB RAM |
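
The LRU eviction from the last row, sketched with `collections.OrderedDict` (the class name is ours; the caller is responsible for flushing evicted sessions to disk):

```python
from collections import OrderedDict

class SessionCache:
    """Bounded in-memory session store: least-recently-used sessions are
    evicted once max_size is exceeded, keeping memory flat regardless of
    user count."""

    def __init__(self, max_size: int = 50):
        self._max = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, user_id: int):
        if user_id in self._data:
            self._data.move_to_end(user_id)  # mark as recently used
            return self._data[user_id]
        return None

    def put(self, user_id: int, session) -> list:
        """Store a session; return any evicted (user_id, session) pairs
        so the caller can persist them."""
        self._data[user_id] = session
        self._data.move_to_end(user_id)
        evicted = []
        while len(self._data) > self._max:
            evicted.append(self._data.popitem(last=False))
        return evicted
```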

Security Mistakes

Domain-specific security issues beyond general web security.

| Mistake | Risk | Prevention |
| --- | --- | --- |
| Trusting Telegram user_id without verification | Malicious user spoofs an authorized user ID via the API | Check the authorized_users file on EVERY command; validate against Telegram's cryptographic signatures |
| Passing user input directly to subprocess args | Command injection: user sends /ping; rm -rf / | Strict input validation; use shlex.quote(); never use shell=True |
| Exposing Claude Code's file system access to users | User asks Claude to "read /etc/shadow" and Claude complies | Filter tool use; whitelist allowed paths; run the Claude subprocess in a restricted namespace |
| Storing the Telegram bot token in code or a world-readable file | Token leak allows full bot takeover | Store it in a credentials file with 600 permissions; never commit it to git |
| No rate limiting on expensive operations | DoS: a user spams Claude requests until OOM or the cost limit | Per-user rate limit (e.g., 10 messages/hour); queue depth limit; kill runaway processes |
| Logging sensitive data (messages, API keys) | Log leakage exposes private conversations | Redact message content in logs; log only metadata (user_id, timestamp, status) |
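
The injection defense from the second row, sketched as argv construction with a validation whitelist (the regex and the CLI invocation are illustrative assumptions; tune both to your deployment):

```python
import re

# Conservative whitelist: letters, digits, whitespace, basic punctuation.
ALLOWED_INPUT = re.compile(r"^[\w\s.,?!'\-]{1,2000}$")

def build_claude_argv(user_text: str) -> list[str]:
    """Pass user input as a single argv element via exec (no shell), so
    metacharacters like ';' or '&&' are never interpreted. The command
    name and flag here are illustrative, not a verified invocation."""
    if not ALLOWED_INPUT.match(user_text):
        raise ValueError("input rejected by validation")
    # argv list + shell=False semantics: the text is one opaque argument
    return ["claude", "--print", user_text]
```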

UX Pitfalls

Common user experience mistakes in this domain.

| Pitfall | User Impact | Better Approach |
| --- | --- | --- |
| No feedback while Claude thinks | User waits in silence and assumes the bot is broken | Send "Claude is thinking..." immediately; update every 5s; show the typing indicator |
| Dumping full Claude output as a single 4000-char message | Wall of text; hard to read; loses context | Split into logical chunks (by paragraph/section); send as multiple messages with a slight delay |
| No way to stop a runaway Claude response | User watches helplessly as the bot spams hundreds of lines | Implement a /stop command; show progress ("Sending response X/Y"); allow cancellation |
| Silent failures | Message disappears into the void with no error message | Always confirm receipt: "Got it, processing..." or "Error: rate limit, try again" |
| No context on what Claude knows | User is confused about why the bot remembers or forgets things | Show session state: "Session started 10 min ago, 5 messages" or "New session (use /resume to continue)" |
| Cryptic error messages from the Claude subprocess | "Error: exit code 1" means nothing to the user | Parse Claude's stderr and translate it: "Claude encountered an error: [specific reason]" |

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • Subprocess cleanup: Often missing await proc.wait() after kill - verify all code paths call wait()
  • Error handling on Telegram API: Often missing retry logic on 429/5xx - verify every await bot.send_*() has try/except
  • File locking for session state: Often missing locks on concurrent read/modify/write - verify atomicity with filelock tests
  • Graceful shutdown: Often missing SIGTERM handler - verify systemd restart doesn't leave zombies via ps aux check
  • Cost tracking: Often logs tokens but doesn't enforce limits - verify limit exceeded actually rejects requests
  • Idle timeout cancellation: Often sets timeout but forgets to cancel on new activity - test rapid message burst at T+timeout-1s
  • Output buffering/draining: Often uses PIPE but forgets to drain - test with verbose Claude output (>100KB)
  • Model selection logging: Often switches models but doesn't log decision - verify audit trail shows which model was used and why

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| Deadlocked subprocess | LOW | Detect via a timeout on proc.wait(); send SIGKILL; clean up session state; notify the user "session crashed, please retry" |
| Zombie process accumulation | LOW | Scan for zombies on startup (ps -eo pid,ppid,stat,cmd); kill all matching Claude processes; clear stale session files |
| Corrupted session state | LOW | Catch json.JSONDecodeError; back up the corrupted file to sessions/corrupted/{user_id}_{timestamp}.json; start a fresh session |
| Rate limit cascade | MEDIUM | Pause all message processing for the backoff duration (from the 429 response); queue incoming messages; resume when the limit resets |
| Cost runaway | MEDIUM | Detect via the Anthropic API usage endpoint; auto-disable the bot; send an alert; require manual review before re-enabling |
| State divergence (--resume) | HIGH | Compare expected vs. actual transcript hash on resume; reject the resume on mismatch; fall back to a fresh session with a context summary |
| Race condition on timeout | LOW | Log all process lifecycle events; correlate timestamps to identify the race; fix with locking; restart affected user sessions |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| Asyncio subprocess PIPE deadlock | Phase 1: Core Subprocess | Test with synthetic >64KB output; verify no hang |
| Telegram rate limit cascade | Phase 2: Telegram Integration | Stress test: send 100 rapid messages; verify batching/throttling works |
| Zombie process accumulation | Phase 1: Core Subprocess | Kill the bot during an active Claude call, restart, verify no zombies via ps aux |
| Session state corruption | Phase 3: Session Management | Bombard with concurrent messages (10 in 1s); verify state integrity |
| Claude Code --resume footgun | Phase 4: Resume/Persistence | Resume a session, compare transcript hashes, verify no divergence |
| Idle timeout race condition | Phase 5: Idle Management | Send a message at T+timeout-1s; verify no BrokenPipeError |
| Cost runaway from failed Haiku handoff | Phase 6: Cost Optimization | Simulate 100 requests; verify the model distribution matches expectations (80% Haiku) |


Pitfalls research for: Telegram-to-Claude Code bridge (brownfield Python bot extension) Researched: 2026-02-04