homelab/.planning/research/PITFALLS.md
Mikkel Georgsen 1648a986bc docs: complete project research

Key findings:
- Stack: Python 3.12+ with python-telegram-bot 22.6, asyncio subprocess management
- Architecture: Path-based session routing with state machine lifecycle management
- Critical pitfall: Asyncio PIPE deadlock requires concurrent stdout/stderr draining

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:37:24 +00:00


Pitfalls Research

Domain: Telegram bot + long-running CLI subprocess management
Researched: 2026-02-04
Confidence: HIGH

Critical Pitfalls

Pitfall 1: Asyncio Subprocess PIPE Deadlock

What goes wrong: Using asyncio.create_subprocess_exec with stdout=PIPE and stderr=PIPE causes the subprocess to hang indefinitely when output buffers fill. The parent process awaits proc.wait() while the child blocks writing to the full pipe buffer, creating a classic deadlock. This is especially critical with Claude Code CLI which produces continuous streaming output.

Why it happens: OS pipe buffers are finite (typically 64KB on Linux). When the child process generates more output than the buffer can hold, it blocks on write(). If the parent isn't actively draining the pipe via proc.stdout.read(), the pipe fills and both processes wait forever - child waits for buffer space, parent waits for process exit.

How to avoid:

  • Use asyncio.create_task() to drain stdout/stderr concurrently while waiting for process
  • Or use proc.communicate() which handles draining automatically
  • Or redirect to files instead: stdout=open('log.txt', 'w') to bypass pipe limits
  • Never call proc.wait() when using PIPE without concurrent reading

Warning signs:

  • Bot hangs on specific commands that produce verbose output
  • Process remains in "S" state (sleeping) indefinitely
  • strace shows both processes blocked on read/write syscalls
  • Works with short output, hangs with verbose Claude responses

Phase to address: Phase 1: Core Subprocess Management - implement proper async draining patterns before any Claude integration.


Pitfall 2: Telegram API Rate Limit Cascade Failures

What goes wrong: When Claude Code generates output faster than Telegram allows sending (roughly 30 messages/second bot-wide, 20 messages/minute to the same group), messages queue up. Without backpressure handling, the bot triggers 429 Too Many Requests errors, gets rate-limited for increasing durations, and eventually the entire message queue fails. Users see partial responses or total silence.

Why it happens: Claude's streaming responses don't know or care about Telegram's rate limits. A single Claude interaction can produce hundreds of lines of output. Naive implementations send each chunk immediately, overwhelming Telegram's API and triggering automatic rate limiting that cascades to ALL bot operations, not just Claude responses.

How to avoid:

  • Implement message batching: accumulate output for 1-2 seconds before sending
  • Use telegram.ext.Application's built-in rate limiter (v20.x+)
  • Add exponential backoff with asyncio.sleep() on 429 errors
  • Track messages/second and throttle proactively before hitting limits
  • Consider chunking very long output and offering "download full log" instead
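
The batching approach can be sketched with a small accumulator (`BatchingSender` and the injected `send` callable are our names; wiring it to python-telegram-bot's `send_message` is left to the handler):

```python
import asyncio

class BatchingSender:
    """Accumulate output chunks and flush at most once per interval, so a
    fast producer (Claude) cannot outpace Telegram's rate limits.
    `send` is any async callable, e.g. a per-chat send_message wrapper."""

    def __init__(self, send, interval: float = 1.5):
        self._send = send
        self._interval = interval
        self._buffer: list[str] = []
        self._lock = asyncio.Lock()

    async def feed(self, chunk: str) -> None:
        async with self._lock:
            self._buffer.append(chunk)

    async def run(self) -> None:
        """Flush loop; cancel this task when the Claude process exits."""
        while True:
            await asyncio.sleep(self._interval)
            await self.flush()

    async def flush(self) -> None:
        async with self._lock:
            if not self._buffer:
                return
            text, self._buffer = "\n".join(self._buffer), []
        await self._send(text)  # one API call per interval, not per line
```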

Warning signs:

  • HTTP 429 errors in logs
  • Messages arrive in bursts after long delays
  • Bot becomes unresponsive to ALL commands during Claude sessions
  • Telegram sends "FloodWait" exceptions with increasing wait times

Phase to address: Phase 2: Telegram Integration - must be solved before exposing Claude streaming output to users.


Pitfall 3: Zombie Process Accumulation

What goes wrong: When the bot crashes, restarts, or kills processes improperly, Claude Code subprocesses are left behind: orphans that keep running (reparented to PID 1) and zombies that have exited but were never reaped. On a 4GB LXC container, a few orphaned Claude processes can exhaust memory. After days or weeks, dozens pile up.

Why it happens: Python's asyncio doesn't automatically clean up child processes on exception or when event loop closes. Calling proc.kill() without await proc.wait() leaves process in zombie state. systemd restarts don't adopt orphaned children. The Telegram bot's event loop may close while subprocesses are mid-execution.

How to avoid:

  • Always await proc.wait() after termination signals
  • Use try/finally to ensure cleanup even on exceptions
  • Configure systemd KillMode=control-group to kill entire process tree on restart
  • Implement graceful shutdown handler that waits for all subprocesses
  • Use process tracking: maintain dict of active PIDs, verify cleanup on startup

Warning signs:

  • ps aux | grep claude shows processes with different PPIDs or PPID=1
  • Memory usage creeps up over days without corresponding active sessions
  • Process count increases but active users count doesn't
  • defunct or <zombie> processes in process table

Phase to address: Phase 1: Core Subprocess Management - proper lifecycle management must be foundational.


Pitfall 4: Session State Corruption via Race Conditions

What goes wrong: When a user sends multiple Telegram messages rapidly while Claude is processing, concurrent writes to the session state file corrupt data. Session JSON becomes malformed, context is lost, Claude forgets conversation history mid-interaction. In worst case, file locking fails and two processes write simultaneously, producing invalid JSON that crashes the bot.

Why it happens: Telegram's async handlers run concurrently. Message 1 starts Claude subprocess, Message 2 arrives before Message 1 finishes, both try to update sessions/{user_id}.json. Python's file I/O isn't atomic - one write can partially overwrite another. json.dump() + f.write() is not atomic across asyncio tasks.

How to avoid:

  • Use asyncio.Lock per user: user_locks[user_id] ensures serial access to session state
  • Or use filelock library for cross-process file locking
  • Implement atomic writes: write to temp file, then os.rename() (atomic on POSIX)
  • Queue user messages: new message while Claude active goes to pending queue, processed after current finishes
  • Detect corruption: catch json.JSONDecodeError on read, backup corrupted file, start fresh session

Warning signs:

  • json.JSONDecodeError in logs
  • Users report "bot forgot our conversation"
  • Sporadic failures only when users type quickly
  • Session files contain partial/mixed JSON from multiple writes
  • File size is unexpectedly small (truncation during write)

Phase to address: Phase 3: Session Management - after basic subprocess handling works, before multi-user testing.


Pitfall 5: Claude Code CLI --resume Footgun

What goes wrong: Using --resume flag naively to continue sessions seems ideal, but leads to state divergence. The CLI's internal state (transcript, tool outputs, context window) drifts from what the bot thinks happened. Bot displays response A to user, but Claude's transcript shows response B due to regeneration during resume. Messages appear out of order or duplicated.

Why it happens: --resume replays the transcript from disk and may regenerate responses if conditions changed (model version updated, non-deterministic sampling). The bot's session state stores "what we showed the user", but Claude's resumed state reflects "what actually happened in the transcript". These diverge over time, especially with tool use where results may differ on replay.

How to avoid:

  • Avoid --resume entirely: start fresh subprocess per interaction, pass conversation history via stdin
  • Or implement "resume detection": compare Claude's first message after resume with expected cached response, warn on mismatch
  • Or treat --resume as read-only: use it to show transcript to user, but always start fresh for new input
  • Store transcript path in session state, verify hash/checksum before resume to detect corruption

Warning signs:

  • Users see repeated messages they already received
  • Bot shows different response than what Claude transcript contains
  • Tool use executes twice with different results
  • Resume succeeds but conversation context is wrong

Phase to address: Phase 4: Resume/Persistence - only after basic interaction flow is solid, requires deep understanding of transcript format.


Pitfall 6: Idle Timeout Race Condition

What goes wrong: Implementing "kill Claude after N minutes idle" creates a race: user sends message at T+599s, timeout fires at T+600s, both try to access the subprocess. Timeout calls proc.kill() while message handler calls proc.stdin.write(). Result: BrokenPipeError, message lost, user sees error instead of Claude response. In worse case, timeout cleanup runs mid-response, truncating output.

Why it happens: asyncio.wait_for() and timeout tasks don't coordinate with message arrival. The timeout coroutine has no knowledge that a new message just started processing. Both coroutines operate on shared subprocess state without synchronization. Telegram's async handlers run immediately on message arrival, possibly overlapping with timeout logic.

How to avoid:

  • Cancel timeout task BEFORE starting message processing: timeout_task.cancel() in message handler
  • Use asyncio.Lock to prevent timeout cleanup during active message handling
  • Implement "last activity" timestamp: timeout checks timestamp and skips cleanup if recent
  • Set timeout generously (10min+) to reduce race window
  • Log timeout decisions: "Killing process for user X due to idle since Y" helps debug races
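
The cancel-before-processing pattern from the first bullet, sketched as a small per-user timer (the class name is ours):

```python
import asyncio

class IdleTimer:
    """Per-user idle timer. reset() cancels the pending timeout BEFORE
    new message processing starts, closing the race window."""

    def __init__(self, timeout: float, on_idle):
        self._timeout = timeout
        self._on_idle = on_idle  # async cleanup callback
        self._task: asyncio.Task | None = None

    def reset(self) -> None:
        """Call at the top of every message handler: cancel, then re-arm."""
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.ensure_future(self._fire())

    def stop(self) -> None:
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _fire(self) -> None:
        await asyncio.sleep(self._timeout)
        await self._on_idle()
```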

Warning signs:

  • Intermittent BrokenPipeError or ValueError: I/O operation on closed file
  • Happens more often exactly at timeout threshold (e.g., always near 5min mark)
  • Users report "bot randomly stops responding" mid-conversation
  • Logs show process killed, then immediately new message arrives
  • Error rate correlates with idle timeout duration

Phase to address: Phase 5: Idle Management - only add after core interaction loop is bulletproof, requires careful async coordination.


Pitfall 7: Cost Runaway from Failed Haiku Handoff

What goes wrong: The plan is to use Haiku for light tasks, escalate to Opus for complex reasoning. But if escalation logic fails (Haiku doesn't recognize complexity, or handoff mechanism breaks), every request goes to Opus. A user asks 100 simple questions ("what's the weather?") and you burn through $25 in token costs instead of $1. Monthly bill explodes from $50 to $500.

Why it happens: Model routing is fragile: Haiku's job is to decide "do I need Opus?" but it may be too dumb to know when it's too dumb. Complexity heuristics (token count, tool use, keywords) have false negatives. Bugs in handoff code (wrong model parameter, API error) cause fallback to default model (often the expensive one). No budget enforcement means runaway costs go unnoticed until the bill arrives.

How to avoid:

  • Implement per-user daily/monthly cost caps: track tokens used, reject requests over limit
  • Log every model decision: "User X, message Y: using Haiku because Z" for audit trail
  • Monitor cost metrics in real-time: alert if hourly spend exceeds threshold
  • Start with Haiku-only, add Opus escalation LATER once metrics show handoff works
  • Use prompt engineering: system prompt tells Haiku "If you're uncertain, say 'I need help' instead of trying"
  • Test escalation logic extensively with edge cases before production
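
A minimal per-user daily cap, per the first bullet (names and the per-million-token prices are illustrative assumptions; check Anthropic's current pricing):

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative per-1M-token prices -- NOT real pricing; look up current rates.
PRICE_PER_MTOK = {"haiku": 1.0, "opus": 15.0}

@dataclass
class BudgetGuard:
    """Per-user daily spend cap: charge after each interaction,
    check allowed() before starting the next one."""
    daily_cap_usd: float = 1.0
    _spend: dict = field(default_factory=dict)  # (user_id, day) -> usd

    def charge(self, user_id: int, model: str, tokens: int) -> None:
        key = (user_id, date.today())
        cost = tokens / 1_000_000 * PRICE_PER_MTOK[model]
        self._spend[key] = self._spend.get(key, 0.0) + cost

    def allowed(self, user_id: int) -> bool:
        return self._spend.get((user_id, date.today()), 0.0) < self.daily_cap_usd
```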

Warning signs:

  • Anthropic usage dashboard shows 90%+ Opus when expecting 80%+ Haiku
  • Daily spend consistently above projected average
  • Logs show no/few Haiku->Opus escalation events (suggests routing broken)
  • Users report slow responses (Opus is slower) when they expected fast replies
  • Cost-per-interaction metric increases over time without feature changes

Phase to address: Phase 6: Cost Optimization - start Haiku-only in Phase 2, defer Opus handoff until usage patterns are understood.


Technical Debt Patterns

Shortcuts that seem reasonable but create long-term problems.

| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
| --- | --- | --- | --- |
| Using subprocess.run() instead of asyncio subprocess | Simpler code, no async complexity | Blocks the event loop; bot unresponsive during Claude calls; Telegram timeouts | Never - breaks the async bot entirely |
| Storing session state in memory only (no persistence) | Fast, no file I/O, no corruption risk | Sessions lost on restart; can't implement --resume; no audit trail | MVP only - add persistence by Phase 3 |
| Single global Claude subprocess for all users | Simple: one process to manage, no spawn overhead | Security nightmare (cross-user context leak); single point of failure; no isolation | Never - violates basic security |
| No cost tracking, assuming Haiku is cheap enough | Faster development, less code | Budget surprises; no visibility into usage patterns; can't optimize | Early testing only - add tracking by Phase 2 GA |
| Sending full stdout line-by-line to Telegram | Simple: for line in stdout, looks responsive | Rate limiting; message spam; user annoyance; API costs | Never - batch messages or stream differently |
| Killing processes with SIGKILL instead of graceful shutdown | Reliable: the process always dies immediately | No cleanup; zombie risk; corrupted state; interrupted tool operations | Emergency fallback only - use SIGTERM first |

Integration Gotchas

Common mistakes when connecting to external services.

| Integration | Common Mistake | Correct Approach |
| --- | --- | --- |
| Claude Code CLI | Assuming stdout contains only assistant messages | Parse the JSON-lines protocol: distinguish message types (assistant, tool, control) and filter accordingly |
| Claude Code CLI | Using interactive mode (no --stdin) | Always use the --stdin flag for programmatic control; never rely on terminal interaction |
| Telegram (python-telegram-bot) | Calling blocking functions in async handlers | Use asyncio.to_thread() for sync code, or use the async subprocess APIs |
| Telegram API | Assuming message sends succeed | Handle telegram.error.RetryAfter (rate limit) and NetworkError (connectivity); retry with exponential backoff |
| systemd service | Relying on Type=simple with asyncio | Use Type=exec or Type=notify so systemd knows when the service is ready, preventing premature "active" status |
| File system (inbox, sessions) | Concurrent read/write without locking | Use the filelock library or asyncio.Lock for critical sections; ensure atomic writes |
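
The JSON-lines advice in the first row can be sketched as a tolerant parser (the `type` values shown are assumptions about the CLI's stream format; verify against your CLI version's documentation):

```python
import json

def parse_stream_line(raw: bytes):
    """Parse one JSON-lines record from the CLI's streaming output.
    Field names here are assumed -- check your CLI version's protocol."""
    line = raw.decode("utf-8", errors="replace").strip()
    if not line:
        return None
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        # Not protocol output (e.g. a stray diagnostic line): surface as-is.
        return {"type": "raw", "text": line}

def is_user_visible(event) -> bool:
    """Forward only assistant-type events to Telegram; drop control/tool noise."""
    return event is not None and event.get("type") in ("assistant", "raw")
```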

Performance Traps

Patterns that work at small scale but fail as usage grows.

| Trap | Symptoms | Prevention | When It Breaks |
| --- | --- | --- | --- |
| One subprocess per message (spawn overhead) | High CPU during bursts; slow response times | Reuse the subprocess across messages in the same session; spawn once per interaction thread | >10 messages/minute per user |
| Loading the full transcript on every message | Latency grows with conversation length | Paginate the transcript; load only recent context plus a summary | >100 messages per session (~50KB transcript) |
| Synchronous file writes to session state | Bot lag spikes during saves; Telegram timeouts | Use async file I/O (aiofiles) or offload to a background task | >5 concurrent users writing state |
| Unbounded message queue per user | Memory grows without limit if Claude is slow | Cap the queue (e.g., 10 pending messages); reject new messages when full | User sends >20 messages while waiting |
| Regex-parsing Claude output line-by-line | CPU spikes with verbose responses | Parse once per message chunk, not per line; use the JSON protocol when possible | Claude outputs >1000 lines |
| Keeping all session objects in memory | Works fine... until OOM | LRU cache with a max size; evict inactive sessions after timeout | >50 concurrent sessions on 4GB RAM |
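
The LRU eviction from the last row, sketched with `collections.OrderedDict` (the class name is ours; the caller is responsible for flushing evicted sessions to disk):

```python
from collections import OrderedDict

class SessionCache:
    """Bounded in-memory session store: least-recently-used sessions are
    evicted once max_size is exceeded, keeping memory flat regardless of
    user count."""

    def __init__(self, max_size: int = 50):
        self._max = max_size
        self._data: OrderedDict = OrderedDict()

    def get(self, user_id: int):
        if user_id in self._data:
            self._data.move_to_end(user_id)  # mark as recently used
            return self._data[user_id]
        return None

    def put(self, user_id: int, session) -> list:
        """Store a session; return any evicted (user_id, session) pairs
        so the caller can persist them."""
        self._data[user_id] = session
        self._data.move_to_end(user_id)
        evicted = []
        while len(self._data) > self._max:
            evicted.append(self._data.popitem(last=False))
        return evicted
```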

Security Mistakes

Domain-specific security issues beyond general web security.

| Mistake | Risk | Prevention |
| --- | --- | --- |
| Trusting Telegram user_id without verification | Malicious user spoofs an authorized user ID via the API | Check the authorized_users file on EVERY command; validate against Telegram's cryptographic signatures |
| Passing user input directly to subprocess args | Command injection: user sends /ping; rm -rf / | Strict input validation; use shlex.quote(); never use shell=True |
| Exposing Claude Code's file system access to users | User asks Claude to "read /etc/shadow" and Claude complies | Filter tool use; whitelist allowed paths; run the Claude subprocess in a restricted namespace |
| Storing the Telegram bot token in code or a world-readable file | Token leak allows full bot takeover | Store it in a credentials file with 600 permissions; never commit it to git |
| No rate limiting on expensive operations | DoS: a user spams Claude requests until OOM or the cost limit | Per-user rate limit (e.g., 10 messages/hour); queue depth limit; kill runaway processes |
| Logging sensitive data (messages, API keys) | Log leakage exposes private conversations | Redact message content in logs; log only metadata (user_id, timestamp, status) |
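
The injection defense from the second row, sketched as argv construction with a validation whitelist (the regex and the CLI invocation are illustrative assumptions; tune both to your deployment):

```python
import re

# Conservative whitelist: letters, digits, whitespace, basic punctuation.
ALLOWED_INPUT = re.compile(r"^[\w\s.,?!'\-]{1,2000}$")

def build_claude_argv(user_text: str) -> list[str]:
    """Pass user input as a single argv element via exec (no shell), so
    metacharacters like ';' or '&&' are never interpreted. The command
    name and flag here are illustrative, not a verified invocation."""
    if not ALLOWED_INPUT.match(user_text):
        raise ValueError("input rejected by validation")
    # argv list + shell=False semantics: the text is one opaque argument
    return ["claude", "--print", user_text]
```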

UX Pitfalls

Common user experience mistakes in this domain.

| Pitfall | User Impact | Better Approach |
| --- | --- | --- |
| No feedback while Claude thinks | User waits in silence and assumes the bot is broken | Send "Claude is thinking..." immediately; update every 5s; show the typing indicator |
| Dumping full Claude output as a single 4000-char message | Wall of text; hard to read; loses context | Split into logical chunks (by paragraph/section); send as multiple messages with a slight delay |
| No way to stop a runaway Claude response | User watches helplessly as the bot spams hundreds of lines | Implement a /stop command; show progress ("Sending response X/Y"); allow cancellation |
| Silent failures | Message disappears into the void with no error message | Always confirm receipt: "Got it, processing..." or "Error: rate limit, try again" |
| No context on what Claude knows | User is confused about why the bot remembers or forgets things | Show session state: "Session started 10 min ago, 5 messages" or "New session (use /resume to continue)" |
| Cryptic error messages from the Claude subprocess | "Error: exit code 1" means nothing to the user | Parse Claude's stderr and translate it: "Claude encountered an error: [specific reason]" |

"Looks Done But Isn't" Checklist

Things that appear complete but are missing critical pieces.

  • Subprocess cleanup: Often missing await proc.wait() after kill - verify all code paths call wait()
  • Error handling on Telegram API: Often missing retry logic on 429/5xx - verify every await bot.send_*() has try/except
  • File locking for session state: Often missing locks on concurrent read/modify/write - verify atomicity with filelock tests
  • Graceful shutdown: Often missing SIGTERM handler - verify systemd restart doesn't leave zombies via ps aux check
  • Cost tracking: Often logs tokens but doesn't enforce limits - verify limit exceeded actually rejects requests
  • Idle timeout cancellation: Often sets timeout but forgets to cancel on new activity - test rapid message burst at T+timeout-1s
  • Output buffering/draining: Often uses PIPE but forgets to drain - test with verbose Claude output (>100KB)
  • Model selection logging: Often switches models but doesn't log decision - verify audit trail shows which model was used and why

Recovery Strategies

When pitfalls occur despite prevention, how to recover.

| Pitfall | Recovery Cost | Recovery Steps |
| --- | --- | --- |
| Deadlocked subprocess | LOW | Detect via a timeout on proc.wait(); send SIGKILL; clean up session state; notify the user "session crashed, please retry" |
| Zombie process accumulation | LOW | Scan for zombies on startup (ps -eo pid,ppid,stat,cmd); kill all matching Claude processes; clear stale session files |
| Corrupted session state | LOW | Catch json.JSONDecodeError; back up the corrupted file to sessions/corrupted/{user_id}_{timestamp}.json; start a fresh session |
| Rate limit cascade | MEDIUM | Pause all message processing for the backoff duration (from the 429 response); queue incoming messages; resume when the limit resets |
| Cost runaway | MEDIUM | Detect via the Anthropic API usage endpoint; auto-disable the bot; send an alert; require manual review before re-enabling |
| State divergence (--resume) | HIGH | Compare expected vs. actual transcript hash on resume; reject the resume on mismatch; fall back to a fresh session with a context summary |
| Race condition on timeout | LOW | Log all process lifecycle events; correlate timestamps to identify the race; fix with locking; restart affected user sessions |

Pitfall-to-Phase Mapping

How roadmap phases should address these pitfalls.

| Pitfall | Prevention Phase | Verification |
| --- | --- | --- |
| Asyncio subprocess PIPE deadlock | Phase 1: Core Subprocess | Test with synthetic >64KB output; verify no hang |
| Telegram rate limit cascade | Phase 2: Telegram Integration | Stress test: send 100 rapid messages; verify batching/throttling works |
| Zombie process accumulation | Phase 1: Core Subprocess | Kill the bot during an active Claude call, restart, verify no zombies via ps aux |
| Session state corruption | Phase 3: Session Management | Bombard with concurrent messages (10 in 1s); verify state integrity |
| Claude Code --resume footgun | Phase 4: Resume/Persistence | Resume a session, compare transcript hashes, verify no divergence |
| Idle timeout race condition | Phase 5: Idle Management | Send a message at T+timeout-1s; verify no BrokenPipeError |
| Cost runaway from failed Haiku handoff | Phase 6: Cost Optimization | Simulate 100 requests; verify the model distribution matches expectations (80% Haiku) |


Pitfalls research for: Telegram-to-Claude Code bridge (brownfield Python bot extension) Researched: 2026-02-04