homelab/.planning/research/PITFALLS.md
Mikkel Georgsen 1648a986bc docs: complete project research
Files:
- STACK.md
- FEATURES.md
- ARCHITECTURE.md
- PITFALLS.md
- SUMMARY.md

Key findings:
- Stack: Python 3.12+ with python-telegram-bot 22.6, asyncio subprocess management
- Architecture: Path-based session routing with state machine lifecycle management
- Critical pitfall: Asyncio PIPE deadlock requires concurrent stdout/stderr draining

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 13:37:24 +00:00


# Pitfalls Research
**Domain:** Telegram Bot + Long-Running CLI Subprocess Management
**Researched:** 2026-02-04
**Confidence:** HIGH
## Critical Pitfalls
### Pitfall 1: Asyncio Subprocess PIPE Deadlock
**What goes wrong:**
Using `asyncio.create_subprocess_exec` with `stdout=PIPE` and `stderr=PIPE` causes the subprocess to hang indefinitely when output buffers fill. The parent process awaits `proc.wait()` while the child blocks writing to the full pipe buffer, creating a classic deadlock. This is especially critical with the Claude Code CLI, which produces continuous streaming output.
**Why it happens:**
OS pipe buffers are finite (typically 64KB on Linux). When the child process generates more output than the buffer can hold, it blocks on write(). If the parent isn't actively draining the pipe via `proc.stdout.read()`, the pipe fills and both processes wait forever - child waits for buffer space, parent waits for process exit.
**How to avoid:**
- Use `asyncio.create_task()` to drain stdout/stderr concurrently while waiting for process
- Or use `proc.communicate()` which handles draining automatically
- Or redirect to files instead: `stdout=open('log.txt', 'w')` to bypass pipe limits
- Never call `proc.wait()` when using PIPE without concurrent reading
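A minimal sketch of the concurrent-drain pattern (the command is a stand-in for the real Claude invocation):

```python
import asyncio

async def run_streaming(cmd: list[str]) -> tuple[int, bytes, bytes]:
    """Run a command, draining stdout and stderr concurrently so the
    child never blocks on a full pipe buffer."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # Drain both pipes concurrently; awaiting proc.wait() alone would
    # deadlock once the child fills the ~64KB pipe buffer.
    out_task = asyncio.create_task(proc.stdout.read())
    err_task = asyncio.create_task(proc.stderr.read())
    stdout, stderr = await asyncio.gather(out_task, err_task)
    returncode = await proc.wait()  # safe: pipes are already drained
    return returncode, stdout, stderr
```

A quick way to verify the pattern is to run a command that emits well over 64KB, e.g. `run_streaming(["sh", "-c", "yes | head -c 200000"])`, and confirm it returns instead of hanging.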
**Warning signs:**
- Bot hangs on specific commands that produce verbose output
- Process remains in "S" state (sleeping) indefinitely
- `strace` shows both processes blocked on read/write syscalls
- Works with short output, hangs with verbose Claude responses
**Phase to address:**
Phase 1: Core Subprocess Management - implement proper async draining patterns before any Claude integration.
---
### Pitfall 2: Telegram API Rate Limit Cascade Failures
**What goes wrong:**
When Claude Code generates output faster than Telegram allows sending (roughly 30 messages/second across all chats, about 1/second per individual chat, and 20/minute to the same group), messages queue up. Without proper backpressure handling, the bot triggers `429 Too Many Requests` errors, gets rate-limited for increasing durations (exponential backoff), and eventually the entire message queue fails. Users see partial responses or total silence.
**Why it happens:**
Claude's streaming responses don't know or care about Telegram's rate limits. A single Claude interaction can produce hundreds of lines of output. Naive implementations send each chunk immediately, overwhelming Telegram's API and triggering automatic rate limiting that cascades to ALL bot operations, not just Claude responses.
**How to avoid:**
- Implement message batching: accumulate output for 1-2 seconds before sending
- Use python-telegram-bot's built-in `AIORateLimiter` (the `rate-limiter` extra, v20+)
- Add exponential backoff with `asyncio.sleep()` on 429 errors
- Track messages/second and throttle proactively before hitting limits
- Consider chunking very long output and offering "download full log" instead
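The batching idea above can be sketched as follows; `send` is a placeholder for whatever async wrapper the bot uses around `bot.send_message`, and the interval and size limits are illustrative:

```python
import asyncio

class MessageBatcher:
    """Accumulate output chunks and flush them as one Telegram message
    at most once per `interval` seconds."""

    def __init__(self, send, interval: float = 1.5, max_len: int = 4000):
        self._send = send        # async callable (e.g. a send_message wrapper)
        self._interval = interval
        self._max_len = max_len  # stay under Telegram's 4096-char cap
        self._buffer: list[str] = []
        self._task: asyncio.Task | None = None

    def add(self, chunk: str) -> None:
        """Queue a chunk; schedule a flush if one isn't already pending."""
        self._buffer.append(chunk)
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._flush_later())

    async def _flush_later(self) -> None:
        await asyncio.sleep(self._interval)
        text = "".join(self._buffer)
        self._buffer.clear()
        # Split into Telegram-sized pieces rather than one oversized send.
        for i in range(0, len(text), self._max_len):
            await self._send(text[i:i + self._max_len])
```

Chunks arriving during the sleep window are joined into the same flush, which is exactly the coalescing behavior that keeps the bot under the per-chat rate limit.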
**Warning signs:**
- HTTP 429 errors in logs
- Messages arrive in bursts after long delays
- Bot becomes unresponsive to ALL commands during Claude sessions
- Telegram sends "FloodWait" exceptions with increasing wait times
**Phase to address:**
Phase 2: Telegram Integration - must be solved before exposing Claude streaming output to users.
---
### Pitfall 3: Zombie Process Accumulation
**What goes wrong:**
When the bot crashes, restarts, or kills processes improperly, Claude Code subprocesses are left behind as orphans (still running, reparented to PID 1) or zombies (exited but never reaped, pinning a process-table entry). On a 4GB LXC container, a few orphaned Claude processes can exhaust memory. After days/weeks, dozens of them pile up.
**Why it happens:**
Python's asyncio doesn't automatically clean up child processes on exception or when event loop closes. Calling `proc.kill()` without `await proc.wait()` leaves process in zombie state. systemd restarts don't adopt orphaned children. The Telegram bot's event loop may close while subprocesses are mid-execution.
**How to avoid:**
- Always `await proc.wait()` after termination signals
- Use `try/finally` to ensure cleanup even on exceptions
- Configure systemd `KillMode=control-group` to kill entire process tree on restart
- Implement graceful shutdown handler that waits for all subprocesses
- Use process tracking: maintain dict of active PIDs, verify cleanup on startup
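A minimal terminate-then-reap helper showing the first two bullets together (the grace period is an arbitrary choice):

```python
import asyncio

async def terminate_and_reap(proc: asyncio.subprocess.Process,
                             grace: float = 5.0) -> int:
    """SIGTERM first, escalate to SIGKILL, and always reap with wait()
    so no zombie is left behind."""
    if proc.returncode is not None:
        return proc.returncode  # already exited and reaped
    proc.terminate()
    try:
        return await asyncio.wait_for(proc.wait(), timeout=grace)
    except asyncio.TimeoutError:
        proc.kill()               # did not honor SIGTERM in time
        return await proc.wait()  # reap the SIGKILLed process
```

Call this from a `finally` block in the code path that spawned the process, so cleanup runs even when the handler raises.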
**Warning signs:**
- `ps aux | grep claude` shows processes with different PPIDs or PPID=1
- Memory usage creeps up over days without corresponding active sessions
- Process count increases but active users count doesn't
- `defunct` or `<zombie>` processes in process table
**Phase to address:**
Phase 1: Core Subprocess Management - proper lifecycle management must be foundational.
---
### Pitfall 4: Session State Corruption via Race Conditions
**What goes wrong:**
When a user sends multiple Telegram messages rapidly while Claude is processing, concurrent writes to the session state file corrupt data. Session JSON becomes malformed, context is lost, Claude forgets conversation history mid-interaction. In worst case, file locking fails and two processes write simultaneously, producing invalid JSON that crashes the bot.
**Why it happens:**
Telegram's async handlers run concurrently. Message 1 starts Claude subprocess, Message 2 arrives before Message 1 finishes, both try to update `sessions/{user_id}.json`. Python's file I/O isn't atomic - one write can partially overwrite another. `json.dump()` + `f.write()` is not atomic across asyncio tasks.
**How to avoid:**
- Use `asyncio.Lock` per user: `user_locks[user_id]` ensures serial access to session state
- Or use `filelock` library for cross-process file locking
- Implement atomic writes: write to temp file, then `os.rename()` (atomic on POSIX)
- Queue user messages: new message while Claude active goes to pending queue, processed after current finishes
- Detect corruption: catch `json.JSONDecodeError` on read, backup corrupted file, start fresh session
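The per-user lock, atomic rename, and corruption detection can be combined in one sketch (the session schema here is illustrative):

```python
import asyncio
import json
import os
from collections import defaultdict
from pathlib import Path

# One lock per user serializes read/modify/write within this process.
user_locks: dict[int, asyncio.Lock] = defaultdict(asyncio.Lock)

def atomic_write_json(path: Path, data: dict) -> None:
    """Write to a temp file, then rename: os.replace() is atomic on
    POSIX, so readers never observe a half-written session file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(data))
    os.replace(tmp, path)

async def update_session(user_id: int, path: Path, message: str) -> None:
    async with user_locks[user_id]:
        try:
            state = json.loads(path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            state = {"messages": []}  # missing or corrupted: start fresh
        state["messages"].append(message)
        atomic_write_json(path, state)
```

Note this only protects against races within one process; if multiple processes touch the same session files, add `filelock` on top.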
**Warning signs:**
- `json.JSONDecodeError` in logs
- Users report "bot forgot our conversation"
- Sporadic failures only when users type quickly
- Session files contain partial/mixed JSON from multiple writes
- File size is unexpectedly small (truncation during write)
**Phase to address:**
Phase 3: Session Management - after basic subprocess handling works, before multi-user testing.
---
### Pitfall 5: Claude Code CLI --resume Footgun
**What goes wrong:**
Using `--resume` flag naively to continue sessions seems ideal, but leads to state divergence. The CLI's internal state (transcript, tool outputs, context window) drifts from what the bot thinks happened. Bot displays response A to user, but Claude's transcript shows response B due to regeneration during resume. Messages appear out of order or duplicated.
**Why it happens:**
`--resume` replays the transcript from disk and may regenerate responses if conditions changed (model version updated, non-deterministic sampling). The bot's session state stores "what we showed the user", but Claude's resumed state reflects "what actually happened in the transcript". These diverge over time, especially with tool use where results may differ on replay.
**How to avoid:**
- Avoid `--resume` entirely: start fresh subprocess per interaction, pass conversation history via stdin
- Or implement "resume detection": compare Claude's first message after resume with expected cached response, warn on mismatch
- Or treat --resume as read-only: use it to show transcript to user, but always start fresh for new input
- Store transcript path in session state, verify hash/checksum before resume to detect corruption
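The checksum check in the last bullet might look like this; the session-state keys are hypothetical names, not anything the CLI defines:

```python
import hashlib
from pathlib import Path

def transcript_checksum(path: Path) -> str:
    """Hash the on-disk transcript so a later resume can detect drift."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def safe_to_resume(session_state: dict) -> bool:
    """Compare the checksum recorded when the session was saved with the
    transcript's current contents; refuse --resume on any mismatch."""
    path = Path(session_state["transcript_path"])
    if not path.exists():
        return False
    return transcript_checksum(path) == session_state["transcript_sha256"]
```

On mismatch, fall back to a fresh session seeded with a context summary rather than resuming divergent state.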
**Warning signs:**
- Users see repeated messages they already received
- Bot shows different response than what Claude transcript contains
- Tool use executes twice with different results
- Resume succeeds but conversation context is wrong
**Phase to address:**
Phase 4: Resume/Persistence - only after basic interaction flow is solid, requires deep understanding of transcript format.
---
### Pitfall 6: Idle Timeout Race Condition
**What goes wrong:**
Implementing "kill Claude after N minutes idle" creates a race: user sends message at T+599s, timeout fires at T+600s, both try to access the subprocess. Timeout calls `proc.kill()` while message handler calls `proc.stdin.write()`. Result: `BrokenPipeError`, message lost, user sees error instead of Claude response. In the worst case, timeout cleanup runs mid-response, truncating output.
**Why it happens:**
Asyncio's `asyncio.wait_for()` and timeout tasks don't coordinate with message arrival. The timeout coroutine has no knowledge that a new message just started processing. Both coroutines operate on shared subprocess state without synchronization. Telegram's async handlers run immediately on message arrival, possibly overlapping with timeout logic.
**How to avoid:**
- Cancel timeout task BEFORE starting message processing: `timeout_task.cancel()` in message handler
- Use `asyncio.Lock` to prevent timeout cleanup during active message handling
- Implement "last activity" timestamp: timeout checks timestamp and skips cleanup if recent
- Set timeout generously (10min+) to reduce race window
- Log timeout decisions: "Killing process for user X due to idle since Y" helps debug races
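A sketch of the "last activity" timestamp approach (the `on_idle` callback stands in for the bot's real cleanup; construct this inside a running event loop):

```python
import asyncio
import time

class IdleWatchdog:
    """Kill an idle session only after re-checking the activity
    timestamp; the touch() call in the message handler closes the race
    window where a message arrives just before the timeout fires."""

    def __init__(self, on_idle, timeout: float = 600.0):
        self._on_idle = on_idle   # async cleanup callback (illustrative)
        self._timeout = timeout
        self._last_activity = time.monotonic()
        self._task = asyncio.create_task(self._watch())

    def touch(self) -> None:
        """Call at the top of every message handler, before touching
        the subprocess."""
        self._last_activity = time.monotonic()

    async def _watch(self) -> None:
        while True:
            await asyncio.sleep(self._timeout)
            # Re-check: a message may have arrived at T+599s.
            if time.monotonic() - self._last_activity >= self._timeout:
                await self._on_idle()
                return

    def cancel(self) -> None:
        self._task.cancel()
```

Because the watchdog re-reads the timestamp instead of trusting its own sleep, a `touch()` landing anywhere in the window postpones the kill by a full timeout period.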
**Warning signs:**
- Intermittent `BrokenPipeError` or `ValueError: I/O operation on closed file`
- Happens more often exactly at timeout threshold (e.g., always near 5min mark)
- Users report "bot randomly stops responding" mid-conversation
- Logs show process killed, then immediately new message arrives
- Error rate correlates with idle timeout duration
**Phase to address:**
Phase 5: Idle Management - only add after core interaction loop is bulletproof, requires careful async coordination.
---
### Pitfall 7: Cost Runaway from Failed Haiku Handoff
**What goes wrong:**
The plan is to use Haiku for light tasks, escalate to Opus for complex reasoning. But if escalation logic fails (Haiku doesn't recognize complexity, or handoff mechanism breaks), every request goes to Opus. A user asks 100 simple questions ("what's the weather?") and you burn through $25 in token costs instead of $1. Monthly bill explodes from $50 to $500.
**Why it happens:**
Model routing is fragile: Haiku's job is to decide "do I need Opus?" but it may be too dumb to know when it's too dumb. Complexity heuristics (token count, tool use, keywords) have false negatives. Bugs in handoff code (wrong model parameter, API error) cause fallback to default model (often the expensive one). No budget enforcement means runaway costs go unnoticed until the bill arrives.
**How to avoid:**
- Implement per-user daily/monthly cost caps: track tokens used, reject requests over limit
- Log every model decision: "User X, message Y: using Haiku because Z" for audit trail
- Monitor cost metrics in real-time: alert if hourly spend exceeds threshold
- Start with Haiku-only, add Opus escalation LATER once metrics show handoff works
- Use prompt engineering: system prompt tells Haiku "If you're uncertain, say 'I need help' instead of trying"
- Test escalation logic extensively with edge cases before production
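A minimal per-user cost cap might look like this; the prices are placeholders, not Anthropic's actual rates, and real code would read them from config:

```python
import time
from collections import defaultdict

# Illustrative prices per million tokens -- check current Anthropic pricing.
PRICE_PER_MTOK = {"haiku": 1.0, "opus": 15.0}

class CostTracker:
    """Track per-user daily spend and reject requests over a hard cap."""

    def __init__(self, daily_cap_usd: float = 2.0):
        self._cap = daily_cap_usd
        self._spend: dict[tuple[int, str], float] = defaultdict(float)

    def _key(self, user_id: int) -> tuple[int, str]:
        # Spend resets at midnight because the date is part of the key.
        return user_id, time.strftime("%Y-%m-%d")

    def record(self, user_id: int, model: str, tokens: int) -> None:
        self._spend[self._key(user_id)] += (
            tokens / 1_000_000 * PRICE_PER_MTOK[model]
        )

    def allowed(self, user_id: int) -> bool:
        return self._spend[self._key(user_id)] < self._cap

    def spend_today(self, user_id: int) -> float:
        return self._spend[self._key(user_id)]
```

The important property is that `allowed()` is checked before every model call, so a broken Haiku/Opus router can overspend by at most one request.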
**Warning signs:**
- Anthropic usage dashboard shows 90%+ Opus when expecting 80%+ Haiku
- Daily spend consistently above projected average
- Logs show no/few Haiku->Opus escalation events (suggests routing broken)
- Users report slow responses (Opus is slower) when they expected fast replies
- Cost-per-interaction metric increases over time without feature changes
**Phase to address:**
Phase 6: Cost Optimization - start Haiku-only in Phase 2, defer Opus handoff until usage patterns are understood.
---
## Technical Debt Patterns
Shortcuts that seem reasonable but create long-term problems.
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Using `subprocess.run()` instead of asyncio subprocess | Simpler code, no async complexity | Blocks event loop, bot unresponsive during Claude calls, Telegram timeouts | Never - breaks async bot entirely |
| Storing session state in memory only (no persistence) | Fast, no file I/O, no corruption risk | Sessions lost on restart, can't implement --resume, no audit trail | MVP only - add persistence by Phase 3 |
| Single global Claude subprocess for all users | Simple: one process to manage, no spawn overhead | Security nightmare (cross-user context leak), single point of failure, no isolation | Never - violates basic security |
| No cost tracking, assume Haiku is cheap enough | Faster development, less code | Budget surprises, no visibility into usage patterns, can't optimize | Early testing only - add tracking by Phase 2 GA |
| Sending full stdout line-by-line to Telegram | Simple: `for line in stdout`, looks responsive | Rate limiting, message spam, user annoyance, API costs | Never - batch messages or stream differently |
| Killing process with `SIGKILL` instead of graceful shutdown | Reliable: process always dies immediately | No cleanup, zombie risk, corrupted state, tool operations interrupted | Emergency fallback only - use `SIGTERM` first |
## Integration Gotchas
Common mistakes when connecting to external services.
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Claude Code CLI | Assuming stdout contains only assistant messages | Parse JSON-lines protocol: distinguish between message types (assistant, tool, control), filter accordingly |
| Claude Code CLI | Using interactive mode (no --stdin) | Always use `--stdin` flag for programmatic control, never rely on terminal interaction |
| Telegram python-telegram-bot | Calling blocking functions in async handlers | Use `asyncio.to_thread()` for sync code, or use async subprocess APIs |
| Telegram API | Assuming message sends succeed | Handle `telegram.error.RetryAfter` (rate limit), `NetworkError` (connectivity), retry with exponential backoff |
| systemd service | Relying on `Type=simple` with asyncio | Use `Type=exec` or `Type=notify` to ensure systemd knows when service is ready, prevents premature "active" status |
| File system (inbox, sessions) | Concurrent read/write without locking | Use `filelock` library or `asyncio.Lock` for critical sections, ensure atomic operations |
## Performance Traps
Patterns that work at small scale but fail as usage grows.
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| One subprocess per message (spawn overhead) | High CPU during bursts, slow response time | Reuse subprocess across messages in same session, only spawn once per user interaction thread | >10 messages/minute per user |
| Loading full transcript on every message | Increasing latency as conversation grows | Implement transcript pagination, only load recent context + summary | >100 messages per session (~50KB transcript) |
| Synchronous file writes to session state | Bot lag spikes during saves, Telegram timeouts | Use async file I/O (`aiofiles`) or offload to background task | >5 concurrent users writing state |
| Unbounded message queue per user | Memory grows without limit if Claude is slow | Implement queue size limit (e.g., 10 pending messages), reject new messages when full | User sends >20 messages while waiting |
| Regex parsing of Claude output line-by-line | CPU spikes with verbose responses | Parse once per message chunk, not per line; use JSON protocol when possible | Claude outputs >1000 lines |
| Keeping all session objects in memory | Works fine... until OOM | Implement LRU cache with max size, evict inactive sessions after timeout | >50 concurrent sessions on 4GB RAM |
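The bounded-memory fix for the last trap is a small LRU, sketched here with an in-memory dict as the session object (real code would persist evicted sessions to disk):

```python
from collections import OrderedDict

class SessionCache:
    """Bounded LRU of in-memory session objects: evicts the least
    recently used session instead of growing until OOM."""

    def __init__(self, max_sessions: int = 50):
        self._max = max_sessions
        self._sessions: OrderedDict[int, dict] = OrderedDict()

    def get(self, user_id: int) -> dict:
        if user_id in self._sessions:
            self._sessions.move_to_end(user_id)  # mark as recently used
        else:
            self._sessions[user_id] = {"messages": []}
            if len(self._sessions) > self._max:
                # Oldest entry is at the front; a real implementation
                # would flush this session's state to disk before dropping.
                self._sessions.popitem(last=False)
        return self._sessions[user_id]
```

With `max_sessions=50` on a 4GB box, memory use is bounded regardless of how many distinct users ever talk to the bot.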
## Security Mistakes
Domain-specific security issues beyond general web security.
| Mistake | Risk | Prevention |
|---------|------|------------|
| Trusting Telegram user_id without verification | Forged updates claiming an authorized user ID reach command handlers | Check `authorized_users` on EVERY command; with webhooks, verify the `X-Telegram-Bot-Api-Secret-Token` header so forged updates are rejected |
| Passing user input directly to subprocess args | Command injection: user sends `/ping; rm -rf /` | Strict input validation; pass args as a list and never use `shell=True` (`shlex.quote()` only helps in the rare case a shell is unavoidable) |
| Exposing Claude Code's file system access to users | User asks Claude "read /etc/shadow", Claude complies | Implement tool use filtering, whitelist allowed paths, run Claude subprocess in restricted namespace |
| Storing Telegram bot token in code or world-readable file | Token leak allows full bot takeover | Store in `credentials` file with 600 permissions, never commit to git |
| No rate limiting on expensive operations | DoS: user spams bot with Claude requests until OOM/cost limit | Per-user rate limit (e.g., 10 messages/hour), queue depth limit, kill runaway processes |
| Logging sensitive data (messages, API keys) | Log leakage exposes private conversations | Redact message content in logs, only log metadata (user_id, timestamp, status) |
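The injection row above is worth making concrete; this sketch uses a deliberately strict hostname check (the validation rule is an example, tighten or loosen to fit the command):

```python
import asyncio

async def run_ping(host: str) -> bytes:
    """Pass user input as a single argv element, never through a shell:
    `; rm -rf /` becomes a literal (rejected) hostname, not a command."""
    # Allow only letters, digits, dots, and hyphens -- anything else
    # (spaces, semicolons, slashes) is rejected before exec.
    if not host or not host.replace(".", "").replace("-", "").isalnum():
        raise ValueError("invalid hostname")
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", "1", host,   # argv list: no shell is ever involved
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.DEVNULL,
    )
    out, _ = await proc.communicate()
    return out
```

Even without the validation, `create_subprocess_exec` with an argv list never interprets shell metacharacters; the check just fails fast with a clear error instead of a confusing ping failure.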
## UX Pitfalls
Common user experience mistakes in this domain.
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| No feedback while Claude thinks | User waits in silence, assumes bot is broken | Send "Claude is thinking..." immediately, update with "..." every 5s, show typing indicator |
| Dumping full Claude output as single 4096-char message | Wall of text, hard to read, loses context | Split into logical chunks (by paragraph/section), send as multiple messages with slight delay |
| No way to stop runaway Claude response | User watches helplessly as bot spams hundreds of lines | Implement `/stop` command, show progress "Sending response X/Y", allow cancellation |
| Silent failures | Message disappears into void, no error message | Always confirm receipt: "Got it, processing..." or "Error: rate limit, try again" |
| No context on what Claude knows | User confused why bot remembers/forgets things | Show session state: "Session started 10 min ago, 5 messages" or "New session (use /resume to continue)" |
| Cryptic error messages from Claude subprocess | "Error: exit code 1" means nothing to user | Parse Claude's stderr, translate to user-friendly: "Claude encountered an error: [specific reason]" |
## "Looks Done But Isn't" Checklist
Things that appear complete but are missing critical pieces.
- [ ] **Subprocess cleanup:** Often missing `await proc.wait()` after kill - verify all code paths call wait()
- [ ] **Error handling on Telegram API:** Often missing retry logic on 429/5xx - verify every `await bot.send_*()` has try/except
- [ ] **File locking for session state:** Often missing locks on concurrent read/modify/write - verify atomicity with `filelock` tests
- [ ] **Graceful shutdown:** Often missing SIGTERM handler - verify systemd restart doesn't leave zombies via `ps aux` check
- [ ] **Cost tracking:** Often logs tokens but doesn't enforce limits - verify limit exceeded actually rejects requests
- [ ] **Idle timeout cancellation:** Often sets timeout but forgets to cancel on new activity - test rapid message burst at T+timeout-1s
- [ ] **Output buffering/draining:** Often uses PIPE but forgets to drain - test with verbose Claude output (>100KB)
- [ ] **Model selection logging:** Often switches models but doesn't log decision - verify audit trail shows which model was used and why
## Recovery Strategies
When pitfalls occur despite prevention, how to recover.
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Deadlocked subprocess | LOW | Detect via timeout on `proc.wait()`, send SIGKILL, cleanup session state, notify user "session crashed, please retry" |
| Zombie process accumulation | LOW | Scan for zombies on startup (`ps -eo pid,ppid,stat,cmd`), kill all matching Claude processes, clear stale session files |
| Corrupted session state | LOW | Catch `json.JSONDecodeError`, backup corrupted file to `sessions/corrupted/{user_id}_{timestamp}.json`, start fresh session |
| Rate limit cascade | MEDIUM | Pause all message processing for backoff duration (from 429 response), queue incoming messages, resume when limit resets |
| Cost runaway | MEDIUM | Detect via Anthropic API usage endpoint, auto-disable bot, send alert, manual review before re-enable |
| State divergence (--resume) | HIGH | Compare expected vs actual transcript hash on resume, reject resume if mismatch, fallback to fresh session with context summary |
| Race condition on timeout | LOW | Log all process lifecycle events, correlate timestamps to identify race, fix with locking, restart affected user sessions |
## Pitfall-to-Phase Mapping
How roadmap phases should address these pitfalls.
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Asyncio subprocess PIPE deadlock | Phase 1: Core Subprocess | Test with synthetic >64KB output, verify no hang |
| Telegram rate limit cascade | Phase 2: Telegram Integration | Stress test: send 100 rapid messages, verify batching/throttling works |
| Zombie process accumulation | Phase 1: Core Subprocess | Kill bot during active Claude call, restart, verify no zombies via `ps aux` |
| Session state corruption | Phase 3: Session Management | Test concurrent message bombardment (10 messages in 1s), verify state integrity |
| Claude Code --resume footgun | Phase 4: Resume/Persistence | Resume session, compare transcript hash, verify no divergence |
| Idle timeout race condition | Phase 5: Idle Management | Send message at T+timeout-1s, verify no BrokenPipeError |
| Cost runaway from failed Haiku handoff | Phase 6: Cost Optimization | Simulate 100 requests, verify model distribution matches expectations (80% Haiku) |
## Sources
**Telegram Bot + Subprocess Management:**
- [Building Robust Telegram Bots](https://henrywithu.com/building-robust-telegram-bots/)
- [Common Mistakes When Building Telegram Bots with Node.js](https://infinitejs.com/posts/common-mistakes-telegram-bots-nodejs/)
- [python-telegram-bot Concurrency Wiki](https://github.com/python-telegram-bot/python-telegram-bot/wiki/Concurrency)
- [GitHub Issue #3887: PTB Hangs with Large Update Volume](https://github.com/python-telegram-bot/python-telegram-bot/issues/3887)
**Asyncio Subprocess Pitfalls:**
- [Python CPython Issue #115787: Deadlock in create_subprocess_exec with Semaphore and PIPE](https://github.com/python/cpython/issues/115787)
- [Python.org Discussion: Details of process.wait() Deadlock](https://discuss.python.org/t/details-of-process-wait-deadlock/69481)
- [Python Official Docs: Asyncio Subprocesses](https://docs.python.org/3/library/asyncio-subprocess.html)
**Zombie Processes:**
- [Python asyncio Issue #281: Zombies with set_event_loop(None)](https://github.com/python/asyncio/issues/281)
- [Python CPython Issue #95899: Runner+PidfdChildWatcher Leaves Zombies](https://github.com/python/cpython/issues/95899)
- [Sling Academy: Python asyncio - How to Stop/Kill a Child Process](https://www.slingacademy.com/article/python-asyncio-how-to-stop-kill-a-child-process/)
**Telegram API Limits:**
- [Telegram Bots FAQ: Rate Limits](https://core.telegram.org/bots/faq)
- [Telegram Limits Reference](https://limits.tginfo.me/en)
- [BigMike.help: Local Telegram Bot API Advantages](https://bigmike.help/en/case/local-telegram-bot-api-advantages-limitations-of-the-standard-api-and-set-eb4a3b/)
**Claude Code CLI Protocol:**
- [Inside the Claude Agent SDK: stdin/stdout Communication](https://buildwithaws.substack.com/p/inside-the-claude-agent-sdk-from)
- [Claude Code CLI Reference](https://code.claude.com/docs/en/cli-reference)
- [Building an MCP Server for Claude Code](https://veelenga.github.io/building-mcp-server-for-claude/)
**systemd Process Management:**
- [systemd Advanced Guide for 2026](https://medium.com/@springmusk/systemd-advanced-guide-for-2026-b2fe79af3e78)
- [Arch Linux Forums: Restart systemd Service Without Killing Children](https://bbs.archlinux.org/viewtopic.php?id=212380)
- [systemd.service Manual](https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html)
**Python Asyncio Memory Leaks:**
- [Python CPython Issue #85865: Memory Leak with asyncio and run_in_executor](https://github.com/python/cpython/issues/85865)
- [Victor Stinner: asyncio WSASend() Memory Leak](https://vstinner.github.io/asyncio-proactor-wsasend-memory-leak.html)
**Race Conditions & Concurrency:**
- [Medium: Avoiding File Conflicts in Multithreaded Python](https://medium.com/@aman.deep291098/avoiding-file-conflicts-in-multithreaded-python-programs-34f2888f4521)
- [Super Fast Python: Multiprocessing Race Conditions](https://superfastpython.com/multiprocessing-race-condition-python/)
- [Python CPython Issue #92824: asyncio.wait_for() Race Conditions](https://github.com/python/cpython/issues/92824)
- [Nicholas: Race Conditions with asyncio in Python](https://nicholaslyz.com/blog/2024/03/22/race-conditions-with-asyncio-in-python/)
**Claude API Cost Optimization:**
- [Claude API Pricing Guide 2026](https://www.aifreeapi.com/en/posts/claude-api-pricing-per-million-tokens)
- [MetaCTO: Anthropic Claude API Pricing 2026](https://www.metacto.com/blogs/anthropic-api-pricing-a-full-breakdown-of-costs-and-integration)
- [Finout: Anthropic API Pricing & Cost Optimization Strategies](https://www.finout.io/blog/anthropic-api-pricing)
- [GitHub Issue #17772: Programmatic Model Switching for Autonomous Agents](https://github.com/anthropics/claude-code/issues/17772)
---
*Pitfalls research for: Telegram-to-Claude Code bridge (brownfield Python bot extension)*
*Researched: 2026-02-04*