# Project Research Summary

**Project:** Telegram-to-Claude Code Bridge
**Domain:** AI chatbot integration / long-running subprocess management
**Researched:** 2026-02-04
**Confidence:** HIGH

## Executive Summary

This project extends an existing single-user Telegram bot to spawn and manage Claude Code CLI subprocesses, enabling conversational AI assistance via Telegram with persistent sessions. The core challenge is managing long-running interactive CLI processes through asyncio while avoiding common pitfalls such as pipe deadlocks, zombie processes, and rate-limiting cascades.

The recommended approach uses Python 3.12+ with python-telegram-bot 22.6 for the Telegram interface, asyncio subprocess management for Claude Code CLI integration, and path-based session routing with isolated filesystem directories. Each session maps to a directory containing metadata, conversation history, and file attachments. The architecture implements a state machine (IDLE → SPAWNING → ACTIVE → PROCESSING → SUSPENDED) with idle-timeout monitors to prevent resource exhaustion on the 4GB container.

The critical risks are asyncio subprocess PIPE deadlocks (mitigated by concurrent stdout/stderr draining), zombie process accumulation (mitigated by proper lifecycle management with try/finally cleanup), and Telegram API rate limiting (mitigated by message batching and backpressure handling). Cost optimization through Haiku/Opus model routing should be deferred until core functionality is proven, as routing complexity introduces significant failure modes.

## Key Findings

### Recommended Stack

The stack leverages existing infrastructure (Python 3.12.3, systemd user service) and adds modern async libraries optimized for I/O-bound subprocess management. All dependencies are available in recent stable versions with good asyncio integration.
**Core technologies:**

- **Python 3.12+**: Already deployed, excellent asyncio support, required by python-telegram-bot 22.6
- **python-telegram-bot 22.6**: Latest stable (Jan 2026), native async/await, httpx-based, supports Bot API 9.3
- **asyncio (stdlib)**: Native subprocess management with create_subprocess_exec, non-blocking I/O for concurrent sessions
- **aiofiles 25.1.0**: Async file I/O for session logs and file uploads without blocking the event loop
- **APScheduler 3.11.2**: Job scheduling for idle-timeout timers; AsyncIOScheduler supports native coroutines

**Key pattern:** Use `asyncio.create_subprocess_exec('claude', '--resume', cwd=session_path, stdin=PIPE, stdout=PIPE, stderr=PIPE)` — note that the program and its arguments are passed as separate positional arguments, not as a list — with separate async reader tasks to avoid deadlocks. Never use `communicate()` for interactive processes. Session isolation is achieved through filesystem paths; these are organizational boundaries, not security boundaries.

### Expected Features

Research indicates a clear MVP path with features that users expect from chat-based AI assistants, plus differentiators that leverage Claude Code's unique capabilities.
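The spawn-and-drain approach described under **Key pattern** above can be sketched as follows. This is a minimal illustration, not a verified integration: the `run_session` helper, its parameters, and the idea of passing a prompt on stdin are assumptions; the actual Claude CLI invocation still needs the empirical testing flagged later in this document.

```python
import asyncio
from asyncio.subprocess import PIPE


async def drain(stream, chunks):
    """Continuously read a pipe so the child never blocks on a full 64KB buffer."""
    while True:
        line = await stream.readline()
        if not line:  # EOF: the child closed this pipe
            break
        chunks.append(line.decode(errors="replace"))


async def run_session(*cmd, cwd=None, prompt=None):
    """Spawn an interactive subprocess, draining stdout/stderr concurrently.

    The program and its args are separate positional arguments -- not a list.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd, cwd=cwd, stdin=PIPE, stdout=PIPE, stderr=PIPE
    )
    out, err = [], []
    # Reader tasks must run BEFORE awaiting process exit; otherwise a verbose
    # child fills the OS pipe buffer, blocks on write(), and both sides hang.
    readers = [
        asyncio.create_task(drain(proc.stdout, out)),
        asyncio.create_task(drain(proc.stderr, err)),
    ]
    if prompt is not None:
        proc.stdin.write(prompt.encode())
        await proc.stdin.drain()
    proc.stdin.close()
    await asyncio.gather(*readers)  # readers finish at EOF
    await proc.wait()               # reap the child -- avoids zombies
    return "".join(out), "".join(err)
```

For a real session, the call shape would be something like `run_session('claude', '--resume', cwd=session_path)`, with the reader loop feeding the StreamParser instead of accumulating lists.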
**Must have (table stakes):**

- Basic message send/receive — core functionality
- Session persistence — conversations survive bot restarts
- Typing indicator — expected for 10-60s AI response times
- File upload/download — send files to Claude, receive generated outputs
- Error messages — clear feedback when things break
- Multi-message handling — split long responses at Telegram's 4096-character limit
- Authentication — user whitelist (single-user: one ID)

**Should have (competitive):**

- Named session management — switch between projects (/session homelab, /session dev)
- Idle timeout with suspend/resume — auto-suspend after 10 min idle to save costs
- Session-specific folders — isolated file workspace per session
- Cost tracking per session — show token usage and dollar cost
- Inline keyboard menus — button-based navigation
- Image analysis — send photos; Claude analyzes them with vision

**Defer (v2+):**

- Smart output modes — AI decides verbosity based on context (HIGH complexity)
- Tool call progress notifications — real-time updates (HIGH complexity, rate-limit risk)
- Multi-model routing — Haiku for simple, Opus for complex (HIGH complexity, cost-runaway risk)
- Voice message support — transcription via Whisper (HIGH complexity)
- Multi-user support — requires tenant isolation and adds auth complexity

### Architecture Approach

The architecture implements a layered design with clear separation of concerns: Telegram event handling, session routing, process lifecycle management, and output formatting. Each layer has a single responsibility and communicates through well-defined interfaces.

**Major components:**

1. **Bot Event Loop** — Receives Telegram updates, dispatches to handlers via the python-telegram-bot Application
2. **SessionRouter** — Maps chat_id to session path, creates directories, loads/saves metadata
3. **Session (state machine)** — Owns lifecycle transitions (IDLE → SPAWNING → ACTIVE → PROCESSING → SUSPENDED), tracks last activity
4. **ProcessManager** — Spawns the Claude CLI subprocess with asyncio.create_subprocess_exec, manages stdin/stdout/stderr streams with separate reader tasks
5. **StreamParser** — Parses Claude output (assumes stream-json or line-by-line text), accumulates chunks
6. **ResponseFormatter** — Applies Telegram Markdown, splits at 4096 characters, handles code blocks
7. **IdleMonitor** — Background task that checks last_activity timestamps every 60s and suspends idle sessions

**Data flow:** Telegram Update → Handler → Router → Session → ProcessManager → Claude stdin. Claude stdout → Reader task → Parser → Formatter → Telegram API. Files are saved under the session directory (`images/` or `files/` subfolders) and logged to conversation.jsonl.

### Critical Pitfalls

Research identified seven major failure modes, prioritized by impact and likelihood; the five most critical are summarized here.

1. **Asyncio subprocess PIPE deadlock** — OS pipe buffers fill (64KB) when Claude produces verbose output; the child blocks on write(), the parent waits for exit, and both hang forever. **Avoidance:** Use asyncio.create_task() to drain stdout and stderr concurrently; never call proc.wait() when using PIPE without concurrent reading.
2. **Telegram API rate-limit cascade** — Claude streams output faster than Telegram allows (30 msg/sec), triggering 429 errors that cascade to ALL bot operations. **Avoidance:** Implement message batching (accumulate 1-2s before sending), use python-telegram-bot's built-in rate limiter, and add exponential backoff on 429.
3. **Zombie process accumulation** — Bot crashes/restarts leave orphaned Claude processes consuming memory and exhausting the 4GB container. **Avoidance:** Always await proc.wait() after termination, use try/finally cleanup, configure systemd KillMode=control-group, and verify cleanup on startup.
4. **Session state corruption via race conditions** — Concurrent writes to metadata.json corrupt data when a user sends rapid messages. **Avoidance:** Use an asyncio.Lock per user, write atomically (write to a temp file, then os.rename()), and queue messages (a new message arriving while one is active goes to a pending queue).
5. **Idle timeout race condition** — A user sends a message at T+599s, the timeout fires at T+600s, both access the subprocess, BrokenPipeError. **Avoidance:** Cancel the timeout task BEFORE message processing, use an asyncio.Lock, and check the last_activity timestamp before cleanup.

## Implications for Roadmap

Based on research, the project should be built in 5-6 phases with strict ordering to ensure foundational patterns are correct before adding complexity.

### Phase 1: Session Foundation

**Rationale:** Must establish the multi-session filesystem structure and routing BEFORE subprocess complexity. Path-based isolation is foundational — nearly everything depends on solid session management per the FEATURES.md dependency analysis.

**Delivers:** Session class with metadata.json schema, SessionRouter with chat_id → session_name mapping, /session command, conversation.jsonl append logging.

**Addresses:** Session persistence (table stakes), named session management (differentiator), session folders (differentiator).

**Avoids:** The session state corruption pitfall, by establishing atomic write patterns and locking semantics early.

### Phase 2: Process Management

**Rationale:** Core subprocess integration must be bulletproof before adding Telegram integration. The ARCHITECTURE.md build order explicitly sequences StreamParser → ProcessManager → Session integration. Must avoid the PIPE deadlock pitfall from day one.

**Delivers:** ProcessManager with asyncio.create_subprocess_exec, separate stdout/stderr reader tasks, graceful shutdown with try/finally cleanup, StreamParser for Claude output.

**Uses:** asyncio stdlib subprocess, proper draining patterns from STACK.md.

**Implements:** The ProcessManager and StreamParser components from ARCHITECTURE.md.

**Avoids:** Asyncio PIPE deadlock (#1 critical pitfall) and zombie process accumulation (#3 critical pitfall), through proper lifecycle management.
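The lifecycle management Phase 2 depends on — terminate politely, escalate if needed, and always reap — can be sketched as follows. This is a minimal illustration; the 5-second grace period and the SIGTERM→SIGKILL escalation ladder are assumptions, not requirements from the Claude CLI.

```python
import asyncio


async def shutdown(proc: asyncio.subprocess.Process, grace: float = 5.0) -> int:
    """Terminate a child process, escalating to SIGKILL, and always reap it.

    Reaping (await proc.wait()) is what prevents zombie accumulation.
    """
    try:
        if proc.returncode is None:
            proc.terminate()  # polite SIGTERM first
            try:
                await asyncio.wait_for(proc.wait(), timeout=grace)
            except asyncio.TimeoutError:
                proc.kill()  # escalate: SIGKILL cannot be ignored
    except ProcessLookupError:
        pass  # child exited between the returncode check and the signal
    finally:
        await proc.wait()  # always reap; safe to call even after exit
    return proc.returncode
```

Wrapping every session's teardown in this helper (called from a try/finally around message processing) is what keeps a crash-prone interaction loop from leaking processes into the 4GB container.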
### Phase 3: Telegram Integration

**Rationale:** With subprocess management working, integrate with the Telegram API and handle rate limiting. ARCHITECTURE.md sequences formatter → session integration → file handling.

**Delivers:** TelegramFormatter for message chunking and Markdown, integration with bot handlers, file upload/download to session directories, typing indicator, error messages.

**Addresses:** Message splitting, typing indicator, file upload/download, error handling (all table stakes from FEATURES.md).

**Avoids:** Telegram rate-limit cascade (#2 critical pitfall), through message batching and backpressure.

### Phase 4: Idle Management

**Rationale:** Only add idle timeout AFTER the core interaction loop is proven. PITFALLS.md explicitly warns: "only add after core interaction loop is bulletproof, requires careful async coordination."

**Delivers:** IdleMonitor background task, last_activity tracking, graceful suspend on timeout, transparent resume on the next message.

**Addresses:** Idle timeout with suspend/resume (differentiator from FEATURES.md).

**Implements:** The IdleMonitor component from ARCHITECTURE.md.

**Avoids:** Idle timeout race condition (#5 critical pitfall), through timeout-task cancellation and locking.

### Phase 5: Production Hardening

**Rationale:** Add observability, error recovery, and session cleanup after the core features work. ARCHITECTURE.md Phase 5 focuses on error handling, session recovery, and monitoring.

**Delivers:** Error handling with retry logic, session recovery on bot restart (scan sessions/, transition ACTIVE → SUSPENDED), /sessions and /session_stats commands, structured logging.

**Addresses:** Operational requirements not captured in feature research.

**Avoids:** Technical-debt accumulation, by codifying error-handling patterns early.

### Phase 6: Cost Optimization (DEFER)

**Rationale:** Multi-model routing (Haiku/Opus) should be deferred until usage patterns are clear. PITFALLS.md identifies cost runaway as a critical risk (#7), and STACK.md recommends: "start Haiku-only in Phase 2, defer Opus handoff until usage patterns understood."

**Delivers:** ModelSelector for command-vs-conversation classification, Haiku for monitoring commands (/status, /pbs), Opus for conversation, cost tracking and limits.

**Addresses:** Cost tracking (differentiator), multi-model routing (deferred feature).

**Avoids:** Cost runaway from a failed Haiku handoff (#7 critical pitfall), by deferring until metrics validate the routing logic.

### Phase Ordering Rationale

- **Sessions first, subprocess second:** The FEATURES.md dependency graph shows session management is foundational. Path-based routing must work before spawning processes in those paths.
- **Process management isolated from Telegram:** The ARCHITECTURE.md build order separates subprocess concerns (Phase 2) from Telegram integration (Phase 3). This allows testing Claude interaction without rate-limiting complications.
- **Idle timeout only after the core is proven:** PITFALLS.md explicitly warns about idle-timeout race conditions. Adding timeout logic to an unproven interaction loop creates a debugging nightmare.
- **Cost optimization last:** PITFALLS.md shows that model-routing complexity creates failure modes (wrong model, fallback bugs, heuristic failures). Defer until core value is proven and usage data is available for optimization.

### Research Flags

Phases likely needing deeper research during planning:

- **Phase 2 (Process Management):** Claude Code CLI --resume behavior with pipes vs PTY is unknown, and the output format for tool calls is not documented; needs empirical testing.
- **Phase 3 (Telegram Integration):** The message batching strategy needs validation against actual Claude output patterns; chunk split points require experimentation.

Phases with standard patterns (skip research-phase):

- **Phase 1 (Session Foundation):** Filesystem-based session management is well-documented, and the JSON schema is straightforward.
- **Phase 4 (Idle Management):** APScheduler patterns are standard; timeout logic is a proven pattern.
- **Phase 5 (Production Hardening):** Error handling and logging are general Python best practices.

## Confidence Assessment

| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All dependencies verified against official PyPI/documentation sources; versions current as of Jan 2026 |
| Features | HIGH | Based on official Telegram Bot documentation and multiple current implementations (OpenClaw, claude-code-telegram, Claude-Code-Remote) |
| Architecture | HIGH | Asyncio subprocess patterns verified against official Python docs; state machine approach proven in OpenClaw session management |
| Pitfalls | HIGH | Deadlock pitfalls documented in CPython issues, rate limiting in official Telegram docs, zombie processes in the asyncio issue tracker |

**Overall confidence:** HIGH

All four research areas are grounded in official documentation and verified with multiple independent sources. Stack versions were confirmed via API queries (not training data). Architecture patterns were validated against Python stdlib documentation. Pitfalls are sourced from actual bug reports and issue trackers, not speculation.

### Gaps to Address

While confidence is high, some areas require empirical validation during implementation:

- **Claude Code CLI output format:** Documentation mentions stream-json support, but the exact event schema is not published. Will need to test the `--output-format stream-json` flag and parse actual output to determine message boundaries, tool call markers, and error formats.
- **Claude Code --resume behavior:** Whether --resume preserves context across process restarts with stdin/stdout pipes (vs requiring a TTY) is not documented. STACK.md notes "needs testing" for TTY detection. May need the ptyprocess library if pipes are insufficient.
- **Optimal idle timeout duration:** 10 minutes is suggested based on general chatbot patterns, but actual usage may require tuning. Monitor session activity patterns in Phase 4 to optimize.
- **Message batching strategy:** 1-2 second accumulation is recommended to avoid rate limits, but the optimal batch size depends on Claude response patterns. Phase 3 should experiment with chunk sizes and timing.
- **Resource usage per session:** The Claude Code memory footprint is estimated at 100-300MB but not verified. Phase 2 should monitor memory — e.g. with psutil's `Process.memory_info()`; note that `cpu_percent()` measures CPU, not memory — and adjust concurrent session limits if needed.

## Sources

### Primary (HIGH confidence)

- [python-telegram-bot PyPI](https://pypi.org/project/python-telegram-bot/) — Version 22.6, dependencies, API compatibility
- [Python asyncio subprocess documentation](https://docs.python.org/3/library/asyncio-subprocess.html) — Process class, create_subprocess_exec, deadlock warnings
- [Claude Code CLI Reference](https://code.claude.com/docs/en/cli-reference) — CLI flags, --resume, --output-format, --no-interactive
- [Telegram Bot API Documentation](https://core.telegram.org/bots/api) — Rate limits, message format, file handling
- [APScheduler PyPI](https://pypi.org/project/APScheduler/) — Version 3.11.2, AsyncIOScheduler
- [aiofiles PyPI](https://pypi.org/project/aiofiles/) — Version 25.1.0

### Secondary (MEDIUM confidence)

- [OpenClaw Telegram Bot Setup](https://macaron.im/blog/openclaw-telegram-bot-setup) — Session management patterns
- [claude-code-telegram GitHub](https://github.com/RichardAtCT/claude-code-telegram) — Implementation reference
- [Python CPython Issue #115787](https://github.com/python/cpython/issues/115787) — Subprocess PIPE deadlock details
- [Telegram Bots FAQ: Rate Limits](https://core.telegram.org/bots/faq) — API limits
- [Python asyncio Issue #281](https://github.com/python/asyncio/issues/281) — Zombie process patterns

### Tertiary (LOW confidence, needs validation)

- Claude Code CLI stream-json protocol schema — Not documented officially; requires empirical testing
- Claude Code subprocess resource usage — No published benchmarks; monitor in practice
- Optimal message batch timing for Telegram — Requires experimentation with actual Claude output

---

*Research completed: 2026-02-04*
*Ready for roadmap: yes*