Compare commits

..

No commits in common. "c23a6879816887f238edd941df49e352619926aa" and "c50c3480043955590f2060aa7fff177612b965a5" have entirely different histories.

55 changed files with 61 additions and 11980 deletions

View file

@ -1,70 +0,0 @@
# Telegram Claude Code Bridge
## What This Is
A Telegram bot frontend for Claude Code that lets Mikkel chat with Claude naturally from Telegram, with full tool access to the homelab. Messages flow through the existing Telegram bot to Claude Code CLI sessions, with session management, idle timeout, and file handling — turning Telegram into a mobile terminal for Claude Code.
## Core Value
Frictionless conversation with Claude Code from anywhere via Telegram — no SSH, no manual inbox checking, just message and get a response.
## Requirements
### Validated
- ✓ Telegram bot receives messages and saves to inbox — existing
- ✓ CLI helper sends messages back to Telegram — existing
- ✓ Bot handles photos and files — existing
- ✓ Helper scripts (pve, dns, pbs, beszel, kuma, etc.) available — existing
### Active
- [ ] Auto-respond: incoming Telegram message triggers Claude Code session that reads and responds
- [ ] Session management: path-based sessions (`~/telegram/sessions/<name>/`) with own Claude Code history and files
- [ ] Session switching: `/session <name>` command in Telegram to switch active session
- [ ] Session listing: `/sessions` command showing all sessions sorted by last activity
- [ ] Session creation: `/new <name>` creates a session; first-ever message auto-creates a default session
- [ ] Default session: messages go to last active session unless explicitly switched
- [ ] Idle timeout: configurable timer that gracefully suspends Claude Code session after inactivity
- [ ] Session resume: suspended sessions resume with full conversation history via `claude --resume`
- [ ] Cost-efficient polling: Haiku-based loop monitors for new messages, only spawns/resumes Opus session when needed
- [ ] Output modes: default shows final answer + tool-call progress notifications; toggleable `/verbose` (stream everything) and `/smart` (smart truncation with file attachments)
- [ ] File handling: any file type/size attached in Telegram gets saved to session folder and made available to Claude
- [ ] Full tool access: Claude Code sessions have same capabilities as terminal (SSH, helper scripts, file ops)
### Out of Scope
- Multi-user support — personal use only, single admin
- Web UI — Telegram is the interface
- Rate limiting / abuse prevention — trusted single user
- Message encryption beyond Telegram's built-in — trusted environment
## Context
- Existing Telegram bot runs as systemd user service (`telegram-bot.service`) on mgmt container (102)
- Bot written in Python using python-telegram-bot library
- Current bot handles: `/status`, `/pbs`, `/backups`, `/beszel`, `/kuma`, `/ping` commands
- Text messages saved to `~/homelab/telegram/inbox.json`, photos to `telegram/images/`, files to `telegram/files/`
- Claude Code CLI available on mgmt container, supports `--resume` for session continuation
- Helper scripts in `~/bin/` provide API access to all homelab services
- Python venv at `~/venv` with dependencies installed
## Constraints
- **Runtime**: Python, must integrate with existing telegram bot codebase
- **Container**: Runs on mgmt LXC (102), 4GB RAM, 4 CPU — must be resource-conscious
- **Cost**: Haiku for polling/monitoring, Opus only for actual conversation — minimize API spend
- **Single user**: Only Mikkel's Telegram account interacts with the bot
- **Systemd**: Must run as systemd user service like existing bot
## Key Decisions
| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Claude Code CLI over raw API | Gets full tool access, --resume support, and all Claude Code features for free | — Pending |
| Haiku polling + Opus conversation | Avoids burning expensive Opus tokens on idle monitoring | — Pending |
| Path-based sessions | Leverages Claude Code's native session-per-directory behavior, files naturally scoped | — Pending |
| Extend existing bot | Reuse proven Telegram integration rather than building from scratch | — Pending |
---
*Last updated: 2026-02-04 after initialization*

View file

@ -1,98 +0,0 @@
# Requirements: Telegram Claude Code Bridge
**Defined:** 2026-02-04
**Core Value:** Frictionless conversation with Claude Code from anywhere via Telegram
## v1 Requirements
### Core Messaging
- [x] **MSG-01**: Incoming Telegram message auto-triggers Claude Code session and sends response back
- [x] **MSG-02**: Typing indicator shown in Telegram while Claude is processing
- [x] **MSG-03**: Brief tool-call progress notifications sent to Telegram (e.g. "Reading file...")
- [x] **MSG-04**: Files/photos attached in Telegram saved to session folder and available to Claude
### Session Management
- [x] **SESS-01**: Path-based sessions stored in `~/telegram/sessions/<name>/` with own Claude Code history
- [x] **SESS-02**: `/session <name>` command switches active session
- [x] **SESS-03**: `/sessions` command lists all sessions sorted by last activity
- [x] **SESS-04**: `/new <name>` creates new session; first-ever message auto-creates default session
### Lifecycle
- [x] **LIFE-01**: Configurable idle timeout suspends Claude Code session after inactivity
- [x] **LIFE-02**: Suspended sessions resume with full conversation history via `claude --resume`
- [x] **LIFE-03**: Graceful process cleanup on bot stop/restart (no zombie processes)
- [x] **LIFE-04**: `/timeout <minutes>` command changes idle timeout from Telegram
### Output Modes
- [x] **OUT-01**: Default mode: final answer + brief tool-call progress notifications
- [ ] **OUT-02**: `/verbose` mode: stream full Claude Code output across multiple messages
- [ ] **OUT-03**: `/smart` mode: smart truncation with long outputs sent as file attachments
### Infrastructure
- [x] **INFRA-01**: Runs as systemd user service alongside existing bot
- [x] **INFRA-02**: Async subprocess management via asyncio (no PIPE deadlocks)
- [x] **INFRA-03**: Concurrent stdout/stderr draining prevents buffer overflow
## v2 Requirements
### Cost Optimization
- **COST-01**: Haiku-based polling loop monitors for new messages, only spawns Opus for conversation
- **COST-02**: Route simple commands (/status, /pbs) to Haiku instead of Opus
- **COST-03**: Token usage tracking and reporting per session
### Advanced Features
- **ADV-01**: Session export (conversation history as markdown file)
- **ADV-02**: Session archiving (compress and move inactive sessions)
- **ADV-03**: Proactive notifications (Claude alerts about homelab events)
## Out of Scope
| Feature | Reason |
|---------|--------|
| Multi-user support | Personal use only, single admin |
| Web UI | Telegram is the interface |
| Rate limiting / abuse prevention | Trusted single user |
| Voice messages | High complexity, text is sufficient |
| Inline bot mode | Adds complexity, no benefit for single user |
| Real-time character streaming | Telegram API not designed for it, causes rate limit issues |
## Traceability
| Requirement | Phase | Status |
|-------------|-------|--------|
| MSG-01 | Phase 2 | Complete |
| MSG-02 | Phase 2 | Complete |
| MSG-03 | Phase 2 | Complete |
| MSG-04 | Phase 2 | Complete |
| SESS-01 | Phase 1 | Complete |
| SESS-02 | Phase 1 | Complete |
| SESS-03 | Phase 3 | Complete |
| SESS-04 | Phase 1 | Complete |
| LIFE-01 | Phase 3 | Complete |
| LIFE-02 | Phase 3 | Complete |
| LIFE-03 | Phase 3 | Complete |
| LIFE-04 | Phase 3 | Complete |
| OUT-01 | Phase 2 | Complete |
| OUT-02 | Phase 4 | Pending |
| OUT-03 | Phase 4 | Pending |
| INFRA-01 | Phase 2 | Complete |
| INFRA-02 | Phase 1 | Complete |
| INFRA-03 | Phase 1 | Complete |
**Coverage:**
- v1 requirements: 18 total
- Mapped to phases: 18
- Unmapped: 0
**Coverage validation:** All 18 v1 requirements mapped to exactly one phase. No orphans.
---
*Requirements defined: 2026-02-04*
*Last updated: 2026-02-04 after roadmap creation*

View file

@ -1,95 +0,0 @@
# Roadmap: Telegram Claude Code Bridge
## Overview
This project transforms Telegram into a mobile interface for Claude Code, enabling frictionless AI assistance from anywhere. The journey starts with session and subprocess foundations, integrates Telegram messaging with file handling, adds idle timeout management to conserve resources, and finishes with advanced output modes for power users.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3): Planned milestone work
- Decimal phases (2.1, 2.2): Urgent insertions (marked with INSERTED)
Decimal phases appear between their surrounding integers in numeric order.
- [x] **Phase 1: Session & Process Foundation** - Multi-session filesystem structure with subprocess management
- [x] **Phase 2: Telegram Integration** - Core messaging loop with file handling and typing indicators
- [x] **Phase 3: Lifecycle Management** - Idle timeout, suspend/resume, and graceful cleanup
- [ ] **Phase 4: Output Modes** - Advanced output control for verbose and smart modes
## Phase Details
### Phase 1: Session & Process Foundation
**Goal**: Path-based sessions with isolated directories spawn and manage Claude Code subprocesses safely
**Depends on**: Nothing (first phase)
**Requirements**: SESS-01, SESS-02, SESS-04, INFRA-02, INFRA-03
**Success Criteria** (what must be TRUE):
1. User can create named sessions via `/new <name>` command
2. User can switch between sessions with `/session <name>` command
3. Each session has isolated directory at `~/telegram/sessions/<name>/` with metadata and conversation log
4. Claude Code subprocess spawns in session directory and processes input without pipe deadlock
5. Subprocess terminates cleanly on session switch with no zombie processes
**Plans:** 3 plans
Plans:
- [x] 01-01-PLAN.md -- Session manager module and persona library
- [x] 01-02-PLAN.md -- Claude Code subprocess engine
- [x] 01-03-PLAN.md -- Bot command integration (/new, /session, message routing)
### Phase 2: Telegram Integration
**Goal**: Messages flow bidirectionally between Telegram and Claude with file support and status feedback
**Depends on**: Phase 1
**Requirements**: MSG-01, MSG-02, MSG-03, MSG-04, OUT-01, INFRA-01
**Success Criteria** (what must be TRUE):
1. User sends message in Telegram and receives Claude's response back
2. Typing indicator appears in Telegram while Claude is processing (10-60s responses)
3. User sees brief progress notifications for tool calls (e.g. "Reading file...")
4. User attaches file/photo in Telegram and Claude can access it in session folder
5. Long responses split at 4096 char Telegram limit with proper code block handling
6. Bot runs as systemd user service and survives container restarts
**Plans:** 2 plans
Plans:
- [x] 02-01-PLAN.md -- Persistent subprocess engine + message formatting utilities
- [x] 02-02-PLAN.md -- Bot integration with batching, file handling, and systemd service
### Phase 3: Lifecycle Management
**Goal**: Sessions suspend automatically after idle period and resume transparently with full context
**Depends on**: Phase 2
**Requirements**: LIFE-01, LIFE-02, LIFE-03, LIFE-04, SESS-03
**Success Criteria** (what must be TRUE):
1. Session suspends automatically after configurable idle timeout (default 10 minutes)
2. User sends message to suspended session and conversation resumes with full history
3. User can change idle timeout via `/timeout <minutes>` command
4. User can list all sessions with last activity timestamp via `/sessions` command
5. Bot restart leaves no zombie processes (systemd KillMode handles cleanup)
**Plans:** 2 plans
Plans:
- [x] 03-01-PLAN.md -- Idle timer module + session metadata extensions + PID tracking
- [x] 03-02-PLAN.md -- Suspend/resume wiring, /timeout, /sessions, startup cleanup, graceful shutdown
### Phase 4: Output Modes
**Goal**: Users control response verbosity and format based on context
**Depends on**: Phase 3
**Requirements**: OUT-02, OUT-03
**Success Criteria** (what must be TRUE):
1. User enables `/verbose` mode and sees full Claude Code output streamed across multiple messages
2. User enables `/smart` mode and long outputs are sent as file attachments instead of truncation
3. Default mode (already from Phase 2) shows final answer with brief progress notifications
**Plans**: TBD
Plans:
- [ ] TBD
## Progress
**Execution Order:**
Phases execute in numeric order: 1 -> 2 -> 3 -> 4
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Session & Process Foundation | 3/3 | Complete | 2026-02-04 |
| 2. Telegram Integration | 2/2 | Complete | 2026-02-04 |
| 3. Lifecycle Management | 2/2 | Complete | 2026-02-04 |
| 4. Output Modes | 0/TBD | Not started | - |

View file

@ -1,100 +0,0 @@
# Project State
## Project Reference
See: .planning/PROJECT.md (updated 2026-02-04)
**Core value:** Frictionless conversation with Claude Code from anywhere via Telegram — no SSH, no manual inbox checking, just message and get a response.
**Current focus:** Phase 3 complete, ready for Phase 4
## Current Position
Phase: 3 of 4 (Lifecycle Management) — COMPLETE
Plan: 03-02 complete (2 of 2 plans completed)
Status: Complete
Last activity: 2026-02-04 — Completed Phase 3 (Lifecycle Management)
Progress: [███████████░░░░] 75%
## Performance Metrics
**Velocity:**
- Total plans completed: 7
- Average duration: 16 min
- Total execution time: 2.05 hours
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 1 | 3 | 27min | 9min |
| 2 | 2 | 95min | 48min |
| 3 | 2 | 6min | 3min |
**Recent Trend:**
- Last 3 plans: 02-02 (90min), 03-01 (2min), 03-02 (4min)
- Phase 3 maintaining fast execution: lightweight integration tasks
*Updated after each plan completion*
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- Claude Code CLI over raw API: Gets full tool access, --resume support, and all Claude Code features for free
- Haiku polling + Opus conversation: Avoids burning expensive Opus tokens on idle monitoring (deferred to v2)
- Path-based sessions: Leverages Claude Code's native session-per-directory behavior, files naturally scoped
- Extend existing bot: Reuse proven Telegram integration rather than building from scratch
- Sessions created as 'idle', activated explicitly: Creating doesn't mean in use, switch required (01-01)
- Metadata read from disk on demand: No caching to avoid stale state (01-01)
- Asyncio.gather for concurrent stream reading: Prevents pipe deadlock (01-02)
- Fresh process per turn: Spawn new `claude -p` invocation for Phase 1 simplicity (01-02)
- Callback architecture: Decouple subprocess from session management via on_output/on_error/on_complete/on_status (01-02)
- Sibling imports over package imports: Avoids shadowing pip telegram package (01-03)
- Archive sessions with tar+pigz: Compression + cleanup to sessions_archive/ (01-03)
- Persistent subprocess instead of fresh per turn: Eliminates ~1s spawn overhead, maintains context (02-01)
- Split messages at 4000 chars (not 4096): Leaves room for MarkdownV2 escape expansion (02-01)
- Never split inside code blocks: Track in_code_block state, only split when safe (02-01)
- Dynamic typing event lookup: Callbacks reference typing_tasks dict by session name, not captured event (02-02)
- --append-system-prompt instead of --system-prompt: Preserves Claude Code model identity (02-02)
- --dangerously-skip-permissions: Full tool access in non-interactive subprocess (02-02)
- Full model ID in persona: Use claude-sonnet-4-5-20250929 instead of alias (02-02)
- Stream-json NDJSON format: {type: user, message: {role: user, content: text}} (02-02)
- Default 600s (10 min) idle timeout per session: Balances responsiveness with resource conservation (03-01)
- Timer reset via task cancellation: Cancel existing task, create new background sleep task (03-01)
- PID property returns live process ID only: None if terminated to prevent stale references (03-01)
- Silent suspension: No Telegram message when session auto-suspends (03-02, from CONTEXT.md)
- Switching sessions leaves previous subprocess running: It suspends on its own timer (03-02, from CONTEXT.md)
- Race prevention via per-session asyncio.Lock: Prevents concurrent suspend + resume on same session (03-02)
- Resume shows idle duration if >1 min: "Resuming session (idle for 15 min)..." (03-02)
- Orphaned PID verification via /proc/cmdline: Only kill claude processes at startup (03-02)
### Pending Todos
None yet.
### Blockers/Concerns
**Phase 1 (Session & Process Foundation) — COMPLETE**
- ~~Claude Code CLI --resume behavior with pipes vs PTY unknown~~ — RESOLVED
- ~~Output format for tool calls not documented~~ — RESOLVED
**Phase 2 (Telegram Integration) — COMPLETE**
- ~~Non-persistent process model (spawned fresh per turn)~~ — RESOLVED (02-01)
- ~~Message batching strategy needs validation~~ — RESOLVED: Works with 2s debounce (02-02)
- ~~File upload flow needs end-to-end testing~~ — RESOLVED: User-verified (02-02)
- ~~Typing indicator not visible despite API success~~ — RESOLVED: Stale task cleanup + dynamic event lookup (02-02)
- ~~Model identifies as wrong version~~ — RESOLVED: --append-system-prompt preserves CLI defaults (02-02)
**Phase 3 (Lifecycle Management) — COMPLETE**
- None
## Session Continuity
Last session: 2026-02-04T23:45:00Z
Stopped at: Completed Phase 3 (Lifecycle Management)
Resume file: None
Next: Phase 4 (Output Modes)

View file

@ -1,151 +0,0 @@
# Architecture
**Analysis Date:** 2026-02-04
## Pattern Overview
**Overall:** Hub-and-spoke service orchestration with API-driven infrastructure management.
**Key Characteristics:**
- Centralized management container (VMID 102 - mgmt) coordinating all infrastructure
- Layered abstraction: CLI helpers → REST APIs → external services
- Event-driven notifications (Telegram bot bridges management layer to user)
- Credential-based authentication for all service integrations
## Layers
**Management Layer:**
- Purpose: Orchestration and automation entry point for the homelab
- Location: `/home/mikkel/homelab` (git repository in mgmt container)
- Contains: CLI helper scripts (`~/bin/*`), Telegram bot, documentation
- Depends on: Remote SSH access to container/VM IP addresses, Proxmox API, service REST APIs
- Used by: Claude Code automation, Telegram bot commands, cron jobs
**API Integration Layer:**
- Purpose: Abstracts service APIs into simple CLI interfaces
- Location: `~/bin/` (pve, npm-api, dns, pbs, beszel, kuma, updates, telegram)
- Contains: Python and Bash wrappers around external service APIs
- Depends on: Proxmox API, Nginx Proxy Manager API, Technitium DNS API, PBS REST API, Beszel PocketBase, Uptime Kuma REST API, Telegram Bot API
- Used by: Telegram bot, CI/CD automation, interactive CLI usage
**Service Layer:**
- Purpose: Individual hosted services providing infrastructure capabilities
- Location: Distributed across containers (NPM, DNS, PBS, Dockge, Forgejo, etc.)
- Contains: Docker containers, LXC services, backup systems
- Depends on: PVE host networking, shared storage, external integrations
- Used by: API layer, end-user access via web UI or CLI
**Data & Communication Layer:**
- Purpose: State persistence and inter-service communication
- Location: Shared storage (`~/stuff` - ZFS bind mount), credential files (`~/.config/*/credentials`)
- Contains: Backup data, configuration files, Telegram inbox/images/files
- Depends on: PVE ZFS dataset, filesystem access
- Used by: All services, backup/restore operations
## Data Flow
**Infrastructure Query Flow (e.g., `pve list`):**
1. User invokes CLI helper: `~/bin/pve list`
2. Helper loads credentials from `~/.config/pve/credentials`
3. Helper authenticates to Proxmox API at `core.georgsen.dk:8006` using token auth
4. Proxmox returns cluster resource state (VMs/containers)
5. Helper formats and displays output to user
**Service Management Flow (e.g., `dns add myhost 10.5.0.50`):**
1. User invokes: `~/bin/dns add myhost 10.5.0.50`
2. DNS helper loads credentials and authenticates to Technitium at `10.5.0.2:5380`
3. Helper makes HTTP API call to add A record
4. Technitium stores in zone file and updates DNS records
5. Helper confirms success to user
**Backup Status Flow (e.g., `/pbs` command in Telegram):**
1. Telegram user sends `/pbs` command
2. Bot handler in `telegram/bot.py` executes `~/bin/pbs status`
3. PBS helper SSH's to `10.5.0.6` as root
4. SSH command reads backup logs and GC status from PBS container
5. Helper formats human-readable output
6. Bot sends result back to Telegram chat (truncated to 4000 chars for Telegram API limit)
**State Management:**
- Credentials: Stored in `~/.config/*/credentials` files (sourced at runtime)
- Telegram messages: Appended to `telegram/inbox` file for Claude to read
- Media uploads: Saved to `telegram/images/` and `telegram/files/` with timestamps
- Authorization: `telegram/authorized_users` file maintains allowlist of chat IDs
## Key Abstractions
**Helper Scripts (API Adapters):**
- Purpose: Translate user intent into remote service API calls
- Examples: `~/bin/pve`, `~/bin/dns`, `~/bin/pbs`, `~/bin/beszel`, `~/bin/kuma`
- Pattern: Load credentials → authenticate → execute command → format output
- Language: Mix of Python (pve, updates, telegram) and Bash (dns, pbs, beszel, kuma)
**Telegram Bot:**
- Purpose: Provides two-way interactive access to management functions
- Implementation: `telegram/bot.py` using python-telegram-bot library
- Pattern: Command handlers dispatch to helper scripts, results sent back to user
- Channels: Commands (e.g., `/pbs`), free-text messages saved to inbox, photos/files downloaded
**Service Registry (Documentation):**
- Purpose: Centralized reference for service locations and access patterns
- Implementation: `homelab-documentation.md` and `CLAUDE.md`
- Contents: IP addresses, ports, authentication methods, SSH targets, network topology
## Entry Points
**CLI Usage (Direct):**
- Location: `~/bin/{helper}` scripts
- Triggers: Manual invocation by user or cron jobs
- Responsibilities: Execute service operations, format output, validate inputs
**Telegram Bot:**
- Location: `telegram/bot.py` (systemd service: `telegram-bot.service`)
- Triggers: Telegram message or command from authorized user
- Responsibilities: Authenticate user, route command/message, execute via helper scripts, send response
**Automation Scripts:**
- Location: Potential cron jobs or scheduled tasks
- Triggers: Time-based scheduling
- Responsibilities: Execute periodic management tasks (e.g., backup checks, updates)
**Manual Execution:**
- Location: Interactive shell in mgmt container
- Triggers: User SSH session
- Responsibilities: Run helpers for ad-hoc infrastructure management
## Error Handling
**Strategy:** Graceful degradation with informative messaging.
**Patterns:**
- CLI helpers return non-zero exit codes on failure (exception handling in Python, `set -e` in Bash)
- Timeout protection: Telegram bot commands have 30-second timeout (configurable per command)
- Service unavailability: Caught in try/except blocks, fall back to next option (e.g., `pve` tries LXC first, then QEMU)
- Credential failures: Load-time validation, clear error message if credentials file missing
- Network errors: SSH timeouts, API connection failures logged to stdout/stderr
## Cross-Cutting Concerns
**Logging:**
- Telegram bot uses Python stdlib logging (INFO level, writes to systemd journal)
- CLI helpers write directly to stdout/stderr
- PBS helper uses SSH error output for remote command failures
**Validation:**
- Telegram bot validates hostnames (alphanumeric + dots + hyphens only) before ping
- DNS helper validates that name and IP are provided before API call
- PVE helper validates VMID is integer before API call
**Authentication:**
- Credentials stored in `~/.config/{service}/credentials` as simple key=value files
- Sourced at runtime (Bash) or read at startup (Python)
- Token-based auth for Proxmox (no password in memory)
- Basic auth for DNS and other REST APIs (credentials URL-encoded if needed)
- Bearer token for Uptime Kuma (API key-based)
---
*Architecture analysis: 2026-02-04*

View file

@ -1,272 +0,0 @@
# Codebase Concerns
**Analysis Date:** 2026-02-04
## Tech Debt
**IP Addressing Scheme Inconsistency:**
- Issue: Container IPs don't follow VMID convention. NPM (VMID 100) is at .1, Dockge (VMID 101) at .10, PBS (VMID 106) at .6, instead of matching .100, .101, .106
- Files: `homelab-documentation.md` (lines 139-159)
- Impact: Manual IP tracking required, DNS records must be maintained separately, new containers require manual IP assignment planning, documentation drift risk
- Fix approach: Execute TODO task to reorganize vmbr1 to VMID=IP scheme (.100-.253 range), update NPM proxy hosts, DNS records (lab.georgsen.dk), and documentation
**DNS Record Maintenance Manual:**
- Issue: Internal DNS (Technitium) and external DNS (dns.services) require manual updates when IPs/domains change
- Files: `homelab-documentation.md` (lines 432-449), `~/bin/dns` script
- Impact: Risk of records becoming stale after IP migrations, no automation for new containers
- Fix approach: Implement `dns-services` helper script (TODO.md line 27) with API integration for automatic updates
**Unimplemented Helper Scripts:**
- Issue: `dns-services` API integration promised in TODO but not implemented
- Files: `TODO.md` (line 27), `dns-services/credentials` exists but script doesn't
- Impact: Manual dns.services operations required, cannot automate domain setup
- Fix approach: Create `~/bin/dns-services` wrapper (endpoint documented in TODO)
**Ping Capability Missing on 12 Containers:**
- Issue: Unprivileged LXC containers drop cap_net_raw, breaking ping on VMIDs 100, 101, 102, 103, 104, 105, 107, 108, 110, 111, 112, 114, 115, 1000
- Files: `TODO.md` (lines 31-33), `CLAUDE.md` (line 252-255)
- Impact: Health monitoring fails, network diagnostics broken, Telegram bot status checks incomplete (bot has no ping on home network itself), Uptime Kuma monitors may show false negatives
- Fix approach: Run `setcap cap_net_raw+ep /bin/ping` on each container (must be reapplied after iputils-ping updates)
**Version Pinning Warnings:**
- Issue: CLAUDE.md section 227-241 warns about hardcoded versions becoming stale
- Files: `homelab-documentation.md` (lines 217, 228, 239), `~/bin/updates` script shows version checking is implemented but some configs have `latest` tags
- Impact: Security patch delays, incompatibilities when manually deploying services
- Fix approach: Always query GitHub API for latest versions (updates script does this correctly for discovery phase)
## Known Bugs
**Telegram Bot Inbox Storage Race Condition:**
- Symptoms: Concurrent message writes could corrupt inbox file, messages may be lost
- Files: `telegram/bot.py` (lines 39, 200-220 message handling), `~/bin/telegram` (lines 73-79 clear command)
- Trigger: Multiple rapid messages from admin or concurrent bot operations
- Workaround: Clear inbox frequently and check for corruption; bot currently appends to file without locking
- Root cause: File-based inbox with no atomic writes or mutex protection
**PBS Backup Mount Dependency Not Enforced:**
- Symptoms: PBS services may start before Synology CIFS mount is available, backup path unreachable
- Files: `homelab-documentation.md` (lines 372-384), container 106 config
- Trigger: System reboot when Tailscale connectivity is delayed
- Workaround: Manual restart of proxmox-backup-proxy and proxmox-backup services
- Root cause: systemd dependency chain `After=mnt-synology.mount` doesn't guarantee mount is ready at service start time
**DragonflyDB Password in Plain Text in Documentation:**
- Symptoms: Database password visible in compose file and documentation
- Files: `homelab-documentation.md` (lines 248-250)
- Trigger: Anyone reading docs or inspecting git history
- Workaround: Consider password non-critical if container only accessible on internal network
- Root cause: Password stored in version control and documentation rather than .env or secrets file
**NPM Proxy Host 18 (mh.datalos.dk) Not Configured:**
- Symptoms: Domain not resolving despite DNS record missing and NPM entry (ID 18) mentioned in TODO
- Files: `TODO.md` (line 29), `homelab-documentation.md` (proxy hosts section)
- Trigger: Accessing mh.datalos.dk from browser
- Workaround: Must be configured manually via NPM web UI
- Root cause: Setup referenced in TODO but not completed
## Security Considerations
**Exposed Credentials in Git History:**
- Risk: Credential files committed (credentials, SSH keys, telegram token examples)
- Files: All credential files in `telegram/`, `pve/`, `forgejo/`, `dns/`, `dockge/`, `uptime-kuma/`, `beszel/`, `dns-services/` directories (8+ files)
- Current mitigation: Files are .gitignored in main repo but present in working directory
- Recommendations: Rotate all credentials listed, audit git log for historical commits, use HashiCorp Vault or pass for credential storage, document secret rotation procedure
**Public IP Hardcoded in Documentation:**
- Risk: Home IP 83.89.248.247 exposed in multiple locations
- Files: `homelab-documentation.md` (lines 98, 102), `CLAUDE.md` (line 256)
- Current mitigation: IP is already public/static, used for whitelist access
- Recommendations: Document that whitelisting this IP is intentional, no other PII mixed in
**Telegram Bot Authorization Model Too Permissive:**
- Risk: First user to message bot becomes admin automatically with no verification
- Files: `telegram/bot.py` (lines 86-95)
- Current mitigation: Bot only responds to authorized user, requires bot discovery
- Recommendations: Require multi-factor authorization on first start (e.g., PIN from environment variable), implement audit logging of all bot commands
**Database Credentials in Environment Variables:**
- Risk: DragonflyDB password passed via Docker command line (visible in `docker ps`, logs, process listings)
- Files: `homelab-documentation.md` (line 248)
- Current mitigation: Container only accessible on internal vmbr1 network
- Recommendations: Use Docker secrets or mounted .env files instead of command-line arguments
**Synology CIFS Credentials in fstab:**
- Risk: SMB credentials stored in plaintext in fstab file with mode 0644 (world-readable)
- Files: `homelab-documentation.md` (line 369)
- Current mitigation: Mounted on container-only network, requires PBS container access
- Recommendations: Use credentials file with mode 0600, rotate credentials regularly, monitor file permissions
**SSH Keys Included in Documentation:**
- Risk: Public SSH keys hardcoded in CLAUDE.md setup examples
- Files: `CLAUDE.md` and `homelab-documentation.md` SSH key examples
- Current mitigation: Public keys only (not private), used for container access
- Recommendations: Rotate these keys if documentation is ever exposed, don't include in public repos
## Performance Bottlenecks
**Single NVMe Storage (RAID0) Without Local Redundancy:**
- Problem: Core server has 2x1TB NVMe in RAID0 (striped, no redundancy)
- Files: `homelab-documentation.md` (lines 17-24)
- Cause: Cost optimization for Hetzner dedicated server
- Impact: Single drive failure = total data loss; database corruption risk from RAID0 stripe inconsistency
- Improvement path: (1) Ensure PBS backups run successfully to Synology, (2) Test backup restore procedure monthly, (3) Plan upgrade path if budget allows (3-way mirror or RAID1)
**Backup Dependency on Single Tailscale Gateway:**
- Problem: All PBS backups to Synology go through Tailscale relay (10.5.0.134), single point of failure
- Files: `homelab-documentation.md` (lines 317-427)
- Cause: Synology only accessible via Tailscale network, relay container required
- Impact: Tailscale relay downtime = backup failure; no local backup option
- Improvement path: (1) Add second Tailscale relay for redundancy, (2) Explore PBS direct SSH backup mode, (3) Monitor relay container health
**DNS Queries All Route Through Single Technitium Container:**
- Problem: All internal DNS (lab.georgsen.dk) goes through container 115, DHCP defaults to this server
- Files: `homelab-documentation.md` (lines 309-315), container config
- Cause: Single container architecture
- Impact: DNS outage = network unreachable (containers can't resolve any hostnames)
- Improvement path: (1) Deploy DNS replica on another container, (2) Configure DHCP to use multiple DNS servers, (3) Set upstream DNS fallback
**Script Execution via Telegram Bot with Subprocess Timeout:**
- Problem: Bot runs helper scripts with 30-second timeout, commands like PBS backup query can exceed limit
- Files: `telegram/bot.py` (lines 60-78, 191)
- Cause: Helper scripts do remote SSH execution, network latency variable
- Impact: Commands truncated mid-execution, incomplete status reports, timeouts on slow networks
- Improvement path: Increase timeout selectively, implement command queuing, cache results for frequently-called commands
## Fragile Areas
**Installer Shell Script with Unimplemented Sections:**
- Files: `pve-homelab-kit/install.sh` (495+ lines with TODO comments)
- Why fragile: Multiple TODO placeholders indicate incomplete implementation; wizard UI done but ~30 implementation TODOs remain
- Safe modification: (1) Don't merge branches without running through full install, (2) Test each section independently, (3) Add shell `set -e` error handling
- Test coverage: Script has no tests, no dry-run mode, no rollback capability
**Container Configuration Manual in LXC Config Files:**
- Files: `/etc/pve/lxc/*.conf` across Proxmox host (not in repo, not version controlled)
- Why fragile: Critical settings (features, ulimits, AppArmor) outside version control, drift risk after manual fixes
- Safe modification: Keep backup copies in `homelab-documentation.md` (already done for PBS), automate via Terraform/Ansible if future containers added
- Test coverage: Config changes only tested on live container (no staging env)
**Helper Scripts with Hardcoded IPs and Paths:**
- Files: `~/bin/updates` (lines 16-17, 130), `~/bin/pbs`, `~/bin/pve`, `~/bin/dns`
- Why fragile: DOCKGE_HOST, PVE_HOST hardcoded; if IPs change during migration, all scripts must be updated manually
- Safe modification: Extract to config file (e.g., `/etc/homelab/config.sh` or environment variables)
- Test coverage: Scripts tested against live infrastructure only
**SSH-Based Container Access Without Key Verification:**
- Files: `~/bin/updates` (lines 115-131), scripts use `-q` flag suppressing host key checks
- Why fragile: `ssh -q` disables StrictHostKeyChecking, vulnerable to MITM; scripts assume SSH keys are pre-installed
- Safe modification: Add `-o StrictHostKeyChecking=accept-new` to verify on first connection, document key distribution procedure
- Test coverage: SSH connectivity assumed working
**Backup Monitoring Without Alerting on Failure:**
- Files: `~/bin/pbs`, `telegram/bot.py` (status command only, no automatic failure alerts)
- Why fragile: Failed backups only visible if manually checked; no monitoring of backup completion
- Safe modification: Add systemd timer to check PBS status hourly, send Telegram alert on failure
- Test coverage: Manual checks only
## Scaling Limits
**Container IP Space Exhaustion:**
- Current capacity: vmbr1 is /24 (256 IPs, .0-.255), DHCP range .100-.200 (101 IPs available for DHCP), static IPs scattered
- Limit: After ~150 containers, IP fragmentation becomes difficult to manage; DHCP range conflicts with static allocation
- Scaling path: (1) Implement TODO IP scheme (VMID=IP), (2) Expand to /23 (512 IPs) if more containers needed, (3) Use vmbr2 (vSwitch) for secondary network
**Backup Datastore Single Synology Volume:**
- Current capacity: Synology `pbs-backup` share unknown size (not documented)
- Limit: Unknown when share becomes full; no warning system implemented
- Scaling path: (1) Document share capacity in homelab-documentation.md, (2) Add usage monitoring to `beszel` or Uptime Kuma, (3) Plan expansion to second NAS
**Dockge Stack Limit:**
- Current capacity: Dockge container 101 running ~8-10 stacks visible in documentation
- Limit: No documented resource constraints; may hit CPU/RAM limits on Hetzner AX52 with more containers
- Scaling path: (1) Monitor Dockge resource usage via Beszel, (2) Profile Dragonfly memory usage, (3) Plan VM migration for heavy workloads
**DNS Query Throughput:**
- Current capacity: Single Technitium container handling all internal DNS
- Limit: Container CPU/RAM limits unknown; no QPS monitoring
- Scaling path: (1) Add DNS replica, (2) Monitor query latency, (3) Profile Technitium logs for slow queries
## Dependencies at Risk
**Technitium DNS (Unmaintained Risk):**
- Risk: TechnitiumSoftware/DnsServer has irregular commit history; last significant release early 2024
- Impact: Security fixes may be delayed; compatibility with newer Linux kernels unknown
- Migration plan: (1) Profile current Technitium features used, (2) Evaluate CoreDNS or Dnsmasq alternatives, (3) Plan gradual migration with dual DNS
**DragonflyDB as Redis Replacement:**
- Risk: Dragonfly smaller ecosystem than Redis; breaking changes possible in minor updates
- Impact: Applications expecting Redis behavior may fail; less community support for issues
- Migration plan: (1) Pin Dragonfly version in compose file (currently `latest`), (2) Test upgrades in dev environment, (3) Document any API incompatibilities found
**Dockge (Single Maintainer Project):**
- Risk: Dockge maintained by one developer (louislam); bus factor high
- Impact: If maintainer loses interest, fixes and features stop; dependency on their release schedule
- Migration plan: (1) Use Dockge for UI only, don't depend on it for production orchestration, (2) Keep docker-compose expertise on team, (3) Consider Portainer as fallback alternative
**Forgejo (Younger than Gitea):**
- Risk: Forgejo is recent fork of Gitea; database schema changes possible in patch versions
- Impact: Upgrades may require manual migrations; data loss risk if migration fails
- Migration plan: (1) Test Forgejo upgrades on backup copy first, (2) Document upgrade procedure, (3) Keep Gitea as fallback if Forgejo breaks
## Missing Critical Features
**No Automated Health Monitoring/Alerting:**
- Problem: Status checks exist (via Telegram bot, Uptime Kuma) but no automatic alerts when services fail
- Blocks: Cannot sleep soundly; must manually check status to detect outages
- Implementation path: (1) Add Uptime Kuma HTTP monitors for all public services, (2) Create Telegram alert webhook, (3) Monitor PBS backup success daily
**No Automated Certificate Renewal Verification:**
- Problem: NPM handles Let's Encrypt renewal, but no monitoring for renewal failures
- Blocks: Certificates could expire silently; discovered during service failures
- Implementation path: (1) Add Uptime Kuma alert for HTTP 200 on https://* services, (2) Add monthly certificate expiry check, (3) Set up renewal failure alerts
**No Disaster Recovery Runbook:**
- Problem: Procedures for rescuing locked-out server (Hetzner Rescue Mode) not documented
- Blocks: If SSH access lost, cannot recover without external procedures
- Implementation path: (1) Document Hetzner Rescue Mode recovery steps, (2) Create network reconfiguration backup procedures, (3) Test rescue mode monthly
**No Change Log / Audit Trail:**
- Problem: Infrastructure changes not logged; drift from documentation occurs silently
- Blocks: Unknown who made changes, when, and why; cannot track config evolution
- Implementation path: (1) Add git commit requirement for all manual changes, (2) Create change notification to Telegram, (3) Weekly drift detection report
**No Secrets Management System:**
- Problem: Credentials scattered across plaintext files, git history, and documentation
- Blocks: Cannot safely share access with team members; no credential rotation capability
- Implementation path: (1) Deploy HashiCorp Vault or Vaultwarden, (2) Migrate all secrets to vault, (3) Create credential rotation procedures
## Test Coverage Gaps
**PBS Backup Restore Not Tested:**
- What's not tested: Full restore procedures; assumed to work but never verified
- Files: `homelab-documentation.md` (lines 325-392), no restore test documented
- Risk: If restore needed, may discover issues during actual data loss emergency
- Priority: HIGH - Add monthly restore test procedure (restore single VM to temporary location, verify data integrity)
**Network Failover Scenarios:**
- What's not tested: What happens if Tailscale relay (1000) goes down, if NPM container restarts, if DNS returns SERVFAIL
- Files: No documented failure scenarios
- Risk: Unknown recovery time; applications may hang instead of failing gracefully
- Priority: HIGH - Document and test each service's failure mode
**Helper Script Error Handling:**
- What's not tested: Scripts with SSH timeouts, host unreachable, malformed responses
- Files: `~/bin/updates`, `~/bin/pbs`, `~/bin/pve` (error handling exists but not tested against failures)
- Risk: Silent failures could go unnoticed; incomplete output returned to caller
- Priority: MEDIUM - Add error injection tests (mock SSH failures)
**Telegram Bot Commands Under Load:**
- What's not tested: Bot response when running concurrent commands, or when helper scripts timeout
- Files: `telegram/bot.py` (no load tests, concurrency behavior unknown)
- Risk: Bot may hang or lose messages under heavy load
- Priority: MEDIUM - Add load test with 10+ concurrent commands
**Container Migration (VMID IP Scheme Change):**
- What's not tested: Migration of 15+ containers to new IP scheme; full rollback procedures
- Files: `TODO.md` (line 5-15, planned but not executed)
- Risk: Single IP misconfiguration could take multiple services offline
- Priority: HIGH - Create detailed migration runbook with rollback at each step before executing
---
*Concerns audit: 2026-02-04*

View file

@ -1,274 +0,0 @@
# Coding Conventions
**Analysis Date:** 2026-02-04
## Naming Patterns
**Files:**
- Python files: lowercase with underscores (e.g., `bot.py`, `credentials`)
- Bash scripts: lowercase with hyphens (e.g., `npm-api`, `uptime-kuma`)
- Helper scripts in `~/bin/`: all lowercase, no extension (e.g., `pve`, `pbs`, `dns`)
**Functions:**
- Python: snake_case (e.g., `cmd_status()`, `get_authorized_users()`, `run_command()`)
- Bash: snake_case with `cmd_` prefix for command handlers (e.g., `cmd_status()`, `cmd_tasks()`)
- Bash: auxiliary functions also use snake_case (e.g., `ssh_pbs()`, `get_token()`)
**Variables:**
- Python: snake_case for local/module vars (e.g., `authorized_users`, `output_lines`)
- Python: UPPERCASE for constants (e.g., `TOKEN`, `INBOX_FILE`, `AUTHORIZED_FILE`, `NODE`, `PBS_HOST`)
- Bash: UPPERCASE for environment variables and constants (e.g., `PBS_HOST`, `TOKEN`, `BASE`, `DEFAULT_ZONE`)
- Bash: lowercase for local variables (e.g., `hours`, `cutoff`, `status_icon`)
**Types/Classes:**
- Python: PascalCase for imported classes (e.g., `ProxmoxAPI`, `Update`, `Application`)
- Dictionary/config keys: lowercase with hyphens or underscores (e.g., `token_name`, `max-mem`)
## Code Style
**Formatting:**
- No automated formatter detected in codebase
- Python: PEP 8 conventions followed informally
- 4-space indentation
- Max line length ~90-100 characters (observed in practice)
- Blank lines: 2 lines before module-level functions, 1 line before methods
- Bash: 4-space indentation (observed)
**Linting:**
- No linting configuration detected (no .pylintrc, .flake8, .eslintrc)
- Code style is manually maintained
**Docstrings:**
- Python: Triple-quoted strings at module level describing purpose
- Example from `telegram/bot.py`:
```python
"""
Homelab Telegram Bot
Two-way interactive bot for homelab management and notifications.
"""
```
- Python: Function docstrings used for major functions
- Single-line format for simple functions
- Example: `"""Handle /start command - first contact with bot."""`
- Example: `"""Load authorized user IDs."""`
## Import Organization
**Order:**
1. Standard library imports (e.g., `sys`, `os`, `json`, `subprocess`)
2. Third-party imports (e.g., `ProxmoxAPI`, `telegram`, `pocketbase`)
3. Local imports (rarely used in this codebase)
**Path Aliases:**
- No aliases detected
- Absolute imports used throughout
**Credential Loading Pattern:**
All scripts that need credentials follow the same pattern:
```python
# Load credentials
creds_path = Path.home() / ".config" / <service> / "credentials"
creds = {}
with open(creds_path) as f:
for line in f:
if '=' in line:
key, value = line.strip().split('=', 1)
creds[key] = value
```
Or in Bash:
```bash
source ~/.config/dns/credentials
```
## Error Handling
**Patterns:**
- Python: Try-except with broad exception catching (bare `except:` used in `pve` script lines 70, 82, 95, 101)
- Not ideal but pragmatic for CLI tools that need to try multiple approaches
- Example from `pve`:
```python
try:
status = pve.nodes(NODE).lxc(vmid).status.current.get()
# ...
return
except:
pass
```
- Python: Explicit exception handling in telegram bot
- Catches `subprocess.TimeoutExpired` specifically in `run_command()` function
- Example from `telegram/bot.py`:
```python
try:
result = subprocess.run(...)
output = result.stdout or result.stderr or "No output"
if len(output) > 4000:
output = output[:4000] + "\n... (truncated)"
return output
except subprocess.TimeoutExpired:
return "Command timed out"
except Exception as e:
return f"Error: {e}"
```
- Bash: Set strict mode with `set -e` in some scripts (`dns` script line 12)
- Causes script to exit on first error
- Bash: No error handling in most scripts (`pbs`, `beszel`, `kuma`)
- Relies on exit codes implicitly
**Return Value Handling:**
- Python: Functions return data directly or None on failure
- Example from `pbs` helper: Returns JSON-parsed data or string output
- Example from `pve`: Returns nothing (prints output), but uses exceptions for flow control
- Python: Command runner returns error strings: `"Command timed out"`, `"Error: {e}"`
## Logging
**Framework:**
- Python: Standard `logging` module
- Configured in `telegram/bot.py` lines 18-22:
```python
logging.basicConfig(
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
level=logging.INFO
)
logger = logging.getLogger(__name__)
```
- Log level: INFO
- Format includes timestamp, logger name, level, message
**Patterns:**
- `logger.info()` for general informational messages
- Example: `logger.info("Starting Homelab Bot...")`
- Example: `logger.info(f"Inbox message from {user.first_name}: {message[:50]}...")`
- Example: `logger.info(f"Photo saved from {user.first_name}: {filepath}")`
- Bash: Uses `echo` for output, no structured logging
- Informational messages for user feedback
- Error messages sent to stdout (not stderr)
## Comments
**When to Comment:**
- Module-level docstrings at top of file (required for all scripts)
- Usage examples in module docstrings (e.g., `pve`, `pbs`, `kuma`)
- Inline comments for complex logic (e.g., in `pbs` script parsing hex timestamps)
- Comments on tricky regex patterns (e.g., `pbs` tasks parsing)
**Bash Comments:**
- Header comment with script name, purpose, and usage (lines 1-10)
- Inline comments before major sections (e.g., `# Datastore info`, `# Storage stats`)
- No comments in simple expressions
**Python Comments:**
- Header comment with purpose (module docstring)
- Sparse inline comments except for complex sections
- Example from `telegram/bot.py` line 71: `# Telegram has 4096 char limit per message`
- Example from `pve` line 70: `# Try as container first`
## Function Design
**Size:**
- Python: Functions are generally 10-50 lines
- Smaller functions for simple operations (e.g., `is_authorized()` is 2 lines)
- Larger functions for command handlers that do setup + API calls (e.g., `status()` is 40 lines)
- Bash: Functions are typically 20-80 lines
- Longer functions acceptable for self-contained operations like `cmd_status()` in `pbs`
**Parameters:**
- Python: Explicit parameters, typically 1-5 parameters per function
- Optional parameters with defaults (e.g., `timeout: int = 30`, `port=45876`)
- Type hints not used consistently (some functions have them, many don't)
- Bash: Parameters passed as positional arguments
- Some functions take zero parameters and rely on global variables
- Example: `ssh_pbs()` in `pbs` uses global `$PBS_HOST`
**Return Values:**
- Python: Functions return data (strings, dicts, lists) or None
- Command handlers often return nothing (implicitly None)
- Helper functions return computed values (e.g., `is_authorized()` returns bool)
- Bash: Functions print output directly, return exit codes
- No explicit return values beyond exit codes
- Output captured by caller with `$()`
## Module Design
**Exports:**
- Python: All functions are module-level, no explicit exports
- `if __name__ == "__main__":` pattern used in all scripts to guard main execution
- Example from `beszel` lines 101-152
- Bash: All functions are script-level, called via case statement
- Main dispatch logic at bottom of script
- Example from `dns` lines 29-106: `case "$1" in ... esac`
**Async/Await (Telegram Bot Only):**
- Python telegram bot uses `asyncio` and `async def` for all handlers
- All command handlers are async (e.g., `async def start()`)
- Use `await` for async operations (e.g., `await update.message.reply_text()`)
- Example from `telegram/bot.py` lines 81-94:
```python
async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Handle /start command - first contact with bot."""
user = update.effective_user
chat_id = update.effective_chat.id
# ... async operations with await
```
**File Structure:**
- Single-file modules: Most helpers are single files
- `telegram/bot.py`: Main bot implementation with all handlers
- `/bin/` scripts: Each script is self-contained with helper functions + main dispatch
## Data Structures
**JSON/Config Files:**
- Credentials files: Simple `KEY=value` format (no JSON)
- PBS task logging: Uses hex-encoded UPID format, parsed with regex
- Telegram bot: Saves messages to text files with timestamp prefix
- JSON output: Parsed with `python3 -c "import sys, json; ..."` in Bash scripts
**Error Response Patterns:**
- API calls: Check for `.get('status') == 'ok'` or similar
- Command execution: Check `returncode == 0`, capture stdout/stderr
- API clients: Let exceptions bubble up, caught at command handler level
## Conditionals and Flow Control
**Python:**
- if/elif/else chains for command dispatch
- Simple truthiness checks: `if not user_id:`, `if not alerts:`
- Example from `telegram/bot.py` line 86-100: Authorization check pattern
**Bash:**
- case/esac for command dispatch (preferred)
- if [[ ]] with regex matching for parsing
- Example from `pbs` lines 122-143: Complex regex with BASH_REMATCH array
## Security Patterns
**Credential Management:**
- Credentials stored in `~/.config/<service>/credentials` with restricted permissions (not enforced in code)
- Telegram token loaded from file, not environment
- Credentials never logged or printed
**Input Validation:**
- Bash: Basic validation with isalnum() check in `ping_host()` function
- Example: `if not host.replace('.', '').replace('-', '').isalnum():`
- Bash: Whitelist command names from case statements
- No SQL injection risk (no databases used directly)
**Shell Injection:**
- Bash scripts use quoted variables appropriately
- Some inline Python in Bash uses string interpolation (potential risk)
- Example from `dns` lines 31-37: `curl ... | python3 -c "..."` with variable interpolation
---
*Convention analysis: 2026-02-04*

View file

@ -1,261 +0,0 @@
# External Integrations
**Analysis Date:** 2026-02-04
## APIs & External Services
**Hypervisor Management:**
- **Proxmox VE (PVE)** - Cluster/node management
- SDK/Client: `proxmoxer` v2.2.0 (Python)
- Auth: Token-based (`root@pam!mgmt` token)
- Config: `~/.config/pve/credentials`
- Helper: `~/bin/pve` (list, status, start, stop, create-ct)
- Endpoint: https://65.108.14.165:8006 (local host core.georgsen.dk)
**Backup Management:**
- **Proxmox Backup Server (PBS)** - Centralized backup infrastructure
- API: REST over HTTPS at 10.5.0.6:8007
- Auth: Token-based (`root@pam!pve` token)
- Helper: `~/bin/pbs` (status, backups, tasks, errors, gc, snapshots, storage)
- Targets: core.georgsen.dk, pve01.warradejendomme.dk, pve02.warradejendomme.dk namespaces
- Datastore: Synology NAS via CIFS at 100.105.26.130 (Tailscale)
**DNS Management:**
- **Technitium DNS** - Internal DNS with API
- API: REST at http://10.5.0.2:5380/api/
- Auth: Username/password based
- Config: `~/.config/dns/credentials`
- Helper: `~/bin/dns` (list, records, add, delete, lookup)
- Internal zone: `lab.georgsen.dk`
- Upstream: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9)
**Monitoring APIs:**
- **Uptime Kuma** - Status page & endpoint monitoring
- API: HTTP at 10.5.0.10:3001
- SDK/Client: `uptime-kuma-api` v1.2.1 (Python)
- Auth: Username/password login
- Config: `~/.config/uptime-kuma/credentials`
- Helper: `~/bin/kuma` (list, info, add-http, add-port, add-ping, delete, pause, resume)
- URL: https://status.georgsen.dk
- **Beszel** - Server metrics dashboard
- Backend: PocketBase REST API at 10.5.0.10:8090
- SDK/Client: `pocketbase` v0.15.0 (Python)
- Auth: Admin email/password
- Config: `~/.config/beszel/credentials`
- Helper: `~/bin/beszel` (list, status, add, delete, alerts)
- URL: https://dashboard.georgsen.dk
- Agents: core (10.5.0.254), PBS (10.5.0.6), Dockge (10.5.0.10 + Docker stats)
- Data retention: 30 days (automatic)
**Reverse Proxy & SSL:**
- **Nginx Proxy Manager (NPM)** - Reverse proxy with SSL
- API: JSON-RPC style (internal Docker API)
- Helper: `~/bin/npm-api` (--host-list, --host-create, --host-delete, --cert-list)
- Config: `~/.config/npm/npm-api.conf` (custom API wrapper)
- UI: http://10.5.0.1:81 (admin panel)
- SSL Provider: Let's Encrypt (HTTP-01 challenge)
- Access Control: NPM Access Lists (ID 1: "home_only" whitelist 83.89.248.247)
**Git/Version Control:**
- **Forgejo** - Self-hosted Git server
- API: REST at 10.5.0.14:3000/api/v1/
- Auth: API token based
- Config: `~/.config/forgejo/credentials`
- URL: https://git.georgsen.dk
- Repo: `git@10.5.0.14:mikkel/homelab.git`
- Version: v10.0.1
**Data Stores:**
- **DragonflyDB** - Redis-compatible in-memory store
- Host: 10.5.0.10 (Docker in Dockge)
- Port: 6379
- Protocol: Redis protocol
- Auth: Password protected (`nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI`)
- Client: redis-cli or any Redis library
- Usage: Session/cache storage
- **PostgreSQL** - Relational database
- Host: 10.5.0.109 (VMID 103)
- Default port: 5432
- Managed by: Community (Proxmox LXC community images)
- Usage: Sentry system and other applications
## Data Storage
**Databases:**
- **PostgreSQL 13+** (VMID 103)
- Connection: `postgresql://user@10.5.0.109:5432/dbname`
- Client: psql (CLI) or any PostgreSQL driver
- Usage: Sentry defense intelligence system, application databases
- **DragonflyDB** (Redis-compatible)
- Connection: `redis://10.5.0.10:6379` (with auth)
- Client: redis-cli or Python redis library
- Backup: Enabled in Docker config, persists to `./data/`
- **Redis** (VMID 104, deprecated in favor of DragonflyDB)
- Host: 10.5.0.111
- Status: Still active but DragonflyDB preferred
**File Storage:**
- **Local Filesystem:** Each container has ZFS subvolume storage at /
- **Shared Storage (ZFS):** `/shared/mikkel/stuff` bind-mounted into containers
- PVE: `rpool/shared/mikkel` dataset
- mgmt (102): `~/stuff` with backup=1 (included in PBS backups)
- dev (111): `~/stuff` (shared access)
- general (113): `~/stuff` (shared access)
- SMB Access: `\\mgmt\stuff` via Tailscale MagicDNS
**Backup Target:**
- **Synology NAS** (home network)
- Tailscale IP: 100.105.26.130
- Mount: `/mnt/synology` on PBS
- Protocol: CIFS/SMB 3.0
- Share: `/volume1/pbs-backup`
- UID mapping: Mapped to admin (squash: map all)
## Authentication & Identity
**Auth Providers:**
- **Proxmox PAM** - System-based authentication for PVE/PBS
- Users: root@pam, other system users
- Token auth: `root@pam!mgmt` (PVE), `root@pam!pve` (PBS)
**SSH Key Authentication:**
- **Ed25519 keys** for user access
- Key: `ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIOQrK06zVkfY6C1ec69kEZYjf8tC98icCcBju4V751i mikkel@georgsen.dk`
- Deployed to all containers at `~/.ssh/authorized_keys` and `/root/.ssh/authorized_keys`
**Telegram Bot Authentication:**
- **Telegram Bot Token** - Stored in `~/telegram/credentials`
- **Authorized Users:** Whitelist stored in `~/telegram/authorized_users` (chat IDs)
- **First user:** Auto-authorized on first `/start` command
- **Two-way messaging:** Text/photos/files saved to `~/telegram/inbox`
## Monitoring & Observability
**Error Tracking:**
- **Sentry** (custom defense intelligence system, VMID 105)
- Purpose: Monitor military contracting opportunities
- Databases: PostgreSQL (103) + Redis (104)
- Not a traditional error tracker - custom business intelligence system
**Metrics & Monitoring:**
- **Beszel**: Server CPU, RAM, disk usage metrics
- **Uptime Kuma**: HTTP, TCP port, ICMP ping monitoring
- **PBS**: Backup task logs, storage metrics, dedup stats
**Logs:**
- **PBS logs:** SSH queries via `~/bin/pbs`, stored on PBS container
- **Forgejo logs:** `/var/lib/forgejo/log/forgejo.log` (for fail2ban)
- **Telegram bot logs:** stdout to systemd service `telegram-bot.service`
- **Helper scripts:** Output to stdout, can be piped/redirected
## CI/CD & Deployment
**Hosting:**
- **Hetzner** (public cloud) - Primary: core.georgsen.dk (AX52)
- **Home Infrastructure** - Synology NAS for backups, future NUC cluster
- **Docker/Dockge** - Application deployment via Docker Compose (10.5.0.10)
**CI Pipeline:**
- **None detected** - Manual deployment via Dockge or container management
- **Version control:** Forgejo (self-hosted Git server)
- **Update checks:** `~/bin/updates` script checks for updates across services
- Tracked: dragonfly, beszel, uptime-kuma, snappymail, dockge, npm, forgejo, dns, pbs
**Deployment Tools:**
- **Dockge** - Docker Compose UI for stack management
- **PVE API** - Proxmox VE for container/VM provisioning
- **Helper scripts** - `~/bin/pve create-ct` for automated container creation
## Environment Configuration
**Required Environment Variables (in credential files):**
DNS (`~/.config/dns/credentials`):
```
DNS_HOST=10.5.0.2
DNS_PORT=5380
DNS_USER=admin
DNS_PASS=<password>
```
Proxmox (`~/.config/pve/credentials`):
```
host=65.108.14.165:8006
user=root@pam
token_name=mgmt
token_value=<token>
```
Uptime Kuma (`~/.config/uptime-kuma/credentials`):
```
KUMA_HOST=10.5.0.10
KUMA_PORT=3001
KUMA_USER=admin
KUMA_PASS=<password>
```
Beszel (`~/.config/beszel/credentials`):
```
BESZEL_HOST=10.5.0.10
BESZEL_PORT=8090
BESZEL_USER=admin@example.com
BESZEL_PASS=<password>
```
Telegram (`~/telegram/credentials`):
```
TELEGRAM_BOT_TOKEN=<token>
```
## Webhooks & Callbacks
**Incoming Webhooks:**
- **Uptime Kuma** - No webhook ingestion detected
- **PBS** - Backup completion tasks (internal scheduling, no external webhooks)
- **Forgejo** - No webhook configuration documented
**Outgoing Notifications:**
- **Telegram Bot** - Two-way messaging for homelab status
- Commands: /status, /pbs, /backups, /beszel, /kuma, /ping
- File uploads: Photos saved to `~/telegram/images/`, documents to `~/telegram/files/`
- Text inbox: Messages saved to `~/telegram/inbox` for Claude review
**Event-Driven:**
- **PBS Scheduling** - Daily backup tasks at 01:00, 01:30, 02:00 (core, pve01, pve02)
- **Prune/GC** - Scheduled at 21:00 (prune) and 22:30 (garbage collection)
## VPN & Remote Access
**Tailscale Network:**
- **Primary relay:** 10.5.0.134 + 10.9.1.10 (VMID 1000, exit node capable)
- **Tailscale IPs:**
- PBS: 100.115.85.120
- Synology NAS: 100.105.26.130
- dev: 100.85.227.17
- sentry: 100.83.236.113
- Friends' nodes: pve01 (100.99.118.54), pve02 (100.82.87.108)
- Other devices: mge-t14, mikflix, xanderryzen, nvr01, tailscalemg
**SSH Access Pattern:**
- All containers/VMs accessible via SSH from mgmt (102)
- SSH keys pre-deployed to all systems
- Tailscale used for accessing from external networks
## External DNS
**DNS Provider:** dns.services (Danish free DNS with API)
- Domains managed:
- georgsen.dk
- dataloes.dk
- microsux.dk
- warradejendomme.dk
- Used for external domain registration only
- Internal zone lookups go to Technitium (10.5.0.2)
---
*Integration audit: 2026-02-04*

View file

@ -1,152 +0,0 @@
# Technology Stack
**Analysis Date:** 2026-02-04
## Languages
**Primary:**
- **Bash** - Infrastructure automation, API wrappers, system integration
- Helper scripts at `~/bin/` for service APIs
- Installation and setup in `pve-homelab-kit/install.sh`
- **Python 3.12.3** - Management tools, monitoring, bot automation
- Virtual environment: `~/venv/` (activated with `source ~/venv/bin/activate`)
- Primary usage: API clients, Telegram bot, helper scripts
## Runtime
**Environment:**
- **Python 3.12.3** (system)
- **Bash 5+** (system shell)
**Package Manager:**
- **pip** v24.0 (Python package manager)
- Lockfile: Virtual environment at `~/venv/` (not traditional pip.lock)
## Frameworks
**Core Infrastructure:**
- **Proxmox VE** (v8.x) - Hypervisor/container platform on core.georgsen.dk
- **Proxmox Backup Server (PBS)** v2.x - Backup infrastructure (10.5.0.6:8007)
- **LXC Containers** - Primary virtualization method
- **KVM VMs** - Full VMs when needed (mail server VM 200)
- **Docker/Docker Compose** - Application deployment via Dockge (10.5.0.10)
**Application Frameworks:**
- **Nginx Proxy Manager (NPM)** v2.x - Reverse proxy, SSL (10.5.0.1:80/443/81)
- **Dockge** - Docker Compose stack management UI (10.5.0.10:5001)
- **Forgejo** v10.0.1 - Self-hosted Git server (10.5.0.14:3000)
- **Technitium DNS** - DNS server with API (10.5.0.2:5380)
**Monitoring & Observability:**
- **Uptime Kuma** - Service/endpoint monitoring (10.5.0.10:3001)
- **Beszel** - Server metrics dashboard (10.5.0.10:8090)
**Messaging:**
- **Stalwart Mail Server** - Mail server (VM 200, IP 65.108.14.164)
- **Snappymail** - Webmail UI (djmaze/snappymail:latest, 10.5.0.10:8888)
**Data Storage:**
- **DragonflyDB** - Redis-compatible in-memory datastore (10.5.0.10:6379)
- Password protected, used for session/cache storage
- **PostgreSQL 13+** (VMID 103, 10.5.0.109) - Community managed database
- **Redis/DragonflyDB** (VMID 104, 10.5.0.111) - Session/cache store
## Key Dependencies
**Python Packages (in ~/venv/):**
**Proxmox API:**
- `proxmoxer` v2.2.0 - Python API client for Proxmox VE
- File: `~/bin/pve` (list, status, start, stop, create-ct operations)
**Monitoring APIs:**
- `uptime-kuma-api` v1.2.1 - Uptime Kuma monitoring client
- File: `~/bin/kuma` (monitor management)
- `pocketbase` v0.15.0 - Beszel dashboard backend client
- File: `~/bin/beszel` (system monitoring)
**Communications:**
- `python-telegram-bot` v22.5 - Telegram Bot API
- File: `~/telegram/bot.py` (homelab management bot)
**HTTP Clients:**
- `requests` v2.32.5 - HTTP library for API calls
- `httpx` v0.28.1 - Async HTTP client
- `urllib3` v2.6.3 - Low-level HTTP client
**Networking & WebSockets:**
- `websocket-client` v1.9.0 - WebSocket client library
- `python-socketio` v5.16.0 - Socket.IO client
- `simple-websocket` v1.1.0 - WebSocket utilities
**Utilities:**
- `certifi` v2026.1.4 - SSL certificate verification
- `charset-normalizer` v3.4.4 - Character encoding detection
- `packaging` v25.0 - Version/requirement parsing
## Configuration
**Environment:**
- **Bash scripts:** Load credentials from `~/.config/{service}/credentials` files
- `~/.config/pve/credentials` - Proxmox API token
- `~/.config/dns/credentials` - Technitium DNS API
- `~/.config/beszel/credentials` - Beszel dashboard API
- `~/.config/uptime-kuma/credentials` - Uptime Kuma API
- `~/.config/forgejo/credentials` - Forgejo Git API
- **Python scripts:** Similar credential loading pattern
- **Telegram bot:** `~/telegram/credentials` file with `TELEGRAM_BOT_TOKEN`
**Build & Runtime Configuration:**
- Python venv activation: `source ~/venv/bin/activate`
- Helper scripts use shebang: `#!/home/mikkel/venv/bin/python3` or `#!/bin/bash`
- All scripts in `~/bin/` are executable and PATH-accessible
**Documentation:**
- `CLAUDE.md` - Development environment guidance
- `homelab-documentation.md` - Infrastructure reference (22KB, comprehensive)
- `README.md` - Quick container/service overview
- `TODO.md` - Pending maintenance tasks
## Platform Requirements
**Development/Management:**
- **Container:** LXC on Proxmox VE (VMID 102, "mgmt")
- **OS:** Debian-based Linux (venv requires Linux filesystem)
- **User:** mikkel (UID 1000, group georgsen GID 1000)
- **SSH:** Pre-installed keys for accessing other containers/VMs
- **Network:** Tailscale VPN for external access, internal vmbr1 (10.5.0.0/24)
**Production (Core Server):**
- **Provider:** Hetzner AX52 (Helsinki)
- **CPU:** AMD Ryzen 7 3700X
- **RAM:** 64GB ECC
- **Storage:** 2x 1TB NVMe (RAID0 via ZFS)
- **Public IP:** 65.108.14.165/26 (BGP routed)
- **Network bridges:** vmbr0 (public), vmbr1 (internal), vmbr2 (vSwitch)
**Backup Target:**
- **Synology NAS** (home network via Tailscale)
- **Protocol:** CIFS/SMB 3.0 over Tailscale
- **Mount point on PBS:** `/mnt/synology` (bind-mounted as datastore)
## Deployment & Access
**Service URLs:**
- **Proxmox Web UI:** https://65.108.14.165:8006 (public, home IP whitelisted)
- **NPM Admin:** http://10.5.0.1:81 (internal only)
- **DNS Admin:** https://dns.georgsen.dk (home IP whitelisted via access list)
- **PBS Web UI:** https://pbs.georgsen.dk:8007 (home IP whitelisted)
- **Dockge Admin:** https://dockge.georgsen.dk:5001 (home IP whitelisted)
- **Forgejo:** https://git.georgsen.dk (public)
- **Status Page:** https://status.georgsen.dk (Uptime Kuma)
- **Dashboard:** https://dashboard.georgsen.dk (Beszel metrics)
**SSL Certificates:**
- **Provider:** Let's Encrypt via NPM
- **Challenge method:** HTTP-01
- **Auto-renewal:** Handled by NPM
---
*Stack analysis: 2026-02-04*

View file

@ -1,228 +0,0 @@
# Codebase Structure
**Analysis Date:** 2026-02-04
## Directory Layout
```
/home/mikkel/homelab/
├── .planning/ # Planning and analysis artifacts
│ └── codebase/ # Codebase documentation (ARCHITECTURE.md, STRUCTURE.md, etc.)
├── .git/ # Git repository metadata
├── telegram/ # Telegram bot and message storage
│ ├── bot.py # Main bot implementation
│ ├── credentials # Telegram bot token (env var: TELEGRAM_BOT_TOKEN)
│ ├── authorized_users # Allowlist of chat IDs (one per line)
│ ├── inbox # Messages from admin (appended on each message)
│ ├── images/ # Photos sent via Telegram (timestamped)
│ └── files/ # Files sent via Telegram (timestamped)
├── pve-homelab-kit/ # PVE installation kit (subproject)
│ ├── install.sh # Installation script
│ ├── PROMPT.md # Project context for Claude
│ ├── .planning/ # Subproject planning docs
│ └── README.md # Setup instructions
├── npm/ # Nginx Proxy Manager configuration
│ └── npm-api.conf # API credentials reference
├── dockge/ # Docker Compose Manager configuration
│ └── credentials # Dockge API access
├── dns/ # Technitium DNS configuration
│ └── credentials # DNS API credentials (env vars: DNS_HOST, DNS_PORT, DNS_USER, DNS_PASS)
├── dns-services/ # DNS services configuration
│ └── credentials # Alternative DNS credentials
├── pve/ # Proxmox VE configuration
│ └── credentials # PVE API credentials (env vars: host, user, token_name, token_value)
├── beszel/ # Beszel monitoring dashboard
│ ├── credentials # Beszel API credentials
│ └── README.md # API and agent setup guide
├── forgejo/ # Forgejo Git server configuration
│ └── credentials # Forgejo API access
├── uptime-kuma/ # Uptime Kuma monitoring
│ ├── credentials # Kuma API credentials (env vars: KUMA_HOST, KUMA_PORT, KUMA_API_KEY)
│ ├── README.md # REST API reference and Socket.IO documentation
│ └── kuma_api_doc.png # Full API documentation screenshot
├── README.md # Repository overview and service table
├── CLAUDE.md # Claude Code guidance and infrastructure quick reference
├── homelab-documentation.md # Authoritative infrastructure documentation
├── TODO.md # Pending maintenance tasks
└── .gitignore # Git ignore patterns (credentials, sensitive files)
```
## Directory Purposes
**telegram/:**
- Purpose: Two-way Telegram bot for management commands and admin notifications
- Contains: Python bot code, token credentials, authorized user allowlist, message inbox, uploaded media
- Key files: `bot.py` (407 lines), `credentials`, `authorized_users`, `inbox`
- Not committed: `credentials`, `inbox`, `images/*`, `files/*` (in `.gitignore`)
**pve-homelab-kit/:**
- Purpose: Standalone PVE installation and initial setup toolkit
- Contains: Installation script, configuration examples, planning documents
- Key files: `install.sh` (executable automation), `PROMPT.md` (context for Claude), subproject `.planning/`
- Notes: Separate git repository (submodule or independent), for initial PVE deployment
**npm/:**
- Purpose: Nginx Proxy Manager reverse proxy configuration
- Contains: API credentials reference
- Key files: `npm-api.conf`
**dns/ & dns-services/:**
- Purpose: Technitium DNS server configuration (dual credential sets)
- Contains: API authentication credentials
- Key files: `credentials` (host, port, user, password)
**pve/:**
- Purpose: Proxmox VE API access credentials
- Contains: Token-based authentication data
- Key files: `credentials` (host, user, token_name, token_value)
**dockge/, forgejo/, beszel/, uptime-kuma/:**
- Purpose: Service-specific API credentials and documentation
- Contains: Token/API key for each service
- Key files: `credentials`, service-specific `README.md` (beszel, uptime-kuma)
**homelab-documentation.md:**
- Purpose: Authoritative reference for all infrastructure details
- Contains: Network topology, VM/container registry, service mappings, security rules, firewall config
- Must be updated whenever: services added/removed, IPs changed, configurations modified
**CLAUDE.md:**
- Purpose: Claude Code (AI assistant) guidance and quick reference
- Contains: Environment setup, helper script signatures, API access patterns, security notes
- Auto-loaded by Claude when working in this repository
**.planning/codebase/:**
- Purpose: GSD codebase analysis artifacts
- Will contain: ARCHITECTURE.md, STRUCTURE.md, CONVENTIONS.md, TESTING.md, STACK.md, INTEGRATIONS.md, CONCERNS.md
- Generated by: GSD codebase mapper, consumed by GSD planner/executor
## Key File Locations
**Entry Points:**
- `telegram/bot.py`: Telegram bot entry point (asyncio-based)
- `pve-homelab-kit/install.sh`: Initial PVE setup entry point
**Configuration:**
- `homelab-documentation.md`: Infrastructure reference (IPs, ports, network topology, firewall rules)
- `CLAUDE.md`: Claude Code environment setup and quick reference
- `.planning/`: Planning and analysis artifacts
**Core Logic:**
- `~/bin/pve`: Proxmox VE API wrapper (Python, 200 lines)
- `~/bin/dns`: Technitium DNS API wrapper (Bash, 107 lines)
- `~/bin/pbs`: PBS backup status and management (Bash, 400+ lines)
- `~/bin/beszel`: Beszel monitoring dashboard API (Bash/Python, 137 lines)
- `~/bin/kuma`: Uptime Kuma monitor management (Bash, 144 lines)
- `~/bin/updates`: Service version checking and updates (Bash, 450+ lines)
- `~/bin/telegram`: CLI helper for Telegram bot control (2-way messaging)
- `~/bin/npm-api`: NPM reverse proxy management (wrapper script)
- `telegram/bot.py`: Telegram bot with command handlers and media management
**Testing:**
- Not applicable (no automated tests in this repository)
## Naming Conventions
**Files:**
- Lowercase with hyphens for multi-word names: `npm-api`, `uptime-kuma`, `pve-homelab-kit`
- Markdown documentation: UPPERCASE.md (`README.md`, `CLAUDE.md`, `homelab-documentation.md`)
- Configuration/credential files: lowercase `credentials` with optional zone prefix
**Directories:**
- Service-specific: lowercase, match service name (`npm`, `dns`, `dockge`, `forgejo`, `beszel`, `telegram`)
- Functional: category name (`pve`, `pve-homelab-kit`)
- Hidden: `.planning`, `.git` for system metadata
**Variables & Parameters:**
- Environment variables: UPPERCASE_WITH_UNDERSCORES (e.g., `TELEGRAM_BOT_TOKEN`, `DNS_HOST`, `KUMA_API_KEY`)
- Bash functions: lowercase_with_underscores (e.g., `get_token()`, `run_command()`, `ssh_pbs()`)
- Python functions: lowercase_with_underscores (e.g., `is_authorized()`, `run_command()`, `get_status()`)
## Where to Add New Code
**New Helper Script (CLI tool):**
- Primary code: `~/bin/{service_name}` (no extension, executable)
- Credentials: `~/.config/{service_name}/credentials`
- Documentation: Top-of-file comment with usage examples
- Language: Bash for shell commands/APIs, Python for complex logic (use Python venv)
**New Service Configuration:**
- Directory: `/home/mikkel/homelab/{service_name}/`
- Credentials file: `{service_name}/credentials`
- Documentation: `{service_name}/README.md` (include API examples and setup)
- Git handling: All credentials in `.gitignore`, document as `credentials.example` if needed
**New Telegram Bot Command:**
- File: `telegram/bot.py` (add function to existing handlers section)
- Pattern: Async function named `cmd_name()`, check authorization first with `is_authorized()`
- Result: Send back via `update.message.reply_text()`
- Timeout: Default 30 seconds (configurable via `run_command()`)
**New Documentation:**
- Infrastructure changes: Update `homelab-documentation.md` (IPs, service registry, network config)
- Claude Code guidance: Update `CLAUDE.md` (new helper scripts, environment setup)
- Service-specific: Create `{service_name}/README.md` with API examples and access patterns
**Shared Utilities:**
- Location: Create in `~/lib/` or `~/venv/lib/` for Python packages
- Access: Import in other scripts or source in Bash
## Special Directories
**.planning/codebase/:**
- Purpose: GSD analysis artifacts
- Generated: Yes (by GSD codebase mapper)
- Committed: Yes (part of repository for reference)
**telegram/images/ & telegram/files/:**
- Purpose: Media uploaded via Telegram bot
- Generated: Yes (bot downloads on receipt)
- Committed: No (in `.gitignore`)
**telegram/inbox:**
- Purpose: Admin messages to Claude
- Generated: Yes (bot appends messages)
- Committed: No (in `.gitignore`)
**.git/**
- Purpose: Git repository metadata
- Generated: Yes (by git)
- Committed: No (system directory)
**pve-homelab-kit/.planning/**
- Purpose: Subproject planning documents
- Generated: Yes (by GSD mapper on subproject)
- Committed: Yes (tracked in subproject)
## Credential File Organization
All credentials stored in `~/.config/{service}/credentials` using key=value format (one per line):
```bash
# ~/.config/pve/credentials
host=core.georgsen.dk
user=root@pam
token_name=automation
token_value=<token-uuid>
# ~/.config/dns/credentials
DNS_HOST=10.5.0.2
DNS_PORT=5380
DNS_USER=admin
DNS_PASS=<password>
# ~/.config/beszel/credentials
BESZEL_HOST=10.5.0.10
BESZEL_PORT=8090
BESZEL_USER=<email>
BESZEL_PASS=<password>
```
**Loading Pattern:**
- Bash: `source ~/.config/{service}/credentials` or inline `$(cat ~/.config/{service}/credentials | grep ^KEY= | cut -d= -f2-)`
- Python: Read file, parse `key=value` lines into dict
- Never hardcode credentials in scripts
---
*Structure analysis: 2026-02-04*

View file

@ -1,324 +0,0 @@
# Testing Patterns
**Analysis Date:** 2026-02-04
## Test Framework
**Current State:**
- **No automated testing detected** in this codebase
- No test files found (no `*.test.py`, `*_test.py`, `*.spec.py` files)
- No testing configuration files (no `pytest.ini`, `tox.ini`, `setup.cfg`)
- No test dependencies in requirements (no pytest, unittest, mock imports)
**Implications:**
This is a **scripts-only codebase** - all code consists of CLI helper scripts and one bot automation. Manual testing is the primary validation method.
## Script Testing Approach
Since this codebase consists entirely of helper scripts and automation, testing is manual and implicit:
**Command-Line Validation:**
- Each script has a usage/help message showing all commands
- Example from `pve`:
```python
if len(sys.argv) < 2:
print(__doc__)
sys.exit(1)
```
- Example from `telegram`:
```bash
case "${1:-}" in
send) cmd_send "$2" ;;
inbox) cmd_inbox ;;
*) usage; exit 1 ;;
esac
```
**Entry Point Testing:**
Main execution guards are used throughout:
```python
if __name__ == "__main__":
main()
```
This allows scripts to be imported (theoretically) without side effects, though in practice they are not used as modules.
## API Integration Testing
**Pattern: Try-Except Fallback:**
Many scripts handle multiple service types by trying different approaches:
From `pve` script (lines 55-85):
```python
def get_status(vmid):
"""Get detailed status of a VM/container."""
vmid = int(vmid)
# Try as container first
try:
status = pve.nodes(NODE).lxc(vmid).status.current.get()
# ... container-specific logic
return
except:
pass
# Try as VM
try:
status = pve.nodes(NODE).qemu(vmid).status.current.get()
# ... VM-specific logic
return
except:
pass
print(f"VMID {vmid} not found")
```
This is a pragmatic testing pattern: if one API call fails, try another. Useful for development but fragile without structured error handling.
## Command Dispatch Testing
**Pattern: Argument Validation:**
All scripts validate argument count before executing commands:
From `beszel` script (lines 101-124):
```python
if __name__ == "__main__":
if len(sys.argv) < 2:
usage()
cmd = sys.argv[1]
try:
if cmd == "list":
cmd_list()
elif cmd == "info" and len(sys.argv) == 3:
cmd_info(sys.argv[2])
elif cmd == "add" and len(sys.argv) >= 4:
# ...
else:
usage()
except Exception as e:
print(f"Error: {e}")
sys.exit(1)
```
This catches typos in command names and wrong argument counts, showing usage help.
## Data Processing Testing
**Bash String Parsing:**
Complex regex patterns used in `pbs` script require careful testing:
From `pbs` (lines 122-143):
```bash
ssh_pbs 'tail -500 /var/log/proxmox-backup/tasks/archive 2>/dev/null' | while IFS= read -r line; do
if [[ "$line" =~ UPID:pbs:[^:]+:[^:]+:[^:]+:([0-9A-Fa-f]+):([^:]+):([^:]+):.*\ [0-9A-Fa-f]+\ (OK|ERROR|WARNINGS[^$]*) ]]; then
task_time=$((16#${BASH_REMATCH[1]}))
task_type="${BASH_REMATCH[2]}"
task_target="${BASH_REMATCH[3]}"
status="${BASH_REMATCH[4]}"
# ... process matched groups
fi
done
```
**Manual Testing Approach:**
- Run command against live services
- Inspect output format visually
- Verify JSON parsing with inline Python:
```bash
echo "$gc_json" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d.get('disk-bytes',0))"
```
## Mock Testing Pattern (Telegram Bot)
The telegram bot has one pattern that resembles mocking - subprocess mocking via `run_command()`:
From `telegram/bot.py` (lines 60-78):
```python
def run_command(cmd: list, timeout: int = 30) -> str:
"""Run a shell command and return output."""
try:
result = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=timeout,
env={**os.environ, 'PATH': f"/home/mikkel/bin:{os.environ.get('PATH', '')}"}
)
output = result.stdout or result.stderr or "No output"
# Telegram has 4096 char limit per message
if len(output) > 4000:
output = output[:4000] + "\n... (truncated)"
return output
except subprocess.TimeoutExpired:
return "Command timed out"
except Exception as e:
return f"Error: {e}"
```
This function:
- Runs external commands with timeout protection
- Handles both stdout and stderr
- Truncates output for Telegram's message size limits
- Returns error messages instead of raising exceptions
This enables testing command handlers by mocking which commands are available.
## Timeout Testing
The telegram bot handles timeouts explicitly:
From `telegram/bot.py`:
```python
result = subprocess.run(
["ping", "-c", "3", "-W", "2", host],
capture_output=True,
text=True,
timeout=10 # 10 second timeout
)
```
Different commands have different timeouts:
- `ping_host()`: 10 second timeout
- `run_command()`: 30 second default (configurable)
- `backups()`: 60 second timeout (passed to run_command)
This prevents the bot from hanging on slow/unresponsive services.
## Error Message Testing
Scripts validate successful API responses:
From `dns` script (lines 62-69):
```bash
curl -s "$BASE/zones/records/add?..." | python3 -c "
import sys, json
data = json.load(sys.stdin)
if data['status'] == 'ok':
print(f\"Added: {data['response']['addedRecord']['name']} -> ...\")
else:
print(f\"Error: {data.get('errorMessage', 'Unknown error')}\")
"
```
This pattern:
- Parses JSON response
- Checks status field
- Returns user-friendly error message on failure
## Credential Testing
Scripts assume credentials exist and are properly formatted:
From `pve` (lines 17-34):
```python
creds_path = Path.home() / ".config" / "pve" / "credentials"
creds = {}
with open(creds_path) as f:
for line in f:
if "=" in line:
key, value = line.strip().split("=", 1)
creds[key] = value
pve = ProxmoxAPI(
creds["host"],
user=creds["user"],
token_name=creds["token_name"],
token_value=creds["token_value"],
verify_ssl=False
)
```
**Missing Error Handling:**
- No check that credentials file exists
- No check that required keys are present
- No validation that API connection succeeds
- Will crash with KeyError or FileNotFoundError if file missing
**Recommendation for Testing:**
Add pre-flight validation:
```python
required_keys = ["host", "user", "token_name", "token_value"]
missing = [k for k in required_keys if k not in creds]
if missing:
print(f"Error: Missing credentials: {', '.join(missing)}")
sys.exit(1)
```
## File I/O Testing
Telegram bot handles file operations defensively:
From `telegram/bot.py` (lines 277-286):
```python
# Create images directory
images_dir = Path(__file__).parent / 'images'
images_dir.mkdir(exist_ok=True)
# Get the largest photo (best quality)
photo = update.message.photo[-1]
file = await context.bot.get_file(photo.file_id)
# Download the image
filename = f"{file_timestamp}.jpg"
filepath = images_dir / filename
await file.download_to_drive(filepath)
```
**Patterns:**
- `mkdir(exist_ok=True)`: Safely creates directory, doesn't error if exists
- Timestamp-based filenames to avoid collisions: `f"{file_timestamp}_{original_name}"`
- Pathlib for cross-platform path handling
## What to Test If Writing Tests
If converting to automated tests, prioritize:
**High Priority:**
1. **Telegram bot command dispatch** (`telegram/bot.py` lines 107-366)
- Each command handler should have unit tests
- Mock `subprocess.run()` to avoid calling actual commands
- Test authorization checks (`is_authorized()`)
- Test output truncation for large responses
2. **Credential loading** (all helper scripts)
- Test missing credentials file error
- Test malformed credentials
- Test missing required keys
3. **API response parsing** (`dns`, `pbs`, `beszel`, `kuma`)
- Test JSON parsing errors
- Test malformed responses
- Test status code handling
**Medium Priority:**
1. **Bash regex parsing** (`pbs` task/error log parsing)
- Test hex timestamp conversion
- Test status code extraction
- Test task target parsing with special characters
2. **Timeout handling** (all `run_command()` calls)
- Test command timeout
- Test output truncation
- Test error message formatting
**Low Priority:**
1. Integration tests with real services (kept in separate test suite)
2. Performance tests for large data sets
## Current Test Coverage
**Implicit Testing:**
- Manual CLI testing during development
- Live service testing (commands run against real PVE, PBS, DNS, etc.)
- User/admin interaction testing (Telegram bot testing via /start, /status, etc.)
**Gap:**
- No regression testing
- No automated validation of API response formats
- No error case testing
- No refactoring safety net
---
*Testing analysis: 2026-02-04*

View file

@ -1,12 +0,0 @@
{
"mode": "yolo",
"depth": "quick",
"parallelization": true,
"commit_docs": true,
"model_profile": "balanced",
"workflow": {
"research": true,
"plan_check": true,
"verifier": false
}
}

View file

@ -1,305 +0,0 @@
---
phase: 01-session-process-foundation
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- telegram/session_manager.py
- telegram/personas/default.json
- telegram/personas/brainstorm.json
- telegram/personas/planner.json
- telegram/personas/research.json
autonomous: true
must_haves:
truths:
- "SessionManager.create_session('test') creates directory at ~/telegram/sessions/test/ with metadata.json"
- "SessionManager.create_session('test', persona='brainstorm') copies brainstorm persona into session directory"
- "SessionManager.switch_session('test') updates active session and returns previous session name"
- "SessionManager.get_session('test') returns session metadata including name, status, timestamps"
- "Session names are validated: alphanumeric, hyphens, underscores only"
- "Persona library templates exist at ~/telegram/personas/ with at least default, brainstorm, planner, research"
artifacts:
- path: "telegram/session_manager.py"
provides: "Session lifecycle management"
min_lines: 80
contains: "class SessionManager"
- path: "telegram/personas/default.json"
provides: "Default persona template"
contains: "system_prompt"
- path: "telegram/personas/brainstorm.json"
provides: "Brainstorming persona template"
contains: "system_prompt"
key_links:
- from: "telegram/session_manager.py"
to: "telegram/personas/"
via: "persona library lookup on create_session"
pattern: "personas.*json"
- from: "telegram/session_manager.py"
to: "telegram/sessions/"
via: "directory creation and metadata writes"
pattern: "sessions.*metadata\\.json"
---
<objective>
Create the session management module and persona library that provides the filesystem foundation for multi-session Claude Code conversations.
Purpose: Sessions are the core abstraction — each session is an isolated directory where a Claude Code subprocess will run (Plan 02) with its own conversation history, metadata, and persona configuration. This plan builds the session CRUD operations and persona template system.
Output: `telegram/session_manager.py` module and `telegram/personas/` directory with reusable persona templates.
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-session-process-foundation/01-CONTEXT.md
@.planning/phases/01-session-process-foundation/01-RESEARCH.md
@telegram/bot.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create SessionManager module</name>
<files>telegram/session_manager.py</files>
<action>
Create `telegram/session_manager.py` with a `SessionManager` class that manages session lifecycle.
**Session directory structure:**
```
~/telegram/sessions/<name>/
metadata.json # Session state
persona.json # Session persona (copied from library or custom)
.claude/ # Auto-created by Claude Code CLI
```
**SessionManager class design:**
```python
class SessionManager:
def __init__(self, base_dir: Path = None):
# base_dir defaults to ~/telegram/sessions/
# Also tracks personas_dir at ~/telegram/personas/
# Tracks active_session (str or None)
# Tracks all sessions dict[str, SessionMetadata]
```
**Required methods:**
1. `create_session(name: str, persona: str = None) -> Path`
- Validate name: regex `^[a-zA-Z0-9_-]+$`, max 50 chars
- If session already exists: raise ValueError with clear message (least-surprising: don't silently overwrite)
- Create directory at `sessions/<name>/`
- If persona specified: look up `personas/<persona>.json`, copy to `sessions/<name>/persona.json`
- If persona not specified: copy `personas/default.json` to session
- Write `metadata.json` with fields:
```json
{
"name": "session-name",
"created": "ISO-8601 timestamp",
"last_active": "ISO-8601 timestamp",
"persona": "persona-name-or-null",
"pid": null,
"status": "idle"
}
```
- Return session directory Path
- Status values: "idle" (no process), "active" (has running process, is current), "suspended" (has running process, not current)
2. `switch_session(name: str) -> str | None`
- If session doesn't exist: raise ValueError
- If already active: return None (no-op)
- Mark current active session as "suspended" in metadata (process stays alive per CONTEXT.md decision)
- Mark new session as "active" in metadata
- Update `last_active` timestamp on new session
- Update `self.active_session`
- Return previous session name (or None if no previous)
3. `get_session(name: str) -> dict`
- Read and return metadata.json contents for named session
- Raise ValueError if session doesn't exist
4. `list_sessions() -> list[dict]`
- Return list of all session metadata, sorted by last_active (most recent first)
5. `get_active_session() -> str | None`
- Return name of active session or None
6. `update_session(name: str, **kwargs) -> None`
- Update specific fields in session metadata (used by subprocess module to set PID, status)
7. `session_exists(name: str) -> bool`
- Check if session directory exists
8. `get_session_dir(name: str) -> Path`
- Return Path to session directory
9. `load_persona(name: str) -> dict`
- Load persona JSON from library (~/telegram/personas/<name>.json)
- Return persona dict or raise FileNotFoundError
**Implementation notes:**
- Use `pathlib.Path` throughout
- Use `json` stdlib for metadata reads/writes
- Make metadata reads lazy (read from disk each time to avoid stale state)
- Add `logging` using `logging.getLogger(__name__)`
- Include type hints for all methods
- Add module docstring explaining the session model
**DO NOT:**
- Import or depend on claude_subprocess.py (that's Plan 02)
- Add Telegram-specific code (that's Plan 03)
- Implement idle timeout (that's Phase 3)
</action>
<verify>
```bash
cd ~/homelab
source ~/venv/bin/activate
python3 -c "
from telegram.session_manager import SessionManager
sm = SessionManager()
# Create session
path = sm.create_session('test-session')
assert path.exists()
assert (path / 'metadata.json').exists()
assert (path / 'persona.json').exists()
meta = sm.get_session('test-session')
assert meta['name'] == 'test-session'
assert meta['status'] == 'idle'
# Switch session
sm.create_session('second-session')
prev = sm.switch_session('second-session')
assert prev is None # No previous active
assert sm.get_active_session() == 'second-session'
prev = sm.switch_session('test-session')
assert prev == 'second-session'
# List sessions
sessions = sm.list_sessions()
assert len(sessions) >= 2
# Validation
try:
sm.create_session('bad name!')
assert False, 'Should have raised ValueError'
except ValueError:
pass
# Cleanup
import shutil
shutil.rmtree(path.parent / 'test-session')
shutil.rmtree(path.parent / 'second-session')
print('All session manager tests passed!')
"
```
</verify>
<done>SessionManager creates isolated session directories with metadata, handles persona inheritance from library, validates names, switches active session correctly, and lists all sessions sorted by activity.</done>
</task>
<task type="auto">
<name>Task 2: Create persona library with default templates</name>
<files>
telegram/personas/default.json
telegram/personas/brainstorm.json
telegram/personas/planner.json
telegram/personas/research.json
</files>
<action>
Create the persona library directory and four starter personas at `~/homelab/telegram/personas/`.
**Persona JSON schema:**
```json
{
"name": "persona-display-name",
"description": "One-line description of this persona's purpose",
"system_prompt": "The system prompt that shapes Claude's behavior in this session",
"settings": {
"model": "claude-sonnet-4-20250514",
"max_turns": 25
}
}
```
**Personas to create:**
1. **default.json** - General-purpose assistant
- system_prompt: "You are Claude, an AI assistant helping Mikkel manage his homelab infrastructure. You have full access to the management container's tools and can SSH to other containers. Be helpful, thorough, and proactive about suggesting improvements. When making changes, explain what you're doing and why."
- settings.model: "claude-sonnet-4-20250514" (cost-effective default)
- settings.max_turns: 25
2. **brainstorm.json** - Creative ideation mode
- system_prompt: "You are in brainstorming mode. Generate ideas freely without filtering. Build on previous ideas. Explore unconventional approaches. Ask probing questions to understand the problem space better. Don't worry about feasibility yet - that comes later. Output ideas as bullet lists for easy scanning."
- settings.model: "claude-sonnet-4-20250514"
- settings.max_turns: 50 (longer conversations for ideation)
3. **planner.json** - Structured planning mode
- system_prompt: "You are in planning mode. Break down complex tasks into clear, actionable steps. Identify dependencies and ordering. Estimate effort and flag risks. Use structured formats (numbered lists, tables) for clarity. Ask clarifying questions about requirements before diving into solutions."
- settings.model: "claude-sonnet-4-20250514"
- settings.max_turns: 30
4. **research.json** - Deep investigation mode
- system_prompt: "You are in research mode. Investigate topics thoroughly. Check documentation, source code, and configuration files. Cross-reference information. Cite your sources (file paths, URLs). Distinguish between facts and inferences. Summarize findings clearly with actionable recommendations."
- settings.model: "claude-sonnet-4-20250514"
- settings.max_turns: 30
**Notes:**
- The `settings` block will be consumed by the subprocess module (Plan 02) to configure Claude Code CLI flags
- Keep system_prompts concise but distinctive — each persona should feel like a different "mode"
- Use claude-sonnet-4-20250514 as default model (good balance of capability and cost for Telegram-driven sessions)
- The schema is intentionally simple for Phase 1; can be extended in future phases
</action>
<verify>
```bash
cd ~/homelab
python3 -c "
import json
from pathlib import Path
personas_dir = Path('telegram/personas')
assert personas_dir.exists(), 'personas directory missing'
required = ['default.json', 'brainstorm.json', 'planner.json', 'research.json']
for name in required:
path = personas_dir / name
assert path.exists(), f'{name} missing'
data = json.loads(path.read_text())
assert 'name' in data, f'{name} missing name field'
assert 'description' in data, f'{name} missing description field'
assert 'system_prompt' in data, f'{name} missing system_prompt field'
assert 'settings' in data, f'{name} missing settings field'
assert 'model' in data['settings'], f'{name} missing settings.model'
assert 'max_turns' in data['settings'], f'{name} missing settings.max_turns'
print(f' {name}: OK ({data[\"name\"]})')
print('All persona templates valid!')
"
```
</verify>
<done>Four persona templates exist in ~/homelab/telegram/personas/ with valid JSON schema (name, description, system_prompt, settings.model, settings.max_turns). Each persona has a distinct system_prompt that shapes Claude's behavior differently.</done>
</task>
</tasks>
<verification>
1. `telegram/session_manager.py` exists with `SessionManager` class
2. `telegram/personas/` directory contains 4 valid persona JSON files
3. Creating a session writes metadata.json and copies persona to session directory
4. Session switching updates active session and marks previous as suspended
5. Session name validation rejects invalid characters
6. No imports from claude_subprocess or telegram bot modules
</verification>
<success_criteria>
- SessionManager can create, list, switch, and query sessions purely via filesystem operations
- Persona library provides 4 distinct templates with consistent schema
- Session directories are fully isolated (each has own metadata.json and persona.json)
- All verification scripts pass without errors
</success_criteria>
<output>
After completion, create `.planning/phases/01-session-process-foundation/01-01-SUMMARY.md`
</output>

View file

@ -1,146 +0,0 @@
---
phase: 01-session-process-foundation
plan: 01
subsystem: infra
tags: [python, sessions, filesystem, telegram-bot]
# Dependency graph
requires: []
provides:
- SessionManager class for session lifecycle management
- Persona library with 4 default templates (default, brainstorm, planner, research)
- Session directory structure with metadata and persona configuration
- Session switching and active session tracking
affects: [02-process-management, 03-telegram-integration]
# Tech tracking
tech-stack:
added: []
patterns:
- Path-based session isolation via directory per session
- JSON metadata persistence for session state
- Persona library pattern for behavior templates
key-files:
created:
- telegram/session_manager.py
- telegram/__init__.py
- telegram/personas/default.json
- telegram/personas/brainstorm.json
- telegram/personas/planner.json
- telegram/personas/research.json
modified: []
key-decisions:
- "Sessions created as 'idle', become 'active' only on explicit switch"
- "Persona library uses JSON schema with name, description, system_prompt, settings"
- "Base directory defaults to ~/homelab/telegram/sessions/"
patterns-established:
- "Session lifecycle: idle → active → suspended"
- "Persona inheritance: copy from library to session directory on create"
- "Metadata read on demand (no caching) to avoid stale state"
# Metrics
duration: 3min
completed: 2026-02-04
---
# Phase 01 Plan 01: Session Manager & Persona Library Summary
**Session lifecycle management with filesystem-backed state and persona template library for multi-context Claude Code conversations**
## Performance
- **Duration:** 3 min
- **Started:** 2026-02-04T17:31:09Z
- **Completed:** 2026-02-04T17:34:10Z
- **Tasks:** 2
- **Files modified:** 7
## Accomplishments
- SessionManager class with complete session CRUD operations
- Session directory structure with metadata.json and persona.json per session
- Persona library with 4 distinct behavioral templates
- Session validation, switching, and active session tracking
- Fully isolated session directories ready for Claude Code subprocess
## Task Commits
Each task was committed atomically:
1. **Task 1: Create SessionManager module** - `447855c` (feat)
2. **Task 2: Create persona library with default templates** - `ba8acf0` (feat)
## Files Created/Modified
- `telegram/session_manager.py` - SessionManager class with session lifecycle management
- `telegram/__init__.py` - Python package initialization
- `telegram/personas/default.json` - General-purpose homelab assistant persona
- `telegram/personas/brainstorm.json` - Creative ideation mode persona
- `telegram/personas/planner.json` - Structured planning mode persona
- `telegram/personas/research.json` - Deep investigation mode persona
## Decisions Made
**1. Sessions created as 'idle', activated explicitly**
- Rationale: Creating a session doesn't mean it's immediately in use. User must explicitly switch to it, making the active session unambiguous.
**2. Persona library uses simple JSON schema**
- Schema includes: name, description, system_prompt, settings (model, max_turns)
- Rationale: Simple schema is easy to extend later, but provides essential fields for subprocess module to configure Claude Code CLI.
**3. Base directory defaults to ~/homelab/telegram/sessions/**
- Rationale: Bot runs from homelab directory, sessions should be colocated with bot code for easy access.
**4. Metadata read from disk on every access**
- No in-memory caching of session metadata
- Rationale: Avoids stale state issues if multiple processes interact with sessions, keeps implementation simple.
## Deviations from Plan
### Auto-fixed Issues
**1. [Rule 3 - Blocking] Added telegram/__init__.py to make package importable**
- **Found during:** Task 1 verification
- **Issue:** telegram/ directory wasn't a Python package, causing ModuleNotFoundError
- **Fix:** Created telegram/__init__.py with package docstring
- **Files modified:** telegram/__init__.py
- **Verification:** Import succeeded in verification script
- **Committed in:** 447855c (Task 1 commit)
**2. [Rule 3 - Blocking] Fixed SessionManager paths to use homelab directory**
- **Found during:** Task 2 verification
- **Issue:** SessionManager used Path.home() which pointed to /home/mikkel/, but personas were in /home/mikkel/homelab/telegram/
- **Fix:** Changed base_dir and personas_dir initialization to use Path.home() / "homelab" / "telegram"
- **Files modified:** telegram/session_manager.py
- **Verification:** Session creation with persona succeeded
- **Committed in:** ba8acf0 (Task 2 commit)
---
**Total deviations:** 2 auto-fixed (2 blocking)
**Impact on plan:** Both auto-fixes were necessary to make the module functional. No scope changes.
## Issues Encountered
None - straightforward filesystem operations and JSON serialization.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
**Ready:**
- Session management foundation complete
- Persona library provides template system for subprocess configuration
- Session directory structure ready for Claude Code .claude/ data
**For Phase 02 (Process Management):**
- SessionManager provides get_session_dir() for subprocess working directory
- Session metadata tracks PID and status (idle/active/suspended)
- Persona settings available for subprocess to configure Claude Code CLI flags
**No blockers or concerns.**
---
*Phase: 01-session-process-foundation*
*Completed: 2026-02-04*

View file

@ -1,252 +0,0 @@
---
phase: 01-session-process-foundation
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- telegram/claude_subprocess.py
autonomous: true
must_haves:
truths:
- "ClaudeSubprocess spawns Claude Code CLI in a given session directory using asyncio.create_subprocess_exec"
- "Stdout and stderr are read concurrently via asyncio.gather -- no pipe deadlock occurs"
- "Process termination uses terminate() + wait_for() with timeout fallback to kill() -- no zombies"
- "Messages queued while Claude is processing are sent after current response completes"
- "If Claude Code crashes, it auto-restarts with --continue flag and a notification callback fires"
- "Stream-json output is parsed line-by-line, routing assistant/result/system events to callbacks"
artifacts:
- path: "telegram/claude_subprocess.py"
provides: "Claude Code subprocess lifecycle management"
min_lines: 120
contains: "class ClaudeSubprocess"
key_links:
- from: "telegram/claude_subprocess.py"
to: "claude CLI"
via: "asyncio.create_subprocess_exec with PIPE"
pattern: "create_subprocess_exec.*claude"
- from: "telegram/claude_subprocess.py"
to: "asyncio.gather"
via: "concurrent stdout/stderr reading"
pattern: "asyncio\\.gather"
- from: "telegram/claude_subprocess.py"
to: "process cleanup"
via: "terminate + wait_for + kill fallback"
pattern: "terminate.*wait_for|kill.*wait"
---
<objective>
Create the Claude Code subprocess engine that safely spawns, communicates with, and manages Claude Code CLI processes using asyncio.
Purpose: This module is the I/O bridge between session management and Claude Code. It handles the dangerous parts: pipe management without deadlocks, process lifecycle without zombies, message queueing during processing, and crash recovery with session resumption. The research (01-RESEARCH.md) has validated that pipes + stream-json is the correct approach over PTY.
Output: `telegram/claude_subprocess.py` module with `ClaudeSubprocess` class.
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-session-process-foundation/01-CONTEXT.md
@.planning/phases/01-session-process-foundation/01-RESEARCH.md
</context>
<tasks>
<task type="auto">
<name>Task 1: Create ClaudeSubprocess module with spawn, I/O, and lifecycle management</name>
<files>telegram/claude_subprocess.py</files>
<action>
Create `telegram/claude_subprocess.py` with a `ClaudeSubprocess` class that manages a single Claude Code CLI subprocess.
**Class design:**
```python
class ClaudeSubprocess:
def __init__(self, session_dir: Path, persona: dict = None,
on_output: Callable = None, on_error: Callable = None,
on_complete: Callable = None, on_status: Callable = None):
"""
Args:
session_dir: Path to session directory (cwd for subprocess)
persona: Persona dict with system_prompt and settings
on_output: Callback(text: str) for assistant text output
on_error: Callback(error: str) for error messages
on_complete: Callback() when a turn completes
on_status: Callback(status: str) for status updates (e.g. "Claude restarted")
"""
```
**Required methods:**
1. `async send_message(message: str) -> None`
- If no process is running: spawn one with this message
- If process IS running and BUSY: queue message (append to internal asyncio.Queue)
- If process IS running and IDLE: send as new turn (spawn new `claude -p` invocation)
- Track processing state (busy vs idle) to know when to queue
2. `async _spawn(message: str) -> None`
- Build Claude Code command:
```
claude -p "<message>"
--output-format stream-json
--verbose
--max-turns <from persona settings, default 25>
--model <from persona settings, default claude-sonnet-4-20250514>
```
- If persona has system_prompt, add: `--system-prompt "<system_prompt>"`
- If `.claude/` exists in session_dir (prior session): add `--continue` flag for history
- Spawn with `asyncio.create_subprocess_exec()`:
- stdout=PIPE, stderr=PIPE
- cwd=str(session_dir)
- env: inherit current env, ensure PATH includes ~/bin and ~/.local/bin
- Store process reference in `self._process`
- Store PID for metadata updates
- Set `self._busy = True`
- Launch concurrent stream readers via `asyncio.create_task(self._read_streams())`
3. `async _read_streams() -> None`
- Use `asyncio.gather()` to read stdout and stderr concurrently (CRITICAL for deadlock prevention)
- stdout handler: `self._handle_stdout_line(line)`
- stderr handler: `self._handle_stderr_line(line)`
- After both streams end: `await self._process.wait()`
- Set `self._busy = False`
- Call `self.on_complete()` callback
- Process queued messages: if `self._message_queue` not empty, pop and `await self.send_message(msg)`
4. `_handle_stdout_line(line: str) -> None`
- Parse as JSON (stream-json format, one JSON object per line)
- Route by event type:
- `"assistant"`: Extract text blocks from `event["message"]["content"]`, call `self.on_output(text)` for each text block
- `"result"`: Turn complete. If `event.get("is_error")`, call `self.on_error(...)`. Log session_id if present.
- `"system"`: Log system events. If subtype is error, call `self.on_error(...)`.
- On `json.JSONDecodeError`: log warning, skip line (Claude Code may emit non-JSON lines)
5. `_handle_stderr_line(line: str) -> None`
- Log as warning (stderr from Claude Code is usually diagnostics, not errors)
- If line contains "error" (case-insensitive), also call `self.on_error(line)`
6. `async terminate(timeout: int = 10) -> None`
- If no process or already terminated (`returncode is not None`): return
- Call `self._process.terminate()` (SIGTERM)
- `await asyncio.wait_for(self._process.wait(), timeout=timeout)`
- On TimeoutError: `self._process.kill()` then `await self._process.wait()` (CRITICAL: always reap)
- Clear `self._process` reference
- Set `self._busy = False`
7. `async _handle_crash() -> None`
- Called when process exits with non-zero return code unexpectedly
- Call `self.on_status("Claude crashed, restarting with context preserved...")` if callback set
- Wait 1 second (backoff)
- Respawn with `--continue` flag (loads most recent session from .claude/ in session_dir)
- If respawn fails 3 times: call `self.on_error("Claude failed to restart after 3 attempts")`
8. `@property is_busy -> bool`
- Return whether subprocess is currently processing a message
9. `@property is_alive -> bool`
- Return whether subprocess process is running (process exists and returncode is None)
**Internal state:**
- `self._process: asyncio.subprocess.Process | None`
- `self._busy: bool`
- `self._message_queue: asyncio.Queue`
- `self._reader_task: asyncio.Task | None`
- `self._crash_count: int` (reset on successful completion)
- `self._session_dir: Path`
- `self._persona: dict | None`
**Implementation notes:**
- Use `asyncio.create_subprocess_exec` (NOT `shell=True` -- avoid shell injection)
- For the env, ensure PATH includes `/home/mikkel/bin:/home/mikkel/.local/bin`
- Add `logging` with `logging.getLogger(__name__)`
- Include type hints for all methods
- Add module docstring explaining the subprocess interaction model
- The `_read_streams` method must handle the case where stdout/stderr complete at different times
- Use `async for line in stream` pattern or `readline()` loop for line-by-line reading
**Read 01-RESEARCH.md** for verified code patterns (Pattern 1: Concurrent Stream Reading, Pattern 3: Stream-JSON Event Handling, Pattern 4: Process Lifecycle Management).
**DO NOT:**
- Import session_manager.py (that module manages metadata; this module manages processes)
- Add Telegram-specific imports (that's Plan 03)
- Implement idle timeout (that's Phase 3)
- Use PTY (research confirms pipes are correct for Claude Code CLI)
</action>
<verify>
```bash
cd ~/homelab
source ~/venv/bin/activate
# Verify module loads and class structure is correct
python3 -c "
import asyncio
from pathlib import Path
from telegram.claude_subprocess import ClaudeSubprocess
# Verify class exists and has required methods
sub = ClaudeSubprocess(
session_dir=Path('/tmp/test-claude-session'),
on_output=lambda text: print(f'OUTPUT: {text}'),
on_error=lambda err: print(f'ERROR: {err}'),
on_complete=lambda: print('COMPLETE'),
on_status=lambda s: print(f'STATUS: {s}')
)
assert hasattr(sub, 'send_message'), 'missing send_message'
assert hasattr(sub, 'terminate'), 'missing terminate'
assert hasattr(sub, 'is_busy'), 'missing is_busy'
assert hasattr(sub, 'is_alive'), 'missing is_alive'
assert not sub.is_busy, 'should start not busy'
assert not sub.is_alive, 'should start not alive'
print('ClaudeSubprocess class structure verified!')
"
# Verify concurrent stream reading implementation exists
python3 -c "
import inspect
from telegram.claude_subprocess import ClaudeSubprocess
source = inspect.getsource(ClaudeSubprocess)
assert 'asyncio.gather' in source, 'Missing asyncio.gather for concurrent stream reading'
assert 'create_subprocess_exec' in source, 'Missing create_subprocess_exec'
assert 'stream-json' in source, 'Missing stream-json output format'
assert 'terminate' in source, 'Missing terminate method'
assert 'wait_for' in source or 'wait(' in source, 'Missing process wait'
print('Implementation patterns verified!')
"
```
</verify>
<done>ClaudeSubprocess class spawns Claude Code CLI with stream-json output in session directories, reads stdout/stderr concurrently via asyncio.gather, handles process lifecycle with clean termination (no zombies), queues messages during processing, and auto-restarts on crash with --continue flag.</done>
</task>
</tasks>
<verification>
1. `telegram/claude_subprocess.py` exists with `ClaudeSubprocess` class
2. Class uses `asyncio.create_subprocess_exec` (not shell=True)
3. Stdout and stderr reading uses `asyncio.gather` for concurrent draining
4. Process termination implements terminate -> wait_for -> kill -> wait pattern
5. Message queue uses `asyncio.Queue` for thread-safe queueing
6. Crash recovery attempts respawn with `--continue` flag, max 3 retries
7. Stream-json parsing handles assistant, result, and system event types
8. No imports from session_manager or telegram bot modules
</verification>
<success_criteria>
- ClaudeSubprocess module loads without import errors
- Class has all required methods and properties
- Implementation uses asyncio.gather for concurrent stream reading (verified via source inspection)
- Process lifecycle follows terminate -> wait pattern (verified via source inspection)
- Module is self-contained with callback-based communication (no tight coupling to session manager or bot)
</success_criteria>
<output>
After completion, create `.planning/phases/01-session-process-foundation/01-02-SUMMARY.md`
</output>

View file

@ -1,122 +0,0 @@
---
phase: 01-session-process-foundation
plan: 02
subsystem: infra
tags: [asyncio, subprocess, python, claude-code-cli, stream-json, process-management]
# Dependency graph
requires:
- phase: 01-session-process-foundation
provides: Session metadata and directory management
provides:
- Claude Code subprocess lifecycle management with crash recovery
- Stream-json event parsing and routing to callbacks
- Concurrent stdout/stderr reading to prevent pipe deadlocks
- Message queueing during Claude processing
- Graceful process termination without zombies
affects: [01-03, telegram-integration, message-handling]
# Tech tracking
tech-stack:
added: []
patterns:
- "Asyncio subprocess management with concurrent stream readers"
- "Stream-json event routing to callback functions"
- "Crash recovery with --continue flag"
- "terminate() + wait_for() + kill() fallback pattern"
key-files:
created:
- telegram/claude_subprocess.py
modified: []
key-decisions:
- "Use asyncio.gather for concurrent stdout/stderr reading (prevents pipe deadlock)"
- "Queue messages during processing, send after completion"
- "Auto-restart on crash with --continue flag (max 3 retries)"
- "Spawn fresh process per turn (not stdin piping) for Phase 1 simplicity"
patterns-established:
- "Pattern 1: Concurrent stream reading - Always use asyncio.gather() for stdout/stderr to prevent pipe buffer overflow deadlocks"
- "Pattern 2: Process lifecycle - terminate() + wait_for(timeout) + kill() + wait() ensures no zombies"
- "Pattern 3: Stream-json parsing - Line-by-line JSON.loads() with try/except, route by event type"
- "Pattern 4: Callback architecture - on_output/on_error/on_complete/on_status for decoupled communication"
# Metrics
duration: 9min
completed: 2026-02-04
---
# Phase 1 Plan 2: ClaudeSubprocess Module Summary
**Asyncio-based Claude Code subprocess engine with concurrent stream reading, message queueing, crash recovery, and stream-json event routing via callbacks**
## Performance
- **Duration:** 9 minutes
- **Started:** 2026-02-04T17:32:07Z
- **Completed:** 2026-02-04T17:41:16Z
- **Tasks:** 1
- **Files modified:** 1
## Accomplishments
- Created ClaudeSubprocess class that spawns Claude Code CLI with stream-json output
- Implemented concurrent stdout/stderr reading via asyncio.gather to prevent pipe deadlocks
- Built message queueing system for messages received during Claude processing
- Implemented crash recovery with auto-restart using --continue flag (max 3 retries)
- Added graceful process termination with terminate → wait_for → kill → wait pattern (no zombies)
- Established callback-based architecture for decoupled communication with session manager
## Task Commits
Each task was committed atomically:
1. **Task 1: Create ClaudeSubprocess module with spawn, I/O, and lifecycle management** - `8fce10c` (feat)
## Files Created/Modified
- `telegram/claude_subprocess.py` - Claude Code subprocess lifecycle management with asyncio, handles spawning with persona settings, concurrent stream reading, stream-json event parsing, message queueing, crash recovery, and graceful termination
## Decisions Made
1. **Concurrent stream reading pattern**: Use `asyncio.gather()` to read stdout and stderr concurrently, preventing pipe buffer deadlock (verified by research as critical pattern)
2. **Message queueing strategy**: Queue messages in `asyncio.Queue` while subprocess is busy, process queue after completion callback. This ensures messages don't interrupt active processing.
3. **Crash recovery approach**: Auto-restart with `--continue` flag up to 3 times with 1-second backoff. Claude Code's session persistence in `.claude/` directory enables context preservation across crashes.
4. **Fresh process per turn**: Spawn new `claude -p` invocation for each turn rather than piping to stdin. Simpler for Phase 1; Phase 2+ might use `--input-format stream-json` for live piping.
5. **Callback architecture**: Decouple subprocess management from session management via callbacks (on_output, on_error, on_complete, on_status). Enables clean separation of concerns.
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None - implementation followed research patterns without issues.
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
**Ready for Plan 03 (Telegram Integration):**
- ClaudeSubprocess provides complete subprocess lifecycle management
- Callback architecture enables clean integration with Telegram bot message handlers
- Message queueing handles concurrent messages during processing
- Process termination and crash recovery are production-ready
**Integration points for Plan 03:**
- Pass Telegram message text to `send_message()`
- Route `on_output` callback to Telegram message sending
- Route `on_error` callback to Telegram error notifications
- Use `on_status` callback for typing indicators
- Call `terminate()` during session cleanup/switching
**No blockers or concerns.**
---
*Phase: 01-session-process-foundation*
*Completed: 2026-02-04*

View file

@ -1,268 +0,0 @@
---
phase: 01-session-process-foundation
plan: 03
type: execute
wave: 2
depends_on: ["01-01", "01-02"]
files_modified:
- telegram/bot.py
autonomous: false
must_haves:
truths:
- "User sends /new myproject in Telegram and receives confirmation that session was created"
- "User sends /new myproject brainstorm and session is created with brainstorm persona"
- "User sends /session myproject in Telegram and active session switches"
- "User sends plain text with no active session and gets prompted to create one"
- "User sends plain text with active session and message is routed to ClaudeSubprocess"
- "/new with duplicate name returns friendly error, not crash"
- "No zombie processes after switching sessions"
artifacts:
- path: "telegram/bot.py"
provides: "Bot with /new and /session commands wired to session manager and subprocess"
contains: "SessionManager"
key_links:
- from: "telegram/bot.py"
to: "telegram/session_manager.py"
via: "SessionManager import and method calls"
pattern: "from telegram\\.session_manager import|SessionManager"
- from: "telegram/bot.py"
to: "telegram/claude_subprocess.py"
via: "ClaudeSubprocess import for process spawning"
pattern: "from telegram\\.claude_subprocess import|ClaudeSubprocess"
- from: "telegram/bot.py /new handler"
to: "SessionManager.create_session"
via: "direct method call"
pattern: "create_session"
- from: "telegram/bot.py /session handler"
to: "SessionManager.switch_session"
via: "direct method call"
pattern: "switch_session"
---
<objective>
Wire the session manager and Claude subprocess modules into the existing Telegram bot, adding `/new` and `/session` commands and routing plain text messages to the active session's Claude Code subprocess.
Purpose: This plan connects the foundation pieces (Plans 01 and 02) to the existing Telegram bot. After this plan, users can create sessions, switch between them, and send messages that spawn Claude Code subprocesses. This completes Phase 1's core goal: path-based sessions with subprocess management.
Output: Updated `telegram/bot.py` with new command handlers and session-aware message routing.
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-session-process-foundation/01-CONTEXT.md
@.planning/phases/01-session-process-foundation/01-RESEARCH.md
@.planning/phases/01-session-process-foundation/01-01-SUMMARY.md
@.planning/phases/01-session-process-foundation/01-02-SUMMARY.md
@telegram/bot.py
@telegram/session_manager.py
@telegram/claude_subprocess.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Add /new and /session commands to bot.py</name>
<files>telegram/bot.py</files>
<action>
Modify the existing `telegram/bot.py` to add session management commands and wire up the Claude subprocess for message handling.
**New imports to add:**
```python
from telegram.session_manager import SessionManager
from telegram.claude_subprocess import ClaudeSubprocess
```
**Module-level state:**
- Create a `SessionManager` instance (singleton for the bot process)
- Create a dict `subprocesses: dict[str, ClaudeSubprocess]` to track subprocess per session — **CRITICAL: this dict persists ALL session subprocesses, not just the active one. When switching from session A to B, the subprocess for A stays alive in `subprocesses['A']` with status "suspended" via SessionManager. This implements the locked decision: "Switching sessions suspends (not kills) the current process — it stays alive. No limit on concurrent live Claude Code processes."**
**New command: `/new <name> [persona]`**
Handler: `async def new_session(update, context)`
- Extract name from `context.args[0]` (required)
- Extract persona from `context.args[1]` if provided (optional)
- Validate: if no args, reply with usage: "Usage: /new <name> [persona]"
- Call `session_manager.create_session(name, persona=persona)`
- On ValueError (duplicate name): reply "Session '<name>' already exists. Use /session <name> to switch to it."
- On success:
- Auto-switch to new session: call `session_manager.switch_session(name)`
- Reply "Session '<name>' created." (include persona name if specified: "Session '<name>' created with persona '<persona>'.")
- Do NOT auto-spawn subprocess yet (spawn happens on first message, per CONTEXT.md: "Switching to a session with no running process auto-spawns Claude Code immediately" -- but creating is not switching. Actually, /new does auto-switch, so it should auto-spawn. However, we can defer spawn to first message for simplicity. Let Claude handle this: spawn subprocess immediately after creation since we auto-switch.)
- Spawn a ClaudeSubprocess for the new session (but don't send a message yet -- it's just ready)
**New command: `/session <name>`**
Handler: `async def switch_session_cmd(update, context)`
- Extract name from `context.args[0]` (required)
- Validate: if no args, reply with usage and list available sessions
- If session doesn't exist: reply "Session '<name>' doesn't exist. Use /new <name> to create it."
- Call `session_manager.switch_session(name)` — this marks the PREVIOUS session as "suspended" in metadata (subprocess stays alive in `subprocesses` dict)
- **MANDATORY auto-spawn:** After switching, check if `subprocesses.get(name)` is None or `not subprocesses[name].is_alive`. If so, create a new ClaudeSubprocess with session directory and persona, store in `subprocesses[name]`. This implements the locked decision: "Switching to a session with no running process auto-spawns Claude Code immediately." The subprocess idles waiting for first message.
- Reply "Switched to session '<name>'." (include persona if set)
**Modified message handler: `handle_message`**
Replace the current implementation (which saves to inbox) with session-aware routing:
- If no active session: reply "No active session. Use /new <name> to start one."
- If active session exists:
- Get or create ClaudeSubprocess for session (auto-spawn if not alive)
- Call `await subprocess.send_message(update.message.text)` — **Note: ClaudeSubprocess handles internal message queueing via asyncio.Queue (Plan 02 design). If the subprocess is busy processing a previous message, `send_message()` queues the message and it will be sent after the current response completes. The bot handler does NOT need to check `is_busy` — just call `send_message()` and the subprocess manages the queue.**
- The on_output callback will need to send messages back to the chat. For Phase 1, create callbacks that send to the Telegram chat:
```python
async def on_output(text):
await update.message.reply_text(text)
```
- Note: For Phase 1, basic output routing is needed. Full Telegram integration (typing indicators, message batching, code block handling) comes in Phase 2.
- Register handler with `block=False` for non-blocking execution (per research: prevents blocking Telegram event loop)
**Subprocess callback factory:**
Create a helper function that generates callbacks bound to a specific chat_id and bot instance:
```python
def make_callbacks(bot, chat_id):
async def on_output(text):
# Truncate to Telegram limit
if len(text) > 4000:
text = text[:4000] + "\n... (truncated)"
await bot.send_message(chat_id=chat_id, text=text)
async def on_error(error):
await bot.send_message(chat_id=chat_id, text=f"Error: {error}")
async def on_complete():
pass # Phase 2 will add typing indicator cleanup
async def on_status(status):
await bot.send_message(chat_id=chat_id, text=f"[{status}]")
return on_output, on_error, on_complete, on_status
```
**Important: callback async handling.** The ClaudeSubprocess callbacks may be called from a background task. Since `bot.send_message` is async, the callbacks need to be async and properly awaited. The subprocess module should use `asyncio.ensure_future()` or check if callbacks are coroutines.
Actually, reconsider: the subprocess stream reader runs in a background asyncio.Task. Callbacks called from there ARE in the event loop already. So async callbacks work naturally. The subprocess module should `await callback(text)` if the callback is a coroutine, or call `asyncio.create_task(callback(text))` to not block the reader.
**Handler registration updates:**
Add to `main()`:
```python
app.add_handler(CommandHandler("new", new_session))
app.add_handler(CommandHandler("session", switch_session_cmd))
```
Update the text message handler to use `block=False`:
```python
app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message, block=False))
```
**Update /help text:** Add /new and /session to the help output.
**Keep all existing commands working.** Do not remove or modify `/status`, `/pbs`, `/backups`, etc. Only modify `handle_message` and add new handlers.
**Authorization:** All new handlers must check `is_authorized(update.effective_user.id)` as the first line, following existing pattern.
**DO NOT:**
- Remove or break existing command handlers
- Implement typing indicators (Phase 2)
- Implement message batching or code block handling (Phase 2)
- Implement idle timeout (Phase 3)
- Make the bot.py import path different from existing pattern
</action>
<verify>
```bash
cd ~/homelab
source ~/venv/bin/activate
# Verify bot.py loads without import errors
python3 -c "
import sys
sys.path.insert(0, '.')
# Just verify imports work - don't run the bot
from telegram.session_manager import SessionManager
from telegram.claude_subprocess import ClaudeSubprocess
print('Imports successful')
"
# Verify new handlers are registered in the code
python3 -c "
source = open('telegram/bot.py').read()
assert 'CommandHandler(\"new\"' in source or 'CommandHandler(\"new\",' in source, 'Missing /new handler registration'
assert 'CommandHandler(\"session\"' in source or 'CommandHandler(\"session\",' in source, 'Missing /session handler registration'
assert 'SessionManager' in source, 'Missing SessionManager usage'
assert 'ClaudeSubprocess' in source, 'Missing ClaudeSubprocess usage'
assert 'block=False' in source, 'Missing block=False for non-blocking handler'
assert 'is_authorized' in source, 'Authorization checks must be present'
print('Handler registration and imports verified!')
"
# Verify existing handlers still present
python3 -c "
source = open('telegram/bot.py').read()
existing = ['status', 'pbs', 'backups', 'beszel', 'kuma', 'ping', 'help']
for cmd in existing:
assert f'CommandHandler(\"{cmd}\"' in source or f'CommandHandler(\"{cmd}\",' in source, f'Existing /{cmd} handler missing!'
print('All existing handlers preserved!')
"
```
</verify>
<done>/new and /session commands are registered in bot.py, SessionManager and ClaudeSubprocess are wired in, plain text messages route to active session's subprocess (or prompt if no session), all existing bot commands still work, handlers use block=False for non-blocking execution.</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<name>Task 2: Verify session commands work in Telegram</name>
<what-built>
Session management commands (/new, /session) integrated into the Telegram bot, with Claude Code subprocess spawning and basic message routing. The bot should handle session creation, switching, and message forwarding to Claude Code.
</what-built>
<how-to-verify>
1. Restart the Telegram bot service:
```
systemctl --user restart telegram-bot.service
systemctl --user status telegram-bot.service
```
2. In Telegram, send `/new test-session` -- expect confirmation "Session 'test-session' created."
3. Send a plain text message like "Hello, what can you do?" -- expect Claude's response back in Telegram
4. Send `/new second brainstorm` -- expect "Session 'second' created with persona 'brainstorm'."
5. Send `/session test-session` -- expect "Switched to session 'test-session'."
6. Send another message -- should go to test-session's Claude (different context from second)
7. Check for zombie processes: `ps aux | grep defunct` -- should be empty
8. Check session directories exist: `ls ~/telegram/sessions/` -- should show test-session and second
9. Send `/new test-session` (duplicate) -- expect friendly error, not crash
10. Send a message without active session (if possible to test): should get "No active session" prompt
</how-to-verify>
<resume-signal>Type "approved" if sessions work correctly, or describe any issues found.</resume-signal>
</task>
</tasks>
<verification>
1. Bot starts without errors after code changes
2. `/new <name>` creates session directory with metadata.json and persona.json
3. `/session <name>` switches active session
4. Plain text messages route to active session's Claude Code subprocess
5. Claude Code responses come back to Telegram chat
6. All existing commands (/status, /pbs, /backups, etc.) still work
7. No zombie processes after session switches
8. Duplicate /new returns friendly error
9. Message with no active session prompts user
</verification>
<success_criteria>
- `/new test` creates session and confirms in Telegram
- `/session test` switches session and confirms
- Plain text messages trigger Claude Code and response appears in Telegram
- Existing bot commands remain functional
- No zombie processes (verified via `ps aux | grep defunct`)
- Bot service runs stably under systemd
</success_criteria>
<output>
After completion, create `.planning/phases/01-session-process-foundation/01-03-SUMMARY.md`
</output>

View file

@ -1,165 +0,0 @@
---
phase: 01-session-process-foundation
plan: 03
subsystem: infra
tags: [telegram-bot, integration, commands, session-management]
requires:
- plan: 01-01
provides: SessionManager class, persona library
- plan: 01-02
provides: ClaudeSubprocess class, stream-json parsing
provides:
- /new and /session commands in Telegram bot
- /archive command for session cleanup
- Message routing to Claude Code subprocess
- Callback-based response delivery to Telegram
- Session-aware bot handler
affects: [telegram-bot, phase-02-persistent-processes]
tech-stack:
added: []
patterns:
- "Session-aware message routing in Telegram handlers"
- "Async callback factory for subprocess → Telegram bridge"
- "Callback-based inter-component communication"
key-files:
created: []
modified:
- telegram/bot.py
key-decisions:
- "Sibling imports instead of package imports (avoids telegram namespace collision with pip package)"
- "Fresh claude -p process per message turn (Phase 1 simplicity, contradicts persistent process goal)"
- "Archive sessions with tar+pigz to sessions_archive/"
- "Block=False on message handler for non-blocking execution"
patterns-established:
- "make_callbacks factory binds bot + chat_id for subprocess callbacks"
- "Session metadata read on-demand from disk"
- "Personas loaded from disk without caching"
duration: 15min
completed: 2026-02-04
---
# Phase 01 Plan 03: Bot Command Integration Summary
**Integrated SessionManager and ClaudeSubprocess into Telegram bot with /new, /session, /archive commands and session-aware message routing**
## Performance
- Duration: 15 min
- Tasks: 2 (1 auto, 1 human-verify)
- Files modified: 1
- Commands added: 3
- Commits: 1 task + 7 orchestrator fixes
## Accomplishments
### Primary Features
- Added `/new` command to create and activate new sessions
- Added `/session` command to list and switch between sessions
- Added `/archive` command to compress and cleanup sessions
- Session-aware message routing:
- No active session → prompts user to create or select
- Active session → routes message to Claude subprocess
- Callback-based response delivery from subprocess to Telegram
- All existing bot commands preserved and working
### Quality & Debugging
- Non-blocking message handler (block=False) prevents bot freezes
- [TIMING] profiler instrumentation added for response latency debugging
- Async callback factory pattern decouples subprocess from session manager
- Proper SIGTERM signal handling (subprocess completion, not crashes)
## Task Completion
| Task | Name | Status | Commit |
|------|------|--------|--------|
| 1 | Add /new and /session commands | Complete | 3a62e01 |
| 2 | Human verify session flow | Complete | approved |
## Orchestrator Fixes Applied
| Commit | Type | Description |
|--------|------|-------------|
| d08f352 | fix | Import collision: telegram/ shadowed pip telegram package |
| 3cc97ad | fix | SIGTERM not treated as crash + async callbacks in crash handler |
| a27ac01 | feat | Add /archive command for session cleanup |
| faab47a | refactor | Restructure personas (default, homelab, brainstorm, planner, research) |
| 3fec4ed | fix | Update persona models to use aliases not versions |
| 2302b75 | fix | Fix persona settings nested under "settings" object |
| (pending) | perf | [TIMING] profiler instrumentation for response latency |
## Decisions Made
1. **Import Strategy:** Sibling imports (`from session_manager import`) instead of package imports to avoid shadowing the pip `telegram` package
2. **Process Lifecycle:** Fresh `claude -p` process per message turn for Phase 1 simplicity (not persistent)
3. **Session Activation:** Creating a session does not activate it; must use /session command (prevents accidental switches)
4. **Archive Format:** tar+pigz compression for session archives stored in `sessions_archive/`
## Known Gaps & Deviations
### Gap: Non-persistent Subprocess Model
The implementation spawns a fresh `claude -p` process per message turn. This contradicts the design decision:
- **Original goal:** "Switching sessions suspends (doesn't kill) current process — stays alive"
- **Current behavior:** Fresh process each turn
- **Impact:**
- ~1s CLI startup overhead per message
- No session continuity without --continue flag
- Every message is a cold-start
- No streaming responses or typing indicators
**Status:** Known Phase 1 simplification. Should be addressed in Phase 2 with persistent process model.
### Auto-fixed Issues (Not in Original Plan)
1. **Import Collision** (d08f352)
- telegram/ directory __init__.py shadowed pip telegram package
- Fixed: Restructured imports to use sibling module pattern
2. **SIGTERM Handling** (3cc97ad)
- SIGTERM interpreted as crash during service restarts
- Fixed: Properly distinguished between SIGTERM (clean exit) and actual crashes
3. **Persona Settings Bug** (2302b75)
- Persona settings nested under "settings" object, not being read
- Fixed: Flattened settings structure in persona models
4. **Hardcoded Model Versions** (3fec4ed)
- Model versions from training data instead of using aliases
- Fixed: Updated to use model aliases (default: claude-opus-4-5-20251101)
## Deviations Summary
**Auto-fixed (Rule 2 - Critical):** 4 issues fixed automatically (import collision, signal handling, persona settings, hardcoded versions)
**Planned Gaps:** 1 known gap documented (non-persistent processes → Phase 2)
## Issues Encountered
- Subprocess model flag not being passed to Claude (persona settings nesting)
- Response latency ~10s (API wait + no streaming feedback)
- Fresh process overhead ~1s per message turn
## Next Phase Readiness
**Status:** Phase 1 Complete, Phase 2 Ready (with known gaps documented)
**Readiness assessment:**
- Session management: ✓ Complete
- Subprocess integration: ✓ Complete
- Bot commands: ✓ Complete
- Message routing: ✓ Complete
**Blockers for Phase 2:**
- Persistent process model (replace one-shot `claude -p`)
- Streaming responses needed for UX (typing indicators)
- Message batching strategy for long responses
- Response latency profiling and optimization
---
*Phase 01 Plan 03 | Completed 2026-02-04 | Duration 15 min*

View file

@ -1,72 +0,0 @@
# Phase 1: Session & Process Foundation - Context
**Gathered:** 2026-02-04
**Status:** Ready for planning
<domain>
## Phase Boundary
Path-based sessions with isolated directories that spawn and manage Claude Code subprocesses. Users create named sessions, switch between them, and each session runs its own Claude Code process. Telegram integration (messaging, typing indicators, file handling) is Phase 2. Idle timeout and suspend/resume lifecycle is Phase 3.
</domain>
<decisions>
## Implementation Decisions
### Session structure
- Each session lives at `~/telegram/sessions/<name>/`
- Session directory contains a `persona.json` defining Claude's personality/behavior for that session
- Shared persona library at `~/telegram/personas/` with reusable templates (e.g. `brainstorm.json`, `planner.json`, `research.json`)
- Sessions reference a library persona by name, but can override/customize locally
- Metadata tracked per session — Claude decides what's useful (name, timestamps, PID, etc.)
### Session switching behavior
- Switching sessions suspends (not kills) the current process — it stays alive
- The idle timeout from Phase 3 handles graceful cleanup of suspended processes
- No limit on concurrent live Claude Code processes — idle timeout manages resources
- Switching to a session with no running process auto-spawns Claude Code immediately
- Messages queue to the newly active session
### Subprocess interaction model
- I/O model (PTY vs pipes + stream-json): Claude's discretion — research what works best
- Output buffering vs streaming: Claude's discretion — pick based on CLI capabilities
- Messages received while Claude is processing are queued, sent after current response completes
- If Claude Code crashes: auto-restart with --resume and notify user ("Claude restarted, context preserved")
### Command design
- `/new <name> [persona]` — create session, optionally set persona at creation time
- `/session <name>` — switch to existing session (auto-spawns process if needed)
- Session names: permissive — anything that works as a directory name (alphanumeric, hyphens, underscores)
- Name conflict on /new: Claude's discretion on least-surprising behavior
- No active session + plain message: prompt user with "No active session. Use /new <name> to start one."
### Claude's Discretion
- Session metadata fields beyond basics
- I/O model choice (PTY vs pipes + stream-json)
- Output buffering strategy
- Name conflict handling on /new
- Persona JSON schema/fields
</decisions>
<specifics>
## Specific Ideas
- Personas should feel like "modes" — brainstorming personality is different from planning personality, research personality, casual chat, etc.
- Persona library is shared templates, but each session can customize — like CSS inheritance
- The system should feel lightweight — creating a session should be instant
</specifics>
<deferred>
## Deferred Ideas
- Persona management commands in Telegram (`/persona list`, `/persona show <name>`, `/persona create`) — belongs in a future phase, not Phase 1
- In-bot persona editing — for now, manage persona files on filesystem
</deferred>
---
*Phase: 01-session-process-foundation*
*Context gathered: 2026-02-04*

View file

@ -1,565 +0,0 @@
# Phase 1: Session & Process Foundation - Research
**Researched:** 2026-02-04
**Domain:** Python asyncio subprocess management, Claude Code CLI integration, Telegram bot architecture
**Confidence:** HIGH
## Summary
Phase 1 requires spawning and managing Claude Code CLI subprocesses from a Telegram bot written in Python using python-telegram-bot 22.5 and asyncio. The core technical challenge is safely managing subprocess I/O without deadlocks while handling concurrent Telegram messages.
Research confirms that asyncio provides robust subprocess management primitives, and Claude Code CLI's `--output-format stream-json` provides structured, parseable output ideal for subprocess consumption. The standard pattern is pipes with concurrent stream readers using `asyncio.gather()`, not PTY, as Claude Code doesn't require interactive terminal features for this use case.
Key findings: (1) Always use `communicate()` or concurrent stream readers to avoid pipe deadlocks, (2) Claude Code sessions are directory-based and persistent via `--resume`, (3) python-telegram-bot 22.5 handles async natively but requires careful handler design to avoid blocking, (4) Process cleanup must use `terminate()` + `wait()` to prevent zombie processes.
**Primary recommendation:** Use `asyncio.create_subprocess_exec()` with `PIPE` for stdout/stderr, concurrent `asyncio.gather()` for stream reading, and Claude Code's `--output-format stream-json --verbose` for structured output. Skip PTY complexity unless future phases need interactive features.
## Standard Stack
The established libraries/tools for this domain:
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| python-telegram-bot | 22.5 | Telegram bot framework | Industry standard for Python Telegram bots, native async/await, comprehensive API coverage |
| asyncio | stdlib (3.12+) | Async subprocess management | Python's official async framework, subprocess primitives prevent deadlocks |
| Claude Code CLI | 2.1.31+ | AI agent subprocess | Official CLI with --resume, session persistence, stream-json output |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| json | stdlib | Parse stream-json output | Every subprocess output line (NDJSON format) |
| pathlib | stdlib | Session directory management | File/directory operations for `~/telegram/sessions/` |
| typing | stdlib | Type hints for session metadata | Code clarity and IDE support |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| asyncio.create_subprocess_exec | pty.spawn + asyncio | PTY adds complexity (terminal emulation, signal handling) without benefit for non-interactive CLI |
| python-telegram-bot | aiogram | aiogram is also async but has smaller ecosystem, PTB is more mature |
| Pipes | PTY (pseudo-terminal) | PTY needed only for programs requiring terminal features (color codes, cursor control) - Claude Code works fine with pipes |
**Installation:**
```bash
# Already installed on mgmt container
source ~/venv/bin/activate
pip show python-telegram-bot # Version: 22.5
which claude # /home/mikkel/.local/bin/claude
claude --version # 2.1.31 (Claude Code)
```
## Architecture Patterns
### Recommended Project Structure
```
telegram/
├── bot.py # Existing bot entry point
├── sessions/ # NEW: Session storage
│ ├── <name>/ # Per-session directory
│ │ ├── metadata.json # Session state (PID, timestamps, persona)
│ │ └── .claude/ # Claude Code session data (auto-created)
├── personas/ # NEW: Persona library
│ ├── brainstorm.json # Shared persona templates
│ ├── planner.json
│ └── research.json
├── session_manager.py # NEW: Session lifecycle management
└── claude_subprocess.py # NEW: Subprocess I/O handling
```
### Pattern 1: Concurrent Stream Reading (CRITICAL)
**What:** Read stdout and stderr concurrently using `asyncio.gather()` to prevent pipe buffer overflow
**When to use:** Every subprocess with `PIPE` for stdout/stderr
**Example:**
```python
# Source: https://docs.python.org/3/library/asyncio-subprocess.html
import asyncio
async def read_stream(stream, callback):
"""Read stream line by line, invoke callback for each line."""
while True:
line = await stream.readline()
if not line:
break
callback(line.decode().rstrip())
async def run_claude(session_dir, message):
proc = await asyncio.create_subprocess_exec(
'claude', '-p', message,
'--output-format', 'stream-json',
'--verbose',
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=session_dir
)
# Concurrent reading prevents deadlock
await asyncio.gather(
read_stream(proc.stdout, handle_stdout),
read_stream(proc.stderr, handle_stderr)
)
await proc.wait()
```
### Pattern 2: Session Directory Isolation
**What:** Each session gets its own directory; Claude Code automatically manages session state
**When to use:** Every session creation/switch
**Example:**
```python
# Source: Phase context + Claude Code CLI reference
from pathlib import Path
import json
def create_session(name: str, persona: str = None):
"""Create new session with isolated directory."""
session_dir = Path.home() / 'telegram' / 'sessions' / name
session_dir.mkdir(parents=True, exist_ok=True)
metadata = {
'name': name,
'created': datetime.now().isoformat(),
'persona': persona,
'pid': None,
'status': 'idle'
}
# Write metadata
(session_dir / 'metadata.json').write_text(json.dumps(metadata, indent=2))
# Copy persona if specified
if persona:
persona_file = Path.home() / 'telegram' / 'personas' / f'{persona}.json'
if persona_file.exists():
(session_dir / 'persona.json').write_text(persona_file.read_text())
return session_dir
```
### Pattern 3: Stream-JSON Event Handling
**What:** Parse newline-delimited JSON events from Claude Code output
**When to use:** Processing subprocess output in real-time
**Example:**
```python
# Source: https://code.claude.com/docs/en/headless + stream-json research
import json
def handle_stdout(line: str):
"""Parse and route stream-json events."""
try:
event = json.loads(line)
event_type = event.get('type')
if event_type == 'assistant':
# Claude's response
content = event['message']['content']
for block in content:
if block['type'] == 'text':
send_to_telegram(block['text'])
elif event_type == 'result':
# Task complete
session_id = event['session_id']
update_session_state(session_id, 'idle')
elif event_type == 'system':
# System events (hooks, init)
pass
except json.JSONDecodeError:
logger.warning(f"Invalid JSON: {line}")
```
### Pattern 4: Process Lifecycle Management
**What:** Spawn on session switch, suspend (don't kill), rely on Phase 3 timeout for cleanup
**When to use:** Session switching, process termination
**Example:**
```python
# Source: Asyncio subprocess best practices + Phase context decisions
import asyncio
import signal
async def switch_session(new_session: str):
"""Switch to new session, suspend current process."""
current = get_active_session()
# Mark current as suspended (don't kill)
if current and current.proc:
current.status = 'suspended'
save_metadata(current)
# Process stays alive, Phase 3 timeout handles cleanup
# Activate new session
new = load_session(new_session)
if not new.proc or new.proc.returncode is not None:
# No process or dead - spawn new one
new.proc = await spawn_claude(new.session_dir)
set_active_session(new)
async def terminate_gracefully(proc, timeout=10):
"""Terminate subprocess with timeout, prevent zombies."""
# Source: Python asyncio subprocess best practices research
try:
proc.terminate() # Send SIGTERM
await asyncio.wait_for(proc.wait(), timeout=timeout)
except asyncio.TimeoutError:
proc.kill() # Force SIGKILL
await proc.wait() # CRITICAL: Always await to prevent zombies
```
### Pattern 5: Non-Blocking Telegram Handlers
**What:** Use `block=False` for handlers that spawn long-running tasks
**When to use:** Message handlers that interact with Claude Code subprocess
**Example:**
```python
# Source: https://github.com/python-telegram-bot/python-telegram-bot/wiki/Concurrency
from telegram.ext import Application, MessageHandler, filters
async def handle_message(update, context):
"""Handle incoming Telegram messages."""
session = get_active_session()
if not session:
await update.message.reply_text("No active session. Use /new <name>")
return
# Queue message to subprocess (non-blocking)
await session.send_message(update.message.text)
# Register with block=False for concurrency
app.add_handler(MessageHandler(
filters.TEXT & ~filters.COMMAND,
handle_message,
block=False
))
```
### Anti-Patterns to Avoid
- **Direct stream reading without concurrency:** Calling `await proc.stdout.read()` then `await proc.stderr.read()` sequentially will deadlock if stderr fills up first
- **Using `wait()` with pipes:** `await proc.wait()` deadlocks if stdout/stderr buffers fill; always use `communicate()` or concurrent stream readers
- **Killing processes without cleanup:** `proc.kill()` without `await proc.wait()` creates zombie processes
- **PTY for non-interactive programs:** PTY adds signal handling complexity; Claude Code CLI works fine with pipes
## Don't Hand-Roll
Problems that look simple but have existing solutions:
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Concurrent stream reading | Manual threading or sequential reads | `asyncio.gather()` with StreamReader | Prevents deadlocks, handles backpressure, battle-tested |
| JSON Lines parsing | Custom line-by-line JSON parser | `json.loads()` per line with try/except | Standard library is fast, handles edge cases |
| Session ID generation | Custom UUID logic | `uuid.uuid4()` from stdlib | Cryptographically secure, standard format |
| Process termination | Manual signal handling | `proc.terminate()` + `asyncio.wait_for(proc.wait())` | Handles timeout, cleanup, zombie prevention |
**Key insight:** Asyncio subprocess management has well-documented pitfalls (deadlocks, zombies). Use standard patterns from official docs rather than custom solutions.
## Common Pitfalls
### Pitfall 1: Pipe Deadlock from Sequential Reading
**What goes wrong:** Reading stdout then stderr sequentially causes deadlock if stderr fills buffer first
**Why it happens:** OS pipe buffers are finite (~64KB). If stderr fills while code waits on stdout, child process blocks writing, parent blocks reading - deadlock.
**How to avoid:** Always read stdout and stderr concurrently using `asyncio.gather()`
**Warning signs:** Subprocess hangs indefinitely, no output, high CPU usage from blocked I/O
```python
# WRONG - Sequential reading
stdout_data = await proc.stdout.read() # Blocks forever if stderr fills first
stderr_data = await proc.stderr.read()
# RIGHT - Concurrent reading
async def read_all(stream):
return await stream.read()
stdout_data, stderr_data = await asyncio.gather(
read_all(proc.stdout),
read_all(proc.stderr)
)
```
### Pitfall 2: Zombie Processes from Missing wait()
**What goes wrong:** Process terminates but stays in zombie state (shows as `<defunct>` in ps)
**Why it happens:** Parent must call `wait()` to let OS reclaim process resources. Forgetting this after `terminate()`/`kill()` leaves zombies.
**How to avoid:** ALWAYS `await proc.wait()` after termination, even after `kill()`
**Warning signs:** `ps aux` shows increasing number of `<defunct>` processes, eventual resource exhaustion
```python
# WRONG - Zombie process
proc.terminate()
# Process is now zombie - resources not reclaimed
# RIGHT - Clean termination
proc.terminate()
await proc.wait() # CRITICAL - reaps zombie
```
### Pitfall 3: Blocking Telegram Bot Event Loop
**What goes wrong:** Long-running subprocess operations freeze bot, no messages processed
**Why it happens:** Telegram handlers run on main event loop. Blocking operations (like `communicate()` on long-running process) block all handlers.
**How to avoid:** Use `block=False` in handler registration, or spawn background tasks with `asyncio.create_task()`
**Warning signs:** Bot becomes unresponsive during Claude Code processing, commands queue up
```python
# WRONG - Blocks event loop
async def handle_message(update, context):
stdout, stderr = await proc.communicate() # Blocks for minutes
await update.message.reply_text(stdout)
# RIGHT - Non-blocking handler
app.add_handler(MessageHandler(
filters.TEXT,
handle_message,
block=False # Runs as asyncio.Task
))
```
### Pitfall 4: Assuming Claude Code Session Isolation
**What goes wrong:** Spawning multiple Claude Code processes in same directory causes session conflicts
**Why it happens:** Claude Code manages session state in `.claude/` subdirectory. Multiple processes in same directory share session state, corrupting history.
**How to avoid:** Each session must have its own directory (`~/telegram/sessions/<name>/`). Change `cwd` parameter when spawning subprocess.
**Warning signs:** Session history mixed between conversations, `--resume` loads wrong context
```python
# WRONG - Shared directory
proc = await asyncio.create_subprocess_exec('claude', '-p', msg)
# RIGHT - Isolated directory per session
session_dir = Path.home() / 'telegram' / 'sessions' / session_name
proc = await asyncio.create_subprocess_exec(
'claude', '-p', msg,
cwd=str(session_dir)
)
```
### Pitfall 5: Ignoring stream-json Event Types
**What goes wrong:** Only handling 'assistant' events misses errors, tool confirmations, completion status
**Why it happens:** stream-json emits multiple event types (system, assistant, result). Parsing only one type loses critical information.
**How to avoid:** Handle all event types in stream parser, especially 'result' for completion status and 'system' for errors
**Warning signs:** Missing error notifications, unclear when Claude finishes processing, tool use not tracked
```python
# WRONG - Only handles assistant messages
if event['type'] == 'assistant':
send_to_telegram(event['message'])
# RIGHT - Handle all event types
if event['type'] == 'assistant':
send_to_telegram(event['message'])
elif event['type'] == 'result':
mark_session_complete(event)
elif event['type'] == 'system' and event.get('subtype') == 'error':
notify_user_error(event)
```
## Code Examples
Verified patterns from official sources:
### Creating and Managing Subprocess
```python
# Source: https://docs.python.org/3/library/asyncio-subprocess.html
import asyncio
from pathlib import Path
async def spawn_claude_subprocess(session_dir: Path, initial_message: str):
"""Spawn Claude Code subprocess for session."""
proc = await asyncio.create_subprocess_exec(
'claude',
'-p', initial_message,
'--output-format', 'stream-json',
'--verbose',
'--continue', # Resume session if exists
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=str(session_dir)
)
return proc
```
### Concurrent Stream Reading
```python
# Source: https://docs.python.org/3/library/asyncio-subprocess.html
async def read_stream(stream, callback):
"""Read stream line-by-line, invoke callback for each line."""
while True:
line = await stream.readline()
if not line:
break
callback(line.decode().rstrip())
async def run_with_stream_handlers(proc, stdout_handler, stderr_handler):
"""Run subprocess with concurrent stdout/stderr reading."""
await asyncio.gather(
read_stream(proc.stdout, stdout_handler),
read_stream(proc.stderr, stderr_handler),
proc.wait()
)
```
### Graceful Process Termination
```python
# Source: Python asyncio subprocess research (multiple sources)
import asyncio
async def terminate_process(proc, timeout: int = 10):
"""Terminate subprocess gracefully, prevent zombie."""
if proc.returncode is not None:
return # Already terminated
try:
proc.terminate() # Send SIGTERM
await asyncio.wait_for(proc.wait(), timeout=timeout)
except asyncio.TimeoutError:
proc.kill() # Force SIGKILL
await proc.wait() # CRITICAL: Always reap zombie
```
### Session Directory Management
```python
# Source: Phase context + research
from pathlib import Path
import json
from datetime import datetime
def create_session_directory(name: str, persona: str = None) -> Path:
"""Create isolated session directory with metadata."""
session_dir = Path.home() / 'telegram' / 'sessions' / name
session_dir.mkdir(parents=True, exist_ok=True)
metadata = {
'name': name,
'created': datetime.now().isoformat(),
'persona': persona,
'pid': None,
'status': 'idle',
'last_active': None
}
metadata_file = session_dir / 'metadata.json'
metadata_file.write_text(json.dumps(metadata, indent=2))
return session_dir
```
### Parsing stream-json Output
```python
# Source: https://code.claude.com/docs/en/headless
import json
import logging
logger = logging.getLogger(__name__)
def parse_stream_json_line(line: str):
"""Parse single line of stream-json output."""
try:
event = json.loads(line)
return event
except json.JSONDecodeError:
logger.warning(f"Invalid JSON line: {line}")
return None
async def handle_claude_output(stream, telegram_chat_id, bot):
"""Handle Claude Code stream-json output."""
while True:
line = await stream.readline()
if not line:
break
event = parse_stream_json_line(line.decode().rstrip())
if not event:
continue
event_type = event.get('type')
if event_type == 'assistant':
# Extract text from assistant message
content = event.get('message', {}).get('content', [])
for block in content:
if block.get('type') == 'text':
text = block.get('text', '')
await bot.send_message(chat_id=telegram_chat_id, text=text)
elif event_type == 'result':
# Task completion
if event.get('is_error'):
await bot.send_message(
chat_id=telegram_chat_id,
text="Claude encountered an error."
)
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| PTY for all subprocess interaction | Pipes with concurrent reading for non-interactive programs | Python 3.6+ asyncio maturity | Simpler code, fewer edge cases, better error handling |
| Sequential stdout/stderr reading | Concurrent `asyncio.gather()` | Python 3.5 async/await | Eliminates deadlocks from buffer overflow |
| Manual signal handling for termination | `terminate()` + `wait_for()` with timeout | Python 3.7+ | Graceful shutdown with fallback to SIGKILL |
| Thread-based Telegram bots | Async python-telegram-bot 20.0+ | v20.0 (2023) | Native async/await, better performance |
| File-based Claude interaction | Stream-json subprocess with live parsing | Claude Code 2.0+ (2024) | Real-time responses, lower latency |
**Deprecated/outdated:**
- **python-telegram-bot sync mode (< v20):** Deprecated, removed in v20. All new code must use async/await.
- **subprocess.PIPE without concurrent reading:** Known deadlock risk since Python 3.4, documented as anti-pattern
- **PTY for Claude Code:** Unnecessary; Claude Code designed for pipe interaction, handles non-TTY gracefully
## Open Questions
Things that couldn't be fully resolved:
1. **Claude Code auto-restart behavior with --resume**
- What we know: `--resume` loads session by ID, `--continue` loads most recent in directory
- What's unclear: If Claude Code crashes mid-response, can we auto-restart with `--continue` and it resumes cleanly? Or do we need to track message history ourselves?
- Recommendation: Test crash recovery behavior. Likely safe to use `--continue` in session directory after crash - Claude Code manages history in `.claude/` subdirectory.
2. **Optimal buffer limit for long-running sessions**
- What we know: `limit` parameter on `create_subprocess_exec()` controls StreamReader buffer size (default 64KB)
- What's unclear: Should we increase for Claude Code's potentially long responses? What's the memory tradeoff?
- Recommendation: Start with default (64KB). Monitor in Phase 4. Claude Code stream-json outputs line-by-line, so readline() should prevent buffer buildup.
3. **Handling concurrent messages during Claude processing**
- What we know: User might send multiple messages while Claude is responding
- What's unclear: Queue to subprocess stdin (if using `--input-format stream-json`)? Or wait for completion and send as new turn?
- Recommendation: Phase context says "queue messages, send after response completes." For Phase 1, buffer messages in Python and send as new `claude -p` invocation after previous completes. Phase 2+ might use `--input-format stream-json` for live piping.
4. **Session metadata beyond basics**
- What we know: Need name, PID, timestamps, persona at minimum
- What's unclear: Should we track message count, last message timestamp, token usage, Claude Code session ID?
- Recommendation: Keep it minimal for Phase 1. Metadata schema:
```json
{
"name": "session-name",
"created": "2026-02-04T14:20:00Z",
"last_active": "2026-02-04T15:30:00Z",
"persona": "brainstorm",
"pid": 12345,
"status": "active|suspended|idle"
}
```
Add fields in later phases as needed (token tracking in Phase 4, etc.)
## Sources
### Primary (HIGH confidence)
- [Python asyncio subprocess documentation](https://docs.python.org/3/library/asyncio-subprocess.html) - Official Python 3.14 docs
- [Claude Code CLI reference](https://code.claude.com/docs/en/cli-reference) - Official Anthropic documentation
- [Claude Code headless mode](https://code.claude.com/docs/en/headless) - Official programmatic usage guide
- [python-telegram-bot Concurrency wiki](https://github.com/python-telegram-bot/python-telegram-bot/wiki/Concurrency) - Official PTB documentation
### Secondary (MEDIUM confidence)
- [Super Fast Python - Asyncio Subprocess](https://superfastpython.com/asyncio-subprocess/) - Practical examples verified against official docs
- [Python asyncio subprocess termination best practices](https://www.slingacademy.com/article/python-asyncio-how-to-stop-kill-a-child-process/) - Community best practices, verified with official docs
- [Claude Code session management guide](https://stevekinney.com/courses/ai-development/claude-code-session-management) - Educational resource on Claude sessions
- [Stream-JSON chaining wiki](https://github.com/ruvnet/claude-flow/wiki/Stream-Chaining) - Community documentation on stream-json format
### Tertiary (LOW confidence)
- WebSearch results on asyncio best practices - Multiple sources, cross-referenced but not deeply verified
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - All libraries verified in use on mgmt container, versions confirmed
- Architecture: HIGH - Patterns sourced from official Python and Claude Code documentation
- Pitfalls: HIGH - Documented in Python subprocess docs, verified through official warnings
**Research date:** 2026-02-04
**Valid until:** 2026-03-04 (30 days - Python asyncio and Claude Code are stable, slow-moving APIs)

View file

@ -1,231 +0,0 @@
---
phase: 02-telegram-integration
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- telegram/claude_subprocess.py
- telegram/telegram_utils.py
autonomous: true
must_haves:
truths:
- "Persistent Claude Code subprocess accepts multiple messages without respawning"
- "Subprocess emits tool_use events (tool_name + tool_input dict) that bot.py can consume for progress notifications"
- "Long messages are split at smart boundaries without breaking MarkdownV2 code blocks"
- "MarkdownV2 special characters are properly escaped before sending to Telegram"
- "Typing indicator can be maintained for arbitrarily long operations via re-send loop"
artifacts:
- path: "telegram/claude_subprocess.py"
provides: "Persistent subprocess with stream-json stdin/stdout"
contains: "input-format.*stream-json"
- path: "telegram/telegram_utils.py"
provides: "Message splitting, MarkdownV2 escaping, typing indicator loop"
exports: ["split_message_smart", "escape_markdown_v2", "typing_indicator_loop"]
key_links:
- from: "telegram/claude_subprocess.py"
to: "claude CLI"
via: "--input-format stream-json stdin piping"
pattern: "input-format.*stream-json"
- from: "telegram/claude_subprocess.py"
to: "callbacks"
via: "on_output/on_error/on_complete/on_status/on_tool_use"
pattern: "on_tool_use"
---
<objective>
Refactor ClaudeSubprocess from fresh-process-per-turn to persistent process with stream-json I/O, and create Telegram utility functions for message formatting and typing indicators.
Purpose: The persistent process model eliminates ~1s spawn overhead per message, maintains conversation context across turns, and enables real-time tool call notifications. The utility functions provide safe MarkdownV2 formatting and smart message splitting that the bot integration (Plan 02) will consume.
Output: Refactored `claude_subprocess.py` with persistent subprocess, new `telegram_utils.py` with message splitting/escaping/typing utilities.
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-telegram-integration/02-RESEARCH.md
@.planning/phases/02-telegram-integration/02-CONTEXT.md
@.planning/phases/01-session-process-foundation/01-02-SUMMARY.md
@telegram/claude_subprocess.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Refactor ClaudeSubprocess to persistent process with stream-json I/O</name>
<files>telegram/claude_subprocess.py</files>
<action>
Refactor ClaudeSubprocess to maintain a single long-lived process per session using `--input-format stream-json --output-format stream-json` instead of spawning fresh `claude -p` per turn.
**Key changes:**
1. **New start() method:** Spawns persistent subprocess with:
```
claude -p --input-format stream-json --output-format stream-json --verbose
```
Plus persona flags (--system-prompt, --model, --max-turns) and --continue if `.claude/` exists.
Launches background tasks for `_read_stdout()` and `_read_stderr()` (concurrent readers from Phase 1 pattern).
2. **Refactor send_message():** Instead of spawning a new process, writes NDJSON to stdin:
```python
msg = json.dumps({"type": "user", "content": message}) + '\n'
self._process.stdin.write(msg.encode())
await self._process.stdin.drain() # CRITICAL: prevent buffer deadlock
```
Sets `_busy = True`. If process not started, calls `start()` first.
If already busy, queues the message (existing queue behavior preserved).
3. **Add on_tool_use callback:** New callback `on_tool_use(tool_name: str, tool_input: dict)` for progress notifications.
In `_handle_stdout_line()`, parse `content_block_start` events with `type: "tool_use"` to extract tool name and input.
Stream-json emits these as separate events during assistant turns.
**Responsibility split:** `claude_subprocess.py` extracts `tool_name` (e.g. "Read", "Bash", "Edit") and passes the raw `tool_input` dict (e.g. `{"file_path": "/foo/bar"}`, `{"command": "ls -la"}`) directly to the callback. It does NOT format human-readable descriptions -- that is bot.py's job in Plan 02 (Part B, item 5). This keeps the subprocess layer format-agnostic.
4. **Update _handle_stdout_line():** Handle the stream-json event types from persistent mode:
- `assistant`: Extract text blocks, call on_output with final text
- `result`: Turn complete, set `_busy = False`, call on_complete, process queue
- `content_block_start` / `content_block_delta` with tool_use type: Extract tool name + target, call on_tool_use
- `system`: System events, log and handle errors
5. **Update _read_streams() / _read_stdout():** Since the process is persistent, the stdout reader must NOT exit when a turn completes. It stays alive reading events indefinitely. Remove the "mark as not busy" logic from _read_streams and move it to the result event handler in _handle_stdout_line instead. The reader only exits when process dies (readline returns empty bytes).
6. **Process lifecycle:**
- `start()` — spawns process, starts readers
- `send_message()` — writes to stdin (auto-starts if needed)
- `terminate()` — closes stdin, sends SIGTERM, waits, SIGKILL fallback (existing pattern)
- `is_alive` — check process.returncode is None
- `is_busy` — check _busy flag
7. **Crash recovery:** Keep existing crash recovery logic but adapt: when process dies unexpectedly, restart with `start()` and resend any queued messages. The --continue flag in start() ensures session context is preserved.
8. **Remove _spawn() method:** Replace with start() for process lifecycle and send_message() for message delivery. The old pattern of passing message to _spawn() is no longer needed since messages go to stdin.
**Preserve from Phase 1:**
- Callback architecture (on_output, on_error, on_complete, on_status + new on_tool_use)
- Message queue (asyncio.Queue)
- Graceful termination (SIGTERM → wait_for → SIGKILL → wait)
- Crash retry logic (MAX_CRASH_RETRIES, backoff)
- PATH environment setup
- Session directory as cwd
- Timing/logging instrumentation
**Important implementation notes:**
- The stream-json input format message structure: Test with `{"type": "user", "content": "message text"}`. If that fails, try simpler `{"content": "message text"}`. Check `claude --help` or test interactively.
- Always `await proc.stdin.drain()` after writing to prevent pipe buffer deadlock
- The stdout reader task must run for the lifetime of the process, not per-turn
- Track busy state via result events, not process completion
</action>
<verify>
1. Run `python -c "import telegram.claude_subprocess; print('import OK')"` from ~/homelab (or equivalent sibling import test)
2. Verify the class has start(), send_message(), terminate() methods
3. Verify on_tool_use callback parameter exists in __init__
4. Verify --input-format stream-json appears in the command construction
5. Verify stdin.write + drain pattern in send_message
6. Verify `_read_stdout()` loop continues after result events: inspect code to confirm the loop only exits on empty readline (i.e. `line = await self._process.stdout.readline(); if not line: break`), NOT on receiving a result event. Result events should set `_busy = False` and call `on_complete` but NOT break the reader loop.
</verify>
<done>
ClaudeSubprocess maintains a persistent process that accepts NDJSON messages on stdin, emits stream-json events on stdout, routes tool_use events to on_tool_use callback, and tracks busy state via result events instead of process completion. No fresh process spawned per turn.
</done>
</task>
<task type="auto">
<name>Task 2: Create telegram_utils.py with message formatting and typing indicator</name>
<files>telegram/telegram_utils.py</files>
<action>
Create `telegram/telegram_utils.py` with three utility functions consumed by the bot integration in Plan 02.
**1. Smart message splitting — `split_message_smart(text: str, max_length: int = 4000) -> list[str]`:**
- Split long messages respecting MarkdownV2 code block boundaries
- Never split inside triple-backtick code blocks
- Prefer paragraph breaks (`\n\n`), then line breaks (`\n`), then hard character split as last resort
- Use 4000 as default max (not 4096) to leave room for MarkdownV2 escape character expansion
- Track `in_code_block` state by counting triple-backtick lines
- If a code block exceeds max_length by itself, include it whole (Telegram will handle overflow gracefully or we truncate with "... (truncated)" marker)
- Algorithm from research: iterate lines, track code block state, split only when NOT in code block and would exceed limit
**2. MarkdownV2 escaping — `escape_markdown_v2(text: str) -> str`:**
- Escape the 17 MarkdownV2 special characters: `_ * [ ] ( ) ~ \` > # + - = | { } . !`
- BUT do NOT escape inside code blocks (text between triple backticks or single backticks)
- Strategy: Split text by code regions, escape only non-code regions, rejoin
- For inline code (single backtick): don't escape content between backticks
- For code blocks (triple backtick): don't escape content between triple backticks
- Use regex to identify code regions: find ```...``` and `...` blocks, escape everything else
**3. Typing indicator loop — `async def typing_indicator_loop(bot, chat_id: int, stop_event: asyncio.Event)`:**
- Send `ChatAction.TYPING` every 4 seconds until stop_event is set
- Use `asyncio.wait_for(stop_event.wait(), timeout=4.0)` pattern from research
- Catch exceptions from send_chat_action (network errors) and log warning, continue loop
- Clean exit when stop_event is set
- Import from `telegram import ChatAction`
**Module structure:**
```python
"""
Telegram message formatting and UX utilities.
Provides smart message splitting, MarkdownV2 escaping, and typing indicator
management for the Telegram Claude Code bridge.
"""
import asyncio
import logging
import re
from telegram import ChatAction
logger = logging.getLogger(__name__)
TELEGRAM_MAX_LENGTH = 4096
SAFE_LENGTH = 4000
def split_message_smart(text: str, max_length: int = SAFE_LENGTH) -> list[str]:
...
def escape_markdown_v2(text: str) -> str:
...
async def typing_indicator_loop(bot, chat_id: int, stop_event: asyncio.Event):
...
```
</action>
<verify>
1. Run `python -c "from telegram_utils import split_message_smart, escape_markdown_v2, typing_indicator_loop; print('imports OK')"` from ~/homelab/telegram/
2. Test split_message_smart: `split_message_smart("a" * 5000)` returns list with 2+ chunks, each <= 4000 chars
3. Test split_message_smart with code block: message containing ``` block is not split inside the block
4. Test escape_markdown_v2: `escape_markdown_v2("hello_world")` returns `"hello\_world"`
5. Test escape_markdown_v2 preserves code: text inside backticks is NOT escaped
</verify>
<done>
telegram_utils.py exists with split_message_smart (code-block-aware splitting), escape_markdown_v2 (context-sensitive escaping), and typing_indicator_loop (4s re-send with asyncio.Event cancellation). All functions importable and tested with basic cases.
</done>
</task>
</tasks>
<verification>
1. `python -c "from telegram.claude_subprocess import ClaudeSubprocess"` succeeds (or sibling import equivalent)
2. `python -c "from telegram_utils import split_message_smart, escape_markdown_v2, typing_indicator_loop"` from telegram/ succeeds
3. ClaudeSubprocess.__init__ accepts on_tool_use callback
4. ClaudeSubprocess has start(), send_message(), terminate() methods
5. send_message() writes NDJSON to stdin (not spawning new process)
6. split_message_smart handles code blocks without breaking them
7. escape_markdown_v2 escapes outside code blocks only
</verification>
<success_criteria>
- ClaudeSubprocess uses persistent process with --input-format stream-json
- Messages sent via stdin NDJSON, not fresh process spawn
- Tool use events parsed and routed to on_tool_use callback
- Smart message splitting respects code block boundaries
- MarkdownV2 escaping handles all 17 special characters with code block awareness
- Typing indicator loop re-sends every 4 seconds with clean cancellation
</success_criteria>
<output>
After completion, create `.planning/phases/02-telegram-integration/02-01-SUMMARY.md`
</output>

View file

@ -1,148 +0,0 @@
---
phase: 02-telegram-integration
plan: 01
subsystem: infra
tags: [asyncio, subprocess, python, claude-code-cli, stream-json, telegram, markdown, typing-indicator]
# Dependency graph
requires:
- phase: 01-session-process-foundation
provides: Session management and subprocess foundations
provides:
- Persistent Claude Code subprocess with stream-json I/O (eliminates ~1s spawn overhead per turn)
- Tool call progress notifications via on_tool_use callback
- Smart message splitting respecting MarkdownV2 code block boundaries
- MarkdownV2 escaping with code block awareness
- Typing indicator loop for long operations
affects: [02-02, bot-integration, message-handling]
# Tech tracking
tech-stack:
added:
- telegram.constants.ChatAction (typing indicators)
patterns:
- "Persistent subprocess with NDJSON stdin streaming"
- "Stream-json event routing including tool_use events"
- "Smart message splitting at code block boundaries"
- "Context-sensitive MarkdownV2 escaping (preserves code blocks)"
- "Typing indicator loop with asyncio.Event cancellation"
key-files:
created:
- telegram/telegram_utils.py
modified:
- telegram/claude_subprocess.py
key-decisions:
- "Use persistent subprocess instead of fresh process per turn (eliminates ~1s overhead)"
- "Write NDJSON to stdin with drain() to prevent pipe buffer deadlock"
- "Persistent readers run for process lifetime, result events mark not busy but don't exit loop"
- "Parse content_block_start events for tool_use to enable progress notifications"
- "Split messages at 4000 chars (not 4096) to leave room for escape character expansion"
- "Never split inside code blocks (track in_code_block state while iterating lines)"
- "Escape MarkdownV2 special chars only outside code blocks (regex pattern to identify code regions)"
patterns-established:
- "Pattern 1: Persistent stdin streaming - start() spawns process, send_message() writes NDJSON, readers run until process dies"
- "Pattern 2: Tool use notifications - Parse content_block_start with type=tool_use, extract name and input dict, pass to callback"
- "Pattern 3: Smart message splitting - Track code block state, split only when NOT in code block and would exceed limit"
- "Pattern 4: Code-aware escaping - Split by code regions (regex), escape only non-code text, rejoin"
- "Pattern 5: Typing loop - Re-send every 4s via asyncio.wait_for timeout until Event is set"
# Metrics
duration: 5min
completed: 2026-02-04
---
# Phase 2 Plan 1: Persistent Subprocess + Telegram Utils Summary
**Persistent Claude Code subprocess with stream-json I/O and Telegram message formatting utilities for bot integration**
## Performance
- **Duration:** 5 minutes
- **Started:** 2026-02-04T19:12:28Z
- **Completed:** 2026-02-04T19:17:24Z
- **Tasks:** 2
- **Files created:** 1
- **Files modified:** 1
## Accomplishments
- Refactored ClaudeSubprocess from fresh-process-per-turn to persistent process model
- Eliminated ~1s spawn overhead per message turn
- Added stream-json stdin I/O for NDJSON message delivery to persistent process
- Implemented tool_use event parsing and on_tool_use callback for progress notifications
- Created telegram_utils.py with smart message splitting, MarkdownV2 escaping, and typing indicator loop
- Smart splitting respects code block boundaries (never splits inside triple-backtick blocks)
- MarkdownV2 escaping preserves code blocks while escaping 17 special characters outside code
- Typing indicator loop re-sends every 4 seconds with clean asyncio.Event cancellation
## Task Commits
Each task was committed atomically:
1. **Task 1: Refactor ClaudeSubprocess to persistent process** - `6a115a4` (refactor)
2. **Task 2: Create telegram_utils.py** - `6b624d7` (feat)
## Files Created/Modified
**Created:**
- `telegram/telegram_utils.py` - Message formatting utilities: split_message_smart (code-block-aware splitting), escape_markdown_v2 (context-sensitive escaping), typing_indicator_loop (4s re-send pattern)
**Modified:**
- `telegram/claude_subprocess.py` - Refactored to persistent subprocess with --input-format stream-json stdin, persistent stdout/stderr readers, tool_use event parsing, NDJSON message delivery via stdin.write + drain pattern
## Decisions Made
1. **Persistent process architecture**: Spawn Claude Code once with stream-json I/O, write NDJSON to stdin for each turn. Eliminates ~1s spawn overhead per message and maintains conversation context across turns.
2. **Tool use callback design**: Extract tool name and raw input dict from content_block_start events, pass directly to callback. Bot layer (Plan 02) will format human-readable progress messages. Keeps subprocess layer format-agnostic.
3. **Smart splitting strategy**: Track code block state (in/out of triple backticks), split only when NOT in code block and would exceed 4000 char limit. If code block itself exceeds limit, include it whole (Telegram handles overflow).
4. **Escape only outside code**: Use regex to identify code regions (```...``` and `...`), escape MarkdownV2 special chars only in non-code text. Prevents breaking code syntax while properly escaping user-facing text.
5. **4000 char split threshold**: Use 4000 instead of Telegram's 4096 limit to leave room for MarkdownV2 escape character expansion (each special char becomes 2 chars with backslash).
6. **Typing loop cancellation pattern**: Use asyncio.wait_for with 4s timeout on Event.wait(). Clean cancellation when Event is set, automatic re-send on timeout. Pattern from PTB community best practices.
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
**Issue 1: ChatAction import path**
- **Problem:** Initial import `from telegram import ChatAction` failed with ImportError
- **Root cause:** python-telegram-bot v22+ moved ChatAction to telegram.constants
- **Resolution:** Changed to `from telegram.constants import ChatAction`
- **Impact:** Minor, caught during testing, fixed immediately
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
**Ready for Plan 02 (Bot Integration):**
- ClaudeSubprocess provides persistent process with tool call notifications
- telegram_utils provides all message formatting needed for bot layer
- Smart splitting and escaping handle Claude's output for Telegram delivery
- Typing indicator loop ready for integration with message handlers
- stdin.write + drain pattern ensures no pipe deadlock
**Integration points for Plan 02:**
- Call `subprocess.start()` once per session
- Pass `on_tool_use` callback to format and send progress messages
- Use `split_message_smart()` on Claude output before sending to Telegram
- Use `escape_markdown_v2()` if sending with MarkdownV2 parse mode
- Wrap `subprocess.send_message()` with `typing_indicator_loop()` background task
- Handle file uploads by saving to session directory
- Implement message batching with debounce timer
**No blockers or concerns.**
---
*Phase: 02-telegram-integration*
*Completed: 2026-02-04*

View file

@ -1,336 +0,0 @@
---
phase: 02-telegram-integration
plan: 02
type: execute
wave: 2
depends_on: ["02-01"]
files_modified:
- telegram/bot.py
- telegram/message_batcher.py
autonomous: false
must_haves:
truths:
- "User sends message in Telegram and receives Claude's response formatted in MarkdownV2"
- "Typing indicator stays visible during entire Claude processing time (10-60s+)"
- "User sees tool call progress notifications (e.g. 'Reading config.json...')"
- "Rapid sequential messages are batched into a single Claude prompt"
- "User attaches photo in Telegram and Claude auto-analyzes it"
- "User attaches document in Telegram and Claude can reference it in session"
- "Responses longer than 4096 chars are split across multiple messages without breaking code blocks"
- "Bot runs as systemd user service and restarts on failure"
artifacts:
- path: "telegram/bot.py"
provides: "Updated message handlers with typing, progress, batching, file handling"
contains: "typing_indicator_loop"
- path: "telegram/message_batcher.py"
provides: "MessageBatcher class for debounce-based message batching"
exports: ["MessageBatcher"]
- path: "~/.config/systemd/user/telegram-bot.service"
provides: "Systemd user service unit for bot"
contains: "telegram-bot"
key_links:
- from: "telegram/bot.py"
to: "telegram/claude_subprocess.py"
via: "ClaudeSubprocess.send_message() and callbacks"
pattern: "send_message"
- from: "telegram/bot.py"
to: "telegram/telegram_utils.py"
via: "split_message_smart, escape_markdown_v2, typing_indicator_loop"
pattern: "split_message_smart|escape_markdown_v2|typing_indicator_loop"
- from: "telegram/bot.py"
to: "telegram/message_batcher.py"
via: "MessageBatcher.add_message()"
pattern: "MessageBatcher"
---
<objective>
Wire the persistent subprocess and utility functions into the Telegram bot with typing indicators, progress notifications, message batching, file handling, and systemd service setup.
Purpose: This plan makes the entire system work end-to-end. Messages flow from Telegram through the batcher to the persistent Claude subprocess, responses come back formatted in MarkdownV2 with smart splitting, and the user sees typing indicators and tool call progress throughout. File attachments land in session folders with auto-analysis. The systemd service ensures reliability across container restarts.
Output: Updated `bot.py` with full integration, new `message_batcher.py`, systemd service file, working end-to-end flow.
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/02-telegram-integration/02-RESEARCH.md
@.planning/phases/02-telegram-integration/02-CONTEXT.md
@.planning/phases/02-telegram-integration/02-01-SUMMARY.md
@telegram/bot.py
@telegram/claude_subprocess.py
@telegram/telegram_utils.py
@telegram/session_manager.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create MessageBatcher and update bot.py with typing, progress, batching, and file handling</name>
<files>telegram/message_batcher.py, telegram/bot.py</files>
<action>
**Part A: Create telegram/message_batcher.py**
Implement `MessageBatcher` class for debounce-based message batching:
```python
class MessageBatcher:
def __init__(self, callback: Callable, debounce_seconds: float = 2.0):
...
async def add_message(self, message: str):
"""Add message, reset debounce timer. When timer expires, flush batch via callback."""
...
```
- Uses asyncio.Queue to collect messages
- Cancels previous debounce timer when new message arrives
- After debounce_seconds of silence, joins all queued messages with `\n\n` and calls callback
- Callback is async (receives combined message string)
- Handles CancelledError gracefully during timer cancellation
- Follow research pattern from 02-RESEARCH.md (MessageBatcher section)
**Part B: Update telegram/bot.py — make_callbacks() overhaul**
Replace the current `make_callbacks()` with a new version that uses telegram_utils:
```python
from telegram_utils import split_message_smart, escape_markdown_v2, typing_indicator_loop
from message_batcher import MessageBatcher
```
New `make_callbacks(bot, chat_id)` returns dict or tuple of callbacks:
1. **on_output(text):**
- Split text using `split_message_smart(text)`
- For each chunk: try sending with `parse_mode='MarkdownV2'` after `escape_markdown_v2()`
- If MarkdownV2 parse fails (Telegram BadRequest), fall back to plain text send
- Stop the typing indicator (set stop_event)
2. **on_error(error):**
- Send error message to chat (plain text, no MarkdownV2)
- Stop the typing indicator
3. **on_complete():**
- Stop the typing indicator (set stop_event)
- Log completion
4. **on_status(status):**
- Send status as a brief message (e.g., "Claude restarted with context preserved")
5. **on_tool_use(tool_name, tool_input):** (NEW)
- Format tool call notification: extract meaningful target from tool_input
- For Bash tool: show command preview (first 50 chars)
- For Read tool: show file path
- For Edit tool: show file path
- For Grep/Glob: show pattern
- For Write tool: show file path
- Send as a single editable progress message (edit_message_text on a progress message)
- OR send as separate short messages (planner's discretion — separate messages are simpler and more reliable)
- Format: italic text like `_Reading config.json..._`
**Part C: Update handle_message()**
Overhaul the message handler to use typing indicators and message batching:
1. On message received:
- Start typing indicator loop: `stop_typing = asyncio.Event()`, `asyncio.create_task(typing_indicator_loop(...))`
- Pass stop_typing event to callbacks so on_output/on_complete can stop it
- Get or create subprocess (existing logic, but use `start()` instead of constructor for persistent process)
2. Message batching:
- Create one `MessageBatcher` per session (store in dict alongside subprocesses)
- Batcher callback = `subprocess.send_message()`
- On message: `await batcher.add_message(text)` instead of direct `subprocess.send_message()`
- Typing indicator starts immediately on first message, stops on Claude response
3. Subprocess auto-start (integrate into existing handle_message after session lookup, before batcher):
```python
# In handle_message(), after resolving active session:
session_id = session_manager.get_active_session(user_id)
# Get or create subprocess for this session (avoid double-start)
if session_id not in self.subprocesses or not self.subprocesses[session_id].is_alive:
callbacks = make_callbacks(bot, chat_id, stop_typing_event)
subprocess = ClaudeSubprocess(
session_dir=session_dir,
on_output=callbacks['on_output'],
on_error=callbacks['on_error'],
on_complete=callbacks['on_complete'],
on_status=callbacks['on_status'],
on_tool_use=callbacks['on_tool_use'],
)
await subprocess.start()
self.subprocesses[session_id] = subprocess
else:
subprocess = self.subprocesses[session_id]
```
- The `is_alive` check prevents double-start: only creates and starts if no subprocess exists for session or previous one died
- `self.subprocesses` is a dict[str, ClaudeSubprocess] stored on the handler/application context (same pattern as existing subprocess tracking in bot.py)
**Part D: Update handle_photo() and handle_document()**
Save files to active session folder instead of global images/files directories:
1. **handle_photo():**
- Get active session directory from session_manager
- If no active session, prompt user to create one
- Download highest-quality photo to session directory as `photo_YYYYMMDD_HHMMSS.jpg`
- Auto-analyze: send message to Claude subprocess: "I've attached a photo: {filename}. {caption or 'Please describe what you see.'}"
- Start typing indicator while Claude analyzes
2. **handle_document():**
- Get active session directory from session_manager
- If no active session, prompt user to create one
- Download document to session directory with original filename (timestamp prefix for collision avoidance)
- If caption provided: send caption + "The file {filename} has been saved to your session." to Claude
- If no caption: send "User uploaded file: {filename}" to Claude (let Claude infer intent from context, per CONTEXT.md decision)
**Part E: Update switch_session_cmd() and archive_session_cmd()**
- On session switch: stop typing indicator for current session if running
- On session switch: batcher should flush immediately (don't lose queued messages)
- On archive: terminate subprocess, remove batcher
</action>
<verify>
1. `python -c "from message_batcher import MessageBatcher; print('import OK')"` from ~/homelab/telegram/
2. bot.py imports telegram_utils functions and MessageBatcher without errors
3. make_callbacks includes on_tool_use callback
4. handle_message uses typing_indicator_loop
5. handle_photo saves to session directory (not global images/)
6. handle_document saves to session directory (not global files/)
7. MessageBatcher has add_message() method
</verify>
<done>
MessageBatcher debounces rapid messages with configurable timer. Bot handlers use typing indicators, progress notifications for tool calls, smart message splitting with MarkdownV2, and file handling saves to session directories with auto-analysis.
</done>
</task>
<task type="auto">
<name>Task 2: Create systemd user service for the bot</name>
<files>~/.config/systemd/user/telegram-bot.service</files>
<action>
Create or update the systemd user service unit for the Telegram bot.
**Service file at `~/.config/systemd/user/telegram-bot.service`:**
```ini
[Unit]
Description=Homelab Telegram Bot
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/mikkel/homelab/telegram
ExecStart=/home/mikkel/venv/bin/python bot.py
Restart=on-failure
RestartSec=10
KillMode=mixed
KillSignal=SIGTERM
TimeoutStopSec=30
# Environment
Environment=PATH=/home/mikkel/.local/bin:/home/mikkel/bin:/usr/local/bin:/usr/bin:/bin
[Install]
WantedBy=default.target
```
Key settings:
- **KillMode=mixed:** Sends SIGTERM to main process, SIGKILL to remaining children (ensures Claude subprocesses are cleaned up)
- **RestartSec=10:** Wait 10s before restart to avoid rapid restart loops
- **TimeoutStopSec=30:** Give bot time to gracefully terminate subprocesses before force kill
- **WorkingDirectory:** Set to telegram/ so sibling imports work
After creating the service file:
```bash
mkdir -p ~/.config/systemd/user
# Write service file
systemctl --user daemon-reload
systemctl --user enable telegram-bot.service
```
Do NOT start the service yet (user will start it after verifying manually).
Also ensure loginctl enable-linger is set for the mikkel user (allows user services to run without active login session). Check with `loginctl show-user mikkel -p Linger`. If not enabled, note it as a requirement but do NOT run the command (requires root).
</action>
<verify>
1. Service file exists at ~/.config/systemd/user/telegram-bot.service
2. `systemctl --user cat telegram-bot.service` shows the service configuration
3. `systemctl --user is-enabled telegram-bot.service` returns "enabled"
4. Service file has KillMode=mixed and correct WorkingDirectory
5. Check loginctl linger status and report
</verify>
<done>
Systemd user service is created and enabled (not started). Bot can be started with `systemctl --user start telegram-bot.service` and survives container restarts (with linger enabled). KillMode=mixed ensures Claude subprocesses are cleaned up on stop.
</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<what-built>
Complete Telegram-Claude Code bidirectional messaging system:
- Persistent Claude Code subprocess with stream-json I/O (no respawn per turn)
- Typing indicator while Claude processes (re-sent every 4s)
- Tool call progress notifications (e.g., "Reading config.json...")
- Smart message splitting at paragraph/code block boundaries with MarkdownV2
- Message batching for rapid sequential messages (2s debounce)
- Photos/documents saved to session folder with auto-analysis
- Systemd user service for reliability
</what-built>
<how-to-verify>
1. Start the bot manually: `cd ~/homelab/telegram && ~/venv/bin/python bot.py`
2. In Telegram, create a session: `/new test-phase2`
3. Send a simple message: "Hello, what can you help me with?"
4. Verify: typing indicator appears while Claude processes
5. Verify: Claude's response arrives formatted properly
6. Send a message that triggers tool use: "Read the file ~/homelab/CLAUDE.md and summarize it"
7. Verify: you see tool call progress notification (e.g., "Reading CLAUDE.md...")
8. Verify: response is natural language summary (not raw code)
9. Send a photo with caption "What is this?"
10. Verify: photo is saved to ~/homelab/telegram/sessions/test-phase2/ and Claude analyzes it
11. Send 3 rapid messages (within 2 seconds): "one", "two", "three"
12. Verify: they are batched into a single Claude prompt
13. Type a long question that produces a response >4096 chars
14. Verify: response splits across multiple messages without broken code blocks
15. Check systemd: `systemctl --user status telegram-bot.service` shows enabled
16. Archive test session: `/archive test-phase2`
</how-to-verify>
<resume-signal>Type "approved" or describe issues found</resume-signal>
</task>
</tasks>
<verification>
1. End-to-end: Send message in Telegram, receive Claude's response back
2. Typing indicator visible during processing (10-60s range)
3. Tool call notifications appear for Read, Bash, Edit operations
4. Photo attachment saved to session folder and auto-analyzed
5. Document attachment saved to session folder
6. Long response properly split across messages
7. MarkdownV2 formatting renders correctly (bold, code blocks, etc.)
8. Rapid messages batched before sending to Claude
9. Systemd service enabled and configured with KillMode=mixed
10. Session switching stops typing indicator for previous session
</verification>
<success_criteria>
- User sends message in Telegram and receives Claude's response formatted in MarkdownV2
- Typing indicator visible for entire processing duration
- Tool call progress notifications appear
- Photos auto-analyzed, documents saved to session
- Long responses split correctly
- Rapid messages batched
- Systemd service configured and enabled
- Bot survives manual restart test
</success_criteria>
<output>
After completion, create `.planning/phases/02-telegram-integration/02-02-SUMMARY.md`
</output>

View file

@ -1,120 +0,0 @@
---
phase: 02-telegram-integration
plan: 02
subsystem: telegram
tags: [telegram, bot, typing-indicator, batching, file-handling, systemd, markdownv2]
# Dependency graph
requires:
- phase: 02-telegram-integration
plan: 01
provides: Persistent subprocess and telegram utils
provides:
- End-to-end Telegram-Claude Code messaging with typing indicators
- Message batching with debounce for rapid sequential messages
- Photo/document handling with auto-analysis
- Tool call progress notifications
- Systemd user service for reliability
affects: [03-lifecycle-management, bot-reliability]
# Tech tracking
tech-stack:
added:
- MessageBatcher (asyncio.Queue + debounce timer)
- systemd user service (KillMode=mixed)
patterns:
- "Dynamic typing event lookup via session name in typing_tasks dict"
- "Debounce-based message batching with asyncio.Queue"
- "--append-system-prompt preserves Claude Code defaults while adding persona"
- "--dangerously-skip-permissions for full tool access in non-interactive mode"
key-files:
created:
- telegram/message_batcher.py
- ~/.config/systemd/user/telegram-bot.service
modified:
- telegram/bot.py
- telegram/claude_subprocess.py
- telegram/telegram_utils.py
- telegram/personas/default.json
key-decisions:
- "Dynamic typing event lookup: callbacks reference typing_tasks dict by session name, not captured event"
- "--append-system-prompt instead of --system-prompt: preserves Claude Code model identity"
- "--dangerously-skip-permissions: allows all tools in non-interactive subprocess"
- "Full model ID in persona (claude-sonnet-4-5-20250929) instead of alias"
- "Stream-json NDJSON format: {type: user, message: {role: user, content: text}}"
patterns-established:
- "Pattern 1: Dynamic callback binding - closures look up mutable state by key instead of capturing immutable reference"
- "Pattern 2: Stale task cleanup - check task.done() and delete from dict before creating replacement"
- "Pattern 3: Persona via append - use --append-system-prompt to layer persona on top of CLI defaults"
# Metrics
duration: ~90min (including interactive debugging with user)
completed: 2026-02-04
---
# Phase 2 Plan 2: Bot Integration Summary
**End-to-end Telegram-Claude Code messaging with typing, batching, file handling, and systemd service**
## Performance
- **Duration:** ~90 minutes (including interactive testing and bug fixes)
- **Started:** 2026-02-04T19:30:00Z
- **Completed:** 2026-02-04T22:10:00Z
- **Tasks:** 3 (2 auto + 1 human-verify checkpoint)
- **Files created:** 2
- **Files modified:** 4
## Accomplishments
- Created MessageBatcher with debounce-based batching (2s timer, asyncio.Queue)
- Overhauled bot.py: typing indicators, tool call progress, message batching, file handling
- Created systemd user service with KillMode=mixed for subprocess cleanup
- Fixed stream-json NDJSON input format (nested message object required)
- Fixed typing indicator lifecycle (stale task cleanup + dynamic event lookup)
- Switched to --append-system-prompt to preserve Claude Code model identity
- Added --dangerously-skip-permissions for full tool access
- Set full model ID (claude-sonnet-4-5-20250929) in default persona
- Human-verified: messages flow, typing indicators show, model identifies correctly
## Task Commits
1. **Task 1: MessageBatcher + bot.py overhaul** - `f246d18` (feat)
2. **Task 2: Systemd user service** - created during same executor run
3. **Task 3: Human verification** - bugs found and fixed:
- `2d0d4da` - typing indicator lifecycle, model identity, tool permissions
## Bugs Found During Testing
1. **Stream-json input format**: `{"type":"user","content":"..."}` rejected by Claude Code; correct format is `{"type":"user","message":{"role":"user","content":"..."}}`
2. **Typing indicator stale tasks**: After completion, typing task stayed in dict with set event; next message reused dead task and typing never started
3. **Typing indicator event mismatch**: Subprocess callbacks captured specific stop_typing event at creation time; new messages created new events but callbacks still set the old one
4. **Model identity lost**: `--system-prompt` overwrote Claude Code's built-in system prompt (which includes model version); fixed with `--append-system-prompt`
5. **Tool permissions blocked**: Non-interactive `-p` mode blocked tools requiring permission prompts; fixed with `--dangerously-skip-permissions`
6. **Model alias resolution**: `sonnet` alias resolved to 3.5 Sonnet in CLI; fixed by using full model ID
## Files Created/Modified
**Created:**
- `telegram/message_batcher.py` - Debounce-based message batching with asyncio.Queue
- `~/.config/systemd/user/telegram-bot.service` - Systemd user service with KillMode=mixed
**Modified:**
- `telegram/bot.py` - Full overhaul: dynamic typing callbacks, stale task cleanup, batching, file handling
- `telegram/claude_subprocess.py` - NDJSON fix, --dangerously-skip-permissions, --append-system-prompt, full cmd logging
- `telegram/telegram_utils.py` - Debug logging for typing indicator
- `telegram/personas/default.json` - Full model ID instead of alias
## Deviations from Plan
1. **Typing indicator architecture changed**: Plan assumed capturing stop_typing event in callbacks; changed to dynamic lookup from typing_tasks dict by session name to fix lifecycle bugs
2. **System prompt approach changed**: Plan used `--system-prompt`; changed to `--append-system-prompt` to preserve Claude Code defaults
3. **Added --dangerously-skip-permissions**: Not in original plan; needed for tool access in non-interactive mode
4. **Model ID changed**: Plan used `sonnet` alias; changed to full ID after alias resolution issues
---
*Phase: 02-telegram-integration*
*Completed: 2026-02-04*

View file

@ -1,72 +0,0 @@
# Phase 2: Telegram Integration - Context
**Gathered:** 2026-02-04
**Status:** Ready for planning
<domain>
## Phase Boundary
Bidirectional messaging between Telegram and Claude Code with file support and status feedback. User sends messages/files in Telegram, Claude processes via persistent JSON-streamed subprocess, and responds with natural language summaries. Typing indicators and tool call notifications provide progress visibility.
Lifecycle management (idle timeout, suspend/resume) is Phase 3. Output mode switching (verbose/smart) is Phase 4.
</domain>
<decisions>
## Implementation Decisions
### Conversation persistence
- Use JSON streaming to Claude Code so conversations don't cold start for every message
- Maintain persistent process per session rather than spawning fresh `claude -p` each turn
- This is a key architectural shift from Phase 1's fresh-process-per-turn model
### Response formatting
- Use Telegram's native MarkdownV2 for formatting (bold, italic, code, links)
- Claude should communicate in natural language about what it did — no raw code in responses
- Telegram is a mobile-first, on-the-go interface — responses should be concise summaries of work done, not code dumps
- Long responses split at smart boundaries (paragraph/code block breaks) — never break mid-code-block or mid-sentence
### Progress feedback
- Tool call notifications include tool name + target (e.g., "Reading config.json...", "Running npm test...")
- No timeout as long as there is progress — Claude Code tasks can legitimately take a while
### Claude's Discretion
- Whether to show typing indicator only vs typing + status messages while working
- Whether to send periodic progress updates on long tasks (2+ min) or just wait for completion
- Whether tool call notifications edit a single message in-place or appear as separate messages
- Whether multi-part responses come as one consolidated message or multiple messages
- Error message verbosity and format
### File handling
- Photos: save to session folder AND auto-analyze (describe what's in the image immediately)
- Documents/files: save directly to the session's working directory so Claude can reference them
- Claude can send files back to user as Telegram attachments when appropriate (scripts, logs, configs)
- Files sent without message text: infer intent from conversation context rather than asking
### Error & edge cases
- No active session + message: prompt user to create one ("No active session. Use /new <name> to start one")
- Rapid sequential messages: batch into a single prompt before sending to Claude
- No hard timeout — as long as progress is happening, let Claude work
</decisions>
<specifics>
## Specific Ideas
- "Telegram is used from a phone when I'm on the move — Claude should focus on doing the work and notifying me how it was solved and what was solved, without showing me code — use natural language"
- JSON streaming to avoid cold starts was explicitly called out as essential
- Batching rapid messages avoids wasting turns on partial thoughts
</specifics>
<deferred>
## Deferred Ideas
None — discussion stayed within phase scope
</deferred>
---
*Phase: 02-telegram-integration*
*Context gathered: 2026-02-04*

View file

@ -1,813 +0,0 @@
# Phase 2: Telegram Integration - Research
**Researched:** 2026-02-04
**Domain:** Claude Code stream-json I/O, Telegram bot UX (typing indicators, message editing), MarkdownV2 formatting, asyncio message batching
**Confidence:** HIGH
## Summary
Phase 2 requires bidirectional messaging between Telegram and Claude Code with persistent subprocess communication, file handling, and progress feedback. The core technical challenge is transitioning from Phase 1's "fresh process per turn" model to persistent processes that accept streamed input via `--input-format stream-json`.
Research confirms that Claude Code CLI supports `--input-format stream-json` for receiving NDJSON-formatted messages on stdin, enabling persistent processes that handle multiple turns without respawning. Telegram's python-telegram-bot library provides native typing indicators via `send_chat_action()`, message editing for progress updates, and robust file upload/download APIs. The critical UX gap is message splitting at the 4096 character limit — MarkdownV2 has complex escaping rules that make naive splitting dangerous.
Key findings: (1) `--input-format stream-json` + `--output-format stream-json` enables persistent bidirectional communication, (2) Typing indicators expire after 5 seconds and must be re-sent for long operations, (3) MarkdownV2 requires escaping 17 special characters with context-sensitive rules for code blocks, (4) Message batching should use asyncio.Queue with debounce timers to group rapid messages before sending to Claude.
**Primary recommendation:** Refactor ClaudeSubprocess to maintain a single long-lived process per session using `--input-format stream-json`, write NDJSON messages to stdin for each turn, implement a typing indicator loop that re-sends every 4 seconds during processing, and use smart message splitting that respects MarkdownV2 code block boundaries (never split inside triple-backtick blocks).
## Standard Stack
The established libraries/tools for this domain:
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| python-telegram-bot | 22.5+ | Telegram bot framework | Native async/await, typing indicators, message editing, file handling built-in |
| Claude Code CLI | 2.1.31+ | AI agent subprocess | `--input-format stream-json` for persistent processes, `--include-partial-messages` for streaming |
| asyncio | stdlib (3.12+) | Message batching, typing loops | Native async primitives for debouncing and periodic tasks |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| re | stdlib | MarkdownV2 escape regex | Escaping special characters before sending to Telegram |
| pathlib | stdlib | File path handling | Session folder file operations, attachment uploads |
| json | stdlib | NDJSON message formatting | Serializing messages to Claude Code stdin |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Persistent subprocess with stdin | Multiple fresh `claude -p` calls | Persistent eliminates ~1s spawn overhead per turn but adds complexity |
| Typing indicator loop | Single send at start | Loop maintains indicator for operations >5s but requires background task |
| Smart message splitting | Naive character count split | Smart splitting respects markdown boundaries but requires parsing |
**Installation:**
```bash
# Already installed on mgmt container
source ~/venv/bin/activate
pip show python-telegram-bot # Version: 22.5
which claude # /home/mikkel/.local/bin/claude
claude --version # 2.1.31 (Claude Code)
```
## Architecture Patterns
### Recommended Process Model
```
Session lifecycle (Phase 2):
├── User sends /new → creates session → spawns persistent subprocess
├── User sends message → writes NDJSON to subprocess stdin
├── Subprocess emits stream-json events → parsed and sent to Telegram
└── User switches session → suspend current subprocess (keep alive for Phase 3 timeout)
Subprocess I/O:
stdin → NDJSON messages (one per turn)
stdout → stream-json events (assistant text, tool calls, result)
stderr → error logs
```
### Pattern 1: Persistent Process with stream-json I/O
**What:** Spawn Claude Code with `--input-format stream-json --output-format stream-json`, keep process alive, write NDJSON messages to stdin
**When to use:** Session creation, message handling
**Example:**
```python
# Source: https://code.claude.com/docs/en/cli-reference
import asyncio
import json
async def spawn_persistent_claude(session_dir: Path, persona: dict):
"""Spawn persistent Claude Code subprocess for session."""
cmd = [
'claude',
'--input-format', 'stream-json',
'--output-format', 'stream-json',
'--verbose',
'--continue', # Resume session if exists
]
# Add persona settings
if persona:
if 'system_prompt' in persona:
cmd.extend(['--system-prompt', persona['system_prompt']])
settings = persona.get('settings', {})
if 'max_turns' in settings:
cmd.extend(['--max-turns', str(settings['max_turns'])])
if 'model' in settings:
cmd.extend(['--model', settings['model']])
proc = await asyncio.create_subprocess_exec(
*cmd,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=str(session_dir)
)
return proc
async def send_message_to_subprocess(proc, message: str):
"""Send NDJSON message to subprocess stdin."""
msg = {'content': message}
ndjson_line = json.dumps(msg) + '\n'
proc.stdin.write(ndjson_line.encode())
await proc.stdin.drain()
```
### Pattern 2: Typing Indicator Loop
**What:** Send `send_chat_action(ChatAction.TYPING)` every 4 seconds while Claude is processing
**When to use:** After user sends message, stop when Claude response completes
**Example:**
```python
# Source: https://github.com/python-telegram-bot/python-telegram-bot/issues/2869
from telegram import ChatAction
import asyncio
async def typing_indicator_loop(bot, chat_id, stop_event: asyncio.Event):
"""Maintain typing indicator until stop_event is set."""
while not stop_event.is_set():
try:
await bot.send_chat_action(chat_id=chat_id, action=ChatAction.TYPING)
except Exception as e:
logger.warning(f"Failed to send typing indicator: {e}")
# Wait 4s or until stop_event (whichever comes first)
try:
await asyncio.wait_for(stop_event.wait(), timeout=4.0)
break # stop_event was set
except asyncio.TimeoutError:
continue # Re-send typing indicator
# Usage in message handler
stop_typing = asyncio.Event()
typing_task = asyncio.create_task(typing_indicator_loop(context.bot, chat_id, stop_typing))
# ... Claude processing happens ...
stop_typing.set()
await typing_task # Clean up
```
### Pattern 3: Smart Message Splitting with MarkdownV2
**What:** Split long messages at smart boundaries (paragraphs, code blocks) without breaking MarkdownV2 syntax
**When to use:** Before sending any message to Telegram (4096 char limit)
**Example:**
```python
# Source: https://limits.tginfo.me/en + MarkdownV2 research
import re
TELEGRAM_MAX_MESSAGE_LENGTH = 4096
def escape_markdown_v2(text: str) -> str:
"""Escape MarkdownV2 special characters."""
# 17 characters need escaping: _ * [ ] ( ) ~ ` > # + - = | { } . !
escape_chars = r'_*[]()~`>#+-=|{}.!'
return re.sub(f'([{re.escape(escape_chars)}])', r'\\\1', text)
def split_message_smart(text: str, max_length: int = 4000) -> list[str]:
"""
Split message at smart boundaries, never breaking MarkdownV2 code blocks.
Uses 4000 instead of 4096 to leave room for escape characters.
"""
if len(text) <= max_length:
return [text]
chunks = []
current_chunk = ""
# Split by paragraphs first
paragraphs = text.split('\n\n')
for para in paragraphs:
# Check if adding this paragraph exceeds limit
if len(current_chunk) + len(para) + 2 <= max_length:
if current_chunk:
current_chunk += '\n\n'
current_chunk += para
else:
# Paragraph too large or would overflow
if current_chunk:
chunks.append(current_chunk)
current_chunk = ""
# If single paragraph is too large, split by lines
if len(para) > max_length:
lines = para.split('\n')
for line in lines:
if len(current_chunk) + len(line) + 1 <= max_length:
if current_chunk:
current_chunk += '\n'
current_chunk += line
else:
if current_chunk:
chunks.append(current_chunk)
current_chunk = line
else:
current_chunk = para
if current_chunk:
chunks.append(current_chunk)
return chunks
```
### Pattern 4: Message Batching with Debounce
**What:** Collect rapid sequential messages in a queue, wait for pause, send batch to Claude
**When to use:** User typing multiple short messages in quick succession
**Example:**
```python
# Source: https://github.com/LiraNuna/aio-batching
import asyncio
class MessageBatcher:
"""Batch rapid messages with debounce timer."""
def __init__(self, debounce_seconds: float = 2.0):
self.queue: asyncio.Queue = asyncio.Queue()
self.debounce_seconds = debounce_seconds
self._batch_task: Optional[asyncio.Task] = None
async def add_message(self, message: str):
"""Add message to batch queue."""
await self.queue.put(message)
# Cancel existing batch timer and start new one
if self._batch_task and not self._batch_task.done():
self._batch_task.cancel()
self._batch_task = asyncio.create_task(self._wait_and_flush())
async def _wait_and_flush(self):
"""Wait for debounce period, then flush batched messages."""
await asyncio.sleep(self.debounce_seconds)
# Collect all queued messages
messages = []
while not self.queue.empty():
messages.append(await self.queue.get())
if messages:
# Send combined message to Claude
combined = '\n\n'.join(messages)
await self.send_to_claude(combined)
async def send_to_claude(self, message: str):
"""Override in subclass to handle batched message."""
pass
```
### Pattern 5: File Upload/Download
**What:** Save Telegram files to session folder, send files back as attachments
**When to use:** User sends photo/document, Claude generates file to share
**Example:**
```python
# Source: https://github.com/python-telegram-bot/python-telegram-bot/wiki/Working-with-Files-and-Media
from pathlib import Path
from telegram import Update
from telegram.ext import ContextTypes
async def handle_document(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Download document to session folder."""
doc = update.message.document
session_dir = get_active_session_dir()
# Download to session folder
file = await context.bot.get_file(doc.file_id)
filepath = session_dir / doc.file_name
await file.download_to_drive(filepath)
await update.message.reply_text(f"File saved: {doc.file_name}")
async def send_file_to_user(bot, chat_id: int, filepath: Path):
"""Send file from filesystem as Telegram document."""
with open(filepath, 'rb') as f:
await bot.send_document(chat_id=chat_id, document=f, filename=filepath.name)
```
### Pattern 6: Progress Updates via Message Editing
**What:** Edit a single message in-place to show tool call progress (alternative to separate messages)
**When to use:** When tool call notifications should update in-place rather than spam chat
**Example:**
```python
# Source: https://github.com/aiogram/aiogram + PTB message editing docs
async def send_progress_update(bot, chat_id: int, message_id: int, status: str):
"""Edit existing message with new status."""
try:
await bot.edit_message_text(
chat_id=chat_id,
message_id=message_id,
text=f"Status: {status}"
)
except Exception as e:
# Message might be too old or already deleted
logger.warning(f"Failed to edit message: {e}")
```
### Anti-Patterns to Avoid
- **Naive message splitting at character count:** Will break MarkdownV2 code blocks mid-syntax, causing parse errors
- **Single typing indicator at start:** Expires after 5 seconds, leaving long operations (30s+) without feedback
- **Spawning fresh subprocess per turn:** 1s overhead per message, loses conversation context between turns
- **Blocking asyncio.sleep() in message handler:** Freezes bot event loop, preventing other users from interacting
## Don't Hand-Roll
Problems that look simple but have existing solutions:
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| MarkdownV2 escaping | Custom regex for 17 special chars | Pre-built escape function or library | Context-sensitive rules (code blocks vs text), easy to miss edge cases |
| Message batching/debounce | Manual timer + queue | asyncio.Queue + asyncio.wait_for pattern | Handles cancellation, edge cases, timeout edge conditions |
| Typing indicator loop | Manual while loop + sleep | Asyncio task + Event for cancellation | Clean shutdown, no orphaned tasks, proper exception handling |
| Long message splitting | Character count slicing | Smart boundary detection (paragraph/code block) | Prevents breaking markdown syntax, better UX |
**Key insight:** Telegram's MarkdownV2 has 17 special characters with context-dependent escaping rules. Code blocks require different escaping than regular text. Links require escaping ')' and '\'. Hand-rolling this leads to subtle bugs that only surface with specific character combinations.
## Common Pitfalls
### Pitfall 1: MarkdownV2 Code Block Breaks from Naive Splitting
**What goes wrong:** Splitting long message at character count breaks triple-backtick code blocks mid-block, causing Telegram parse errors
**Why it happens:** MarkdownV2 requires balanced code block markers. Splitting inside \`\`\` block creates unmatched markers, invalid syntax.
**How to avoid:** Parse message for code block boundaries, never split inside \`\`\` ... \`\`\` region. Split at paragraph boundaries first, then line boundaries.
**Warning signs:** Telegram API errors "can't parse entities", malformed code display in chat
```python
# WRONG - Naive character count split
def split_naive(text, max_len=4096):
return [text[i:i+max_len] for i in range(0, len(text), max_len)]
# RIGHT - Respect code blocks
def split_smart(text, max_len=4000):
# Track if we're inside code block
in_code_block = False
chunks = []
current = ""
for line in text.split('\n'):
if line.startswith('```'):
in_code_block = not in_code_block
if len(current) + len(line) + 1 > max_len and not in_code_block:
chunks.append(current)
current = line
else:
if current:
current += '\n'
current += line
if current:
chunks.append(current)
return chunks
```
### Pitfall 2: Typing Indicator Expires During Long Operations
**What goes wrong:** Send typing indicator once at start, but Claude takes 30s to respond — user sees no feedback after 5s
**Why it happens:** Telegram expires typing status after 5 seconds. Single send() call doesn't maintain indicator through long operations.
**How to avoid:** Run typing indicator in background loop, re-send every 4 seconds until operation completes. Use asyncio.Event to signal completion.
**Warning signs:** Users ask "is bot working?", no visual feedback during 10-60s processing times
```python
# WRONG - Single typing send
await bot.send_chat_action(chat_id, ChatAction.TYPING)
# ... 30s of Claude processing ...
# Typing indicator expired after 5s
# RIGHT - Typing loop
stop_event = asyncio.Event()
typing_task = asyncio.create_task(typing_indicator_loop(bot, chat_id, stop_event))
# ... Claude processing ...
stop_event.set()
await typing_task
```
### Pitfall 3: Stdin Writes Without drain() Cause Deadlock
**What goes wrong:** Write many messages to subprocess stdin without calling drain(), pipe buffer fills, subprocess blocks writing stdout, deadlock
**Why it happens:** OS pipe buffers are finite (~64KB). If parent floods stdin faster than child reads, buffer fills. If child can't write stdout (parent not reading), both block forever.
**How to avoid:** Always call `await proc.stdin.drain()` after each write to ensure data is flushed. Continue concurrent stdout/stderr reading from Phase 1.
**Warning signs:** Subprocess hangs indefinitely, no output, both parent and child processes at 0% CPU
```python
# WRONG - Write without drain
proc.stdin.write(message.encode())
proc.stdin.write(message2.encode()) # Buffer overflow risk
# RIGHT - Write + drain
proc.stdin.write(message.encode())
await proc.stdin.drain()
```
### Pitfall 4: Message Batching Without Timeout Creates Indefinite Waits
**What goes wrong:** Batch messages waiting for pause, but user sends final message then stops — batch never flushes
**Why it happens:** Debounce logic waits for quiet period. If user's last message doesn't trigger another message, debounce timer never fires.
**How to avoid:** Use `asyncio.wait_for()` with max wait time (e.g., 5s). If timeout, flush batch even without pause.
**Warning signs:** User sends message, no response, batch stuck in queue waiting for non-existent next message
```python
# WRONG - Wait indefinitely
while not self.queue.empty():
await asyncio.sleep(2.0) # Wait for more messages
# What if no more messages come?
# RIGHT - Timeout fallback
try:
await asyncio.wait_for(self.queue.get(), timeout=5.0)
except asyncio.TimeoutError:
# Timeout reached, flush what we have
await self.flush_batch()
```
### Pitfall 5: Persistent Process Outlives Session Switch
**What goes wrong:** Switch to new session but old subprocess still running, both processes writing to same chat, confusing output
**Why it happens:** Session switch activates new session but doesn't suspend old subprocess. Both continue processing messages.
**How to avoid:** Track active subprocess per session, suspend (or terminate) old subprocess when switching. Phase 3 adds idle timeout for cleanup.
**Warning signs:** Multiple responses to single message, output from wrong session context
```python
# WRONG - Switch without cleanup
def switch_session(new_session):
self.active_session = new_session
# Old subprocess still running!
# RIGHT - Suspend old subprocess
async def switch_session(new_session):
if self.active_session and self.active_session.subprocess:
await self.active_session.subprocess.suspend()
self.active_session = new_session
```
## Code Examples
Verified patterns from official sources:
### Persistent Subprocess with stream-json I/O
```python
# Source: https://code.claude.com/docs/en/cli-reference
import asyncio
import json
from pathlib import Path
class PersistentClaudeSubprocess:
"""Manages persistent Claude Code subprocess with stream-json I/O."""
def __init__(self, session_dir: Path, persona: dict):
self.session_dir = session_dir
self.persona = persona
self.process = None
async def start(self):
"""Spawn persistent subprocess."""
cmd = [
'claude',
'--input-format', 'stream-json',
'--output-format', 'stream-json',
'--verbose',
'--continue',
]
# Add persona settings
if self.persona.get('system_prompt'):
cmd.extend(['--system-prompt', self.persona['system_prompt']])
settings = self.persona.get('settings', {})
if 'model' in settings:
cmd.extend(['--model', settings['model']])
if 'max_turns' in settings:
cmd.extend(['--max-turns', str(settings['max_turns'])])
self.process = await asyncio.create_subprocess_exec(
*cmd,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=str(self.session_dir)
)
# Start concurrent stream readers
asyncio.create_task(self._read_stdout())
asyncio.create_task(self._read_stderr())
async def send_message(self, message: str):
"""Send message to subprocess via NDJSON stdin."""
if not self.process or not self.process.stdin:
raise RuntimeError("Subprocess not running")
msg = {'content': message}
ndjson_line = json.dumps(msg) + '\n'
self.process.stdin.write(ndjson_line.encode())
await self.process.stdin.drain() # CRITICAL: flush buffer
async def _read_stdout(self):
"""Read stdout stream-json events."""
while True:
line = await self.process.stdout.readline()
if not line:
break
try:
event = json.loads(line.decode().rstrip())
await self._handle_event(event)
except json.JSONDecodeError:
pass
async def _read_stderr(self):
"""Read stderr logs."""
while True:
line = await self.process.stderr.readline()
if not line:
break
logger.warning(f"Claude stderr: {line.decode().rstrip()}")
async def _handle_event(self, event: dict):
"""Handle stream-json event."""
# Implement event routing (assistant, result, system)
pass
```
### Typing Indicator with Background Loop
```python
# Source: https://github.com/python-telegram-bot/python-telegram-bot/issues/2869
from telegram import ChatAction
import asyncio
async def maintain_typing_indicator(bot, chat_id: int, stop_event: asyncio.Event):
"""
Maintain typing indicator until stop_event is set.
Re-sends typing action every 4 seconds to keep indicator alive
for operations longer than 5 seconds.
"""
while not stop_event.is_set():
try:
await bot.send_chat_action(chat_id=chat_id, action=ChatAction.TYPING)
except Exception as e:
logger.warning(f"Failed to send typing indicator: {e}")
# Wait 4s or until stop_event
try:
await asyncio.wait_for(stop_event.wait(), timeout=4.0)
break # stop_event was set
except asyncio.TimeoutError:
continue # Timeout, re-send typing
# Usage in message handler
async def handle_user_message(update: Update, context: ContextTypes.DEFAULT_TYPE):
chat_id = update.effective_chat.id
# Start typing indicator loop
stop_typing = asyncio.Event()
typing_task = asyncio.create_task(
maintain_typing_indicator(context.bot, chat_id, stop_typing)
)
try:
# Send message to Claude and wait for response
await subprocess.send_message(update.message.text)
# Response arrives via subprocess callbacks
finally:
# Stop typing indicator
stop_typing.set()
await typing_task
```
### Smart Message Splitting with Code Block Detection
```python
# Source: https://limits.tginfo.me/en + MarkdownV2 research
import re
TELEGRAM_MAX_LENGTH = 4096
SAFE_LENGTH = 4000 # Leave room for escape characters
def split_message_smart(text: str) -> list[str]:
"""
Split long message at smart boundaries, respecting MarkdownV2 code blocks.
Never splits inside triple-backtick code blocks. Prefers paragraph breaks,
then line breaks, then character breaks as last resort.
"""
if len(text) <= SAFE_LENGTH:
return [text]
chunks = []
current_chunk = ""
in_code_block = False
lines = text.split('\n')
for line in lines:
# Track code block state
if line.strip().startswith('```'):
in_code_block = not in_code_block
# Check if adding this line exceeds limit
potential_chunk = current_chunk + ('\n' if current_chunk else '') + line
if len(potential_chunk) > SAFE_LENGTH:
# Would exceed limit
if in_code_block:
# Inside code block - must include whole block
current_chunk = potential_chunk
else:
# Can split here
if current_chunk:
chunks.append(current_chunk)
current_chunk = line
else:
current_chunk = potential_chunk
if current_chunk:
chunks.append(current_chunk)
return chunks
def escape_markdown_v2(text: str) -> str:
"""
Escape MarkdownV2 special characters.
Source: https://postly.ai/telegram/telegram-markdown-formatting
17 characters require escaping: _ * [ ] ( ) ~ ` > # + - = | { } . !
"""
escape_chars = r'_*[]()~`>#+-=|{}.!'
return re.sub(f'([{re.escape(escape_chars)}])', r'\\\1', text)
```
### Message Batching with Debounce
```python
# Source: https://github.com/LiraNuna/aio-batching + asyncio patterns
import asyncio
from typing import Callable, Optional
class MessageBatcher:
"""
Batch rapid messages with debounce timer.
Collects messages in queue, waits for pause (debounce_seconds),
then flushes batch via callback.
"""
def __init__(self, callback: Callable, debounce_seconds: float = 2.0):
self.callback = callback
self.debounce_seconds = debounce_seconds
self.queue: asyncio.Queue = asyncio.Queue()
self._batch_task: Optional[asyncio.Task] = None
async def add_message(self, message: str):
"""Add message to batch queue, reset debounce timer."""
await self.queue.put(message)
# Cancel existing timer and start new one
if self._batch_task and not self._batch_task.done():
self._batch_task.cancel()
try:
await self._batch_task
except asyncio.CancelledError:
pass
self._batch_task = asyncio.create_task(self._wait_and_flush())
async def _wait_and_flush(self):
"""Wait for debounce period, then flush batched messages."""
try:
await asyncio.sleep(self.debounce_seconds)
except asyncio.CancelledError:
return # Cancelled by new message
# Collect all queued messages
messages = []
while not self.queue.empty():
try:
msg = self.queue.get_nowait()
messages.append(msg)
except asyncio.QueueEmpty:
break
if messages:
# Combine and send to callback
combined = '\n\n'.join(messages)
await self.callback(combined)
# Usage
async def send_to_claude(combined_message: str):
"""Callback invoked when batch flushes."""
await subprocess.send_message(combined_message)
batcher = MessageBatcher(callback=send_to_claude, debounce_seconds=2.0)
# In message handler
async def handle_message(update: Update, context: ContextTypes.DEFAULT_TYPE):
await batcher.add_message(update.message.text)
```
### File Upload and Download
```python
# Source: https://github.com/python-telegram-bot/python-telegram-bot/wiki/Working-with-Files-and-Media
from pathlib import Path
from telegram import Update
from telegram.ext import ContextTypes
async def handle_photo(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Download photo to session folder."""
session_dir = get_active_session_dir()
# Get highest quality photo
photo = update.message.photo[-1]
file = await context.bot.get_file(photo.file_id)
# Download to session folder
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
filepath = session_dir / f'photo_{timestamp}.jpg'
await file.download_to_drive(filepath)
# Auto-analyze with Claude
caption = update.message.caption or ""
prompt = f"Analyze this photo: {filepath.name}"
if caption:
prompt += f"\nUser caption: {caption}"
await subprocess.send_message(prompt)
async def handle_document(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Download document to session folder."""
session_dir = get_active_session_dir()
doc = update.message.document
file = await context.bot.get_file(doc.file_id)
# Download with original filename
filepath = session_dir / doc.file_name
await file.download_to_drive(filepath)
# Notify Claude (don't auto-analyze, wait for user intent)
await subprocess.send_message(f"User uploaded file: {doc.file_name}")
async def send_file_to_user(bot, chat_id: int, filepath: Path):
"""Send file from session folder to user."""
with open(filepath, 'rb') as f:
await bot.send_document(
chat_id=chat_id,
document=f,
filename=filepath.name,
caption=f"Generated: {filepath.name}"
)
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Fresh subprocess per turn | Persistent subprocess with stdin streaming | Claude Code 2.0+ (2024) | Eliminates ~1s spawn overhead, maintains conversation context |
| Telegram Markdown | MarkdownV2 with 17 escape chars | Telegram Bot API 4.5+ (2019) | Better formatting but complex escaping rules |
| Single typing indicator | Loop re-sending every 4s | Community best practice (2020+) | Maintains feedback for operations >5s |
| Sequential message sending | Batch with debounce timer | Modern asyncio patterns (2023+) | Reduces API calls, groups related messages |
**Deprecated/outdated:**
- **Telegram Markdown (v1):** Deprecated in favor of MarkdownV2, limited formatting options
- **Blocking subprocess.communicate():** Replaced by asyncio concurrent stream reading
- **PTY for non-interactive programs:** Unnecessary complexity, pipes + stream-json is standard
## Open Questions
Things that couldn't be fully resolved:
1. **Claude Code's --input-format stream-json message format**
- What we know: Accepts NDJSON on stdin, `{'content': 'message'}` format likely based on API message structure
- What's unclear: Full schema for stream-json input — does it support attachments, metadata, user role?
- Recommendation: Test with minimal `{'content': '...'}` structure first. Check official docs or CLI help for schema if basic format fails.
2. **Optimal debounce timing for message batching**
- What we know: 2-5s debounce is common for typing indicators, chat UX
- What's unclear: What's the sweet spot for balancing responsiveness vs batching effectiveness?
- Recommendation: Start with 2s debounce. If users complain about slow responses, reduce to 1s. If too many unbatched messages, increase to 3s. Make configurable.
3. **MarkdownV2 escape handling for Claude-generated content**
- What we know: 17 special chars require escaping, code blocks have different rules
- What's unclear: Should we escape Claude's output before sending, or let Claude generate pre-escaped markdown?
- Recommendation: Escape Claude's output in bot code before sending. Claude doesn't know it's outputting to Telegram MarkdownV2, so bot should handle escaping. Exception: If Claude is told to generate MarkdownV2 explicitly in system prompt.
4. **Tool call progress notification verbosity**
- What we know: Users want progress feedback ("Reading file...", "Running test...")
- What's unclear: Should every tool call get a notification? Only long-running ones? Editable single message or separate messages?
- Recommendation: Start with separate message per tool call. Phase 4 can add smart filtering (only notify if tool takes >2s) or consolidate into single editable message. User feedback will inform verbosity level.
## Sources
### Primary (HIGH confidence)
- [CLI reference - Claude Code Docs](https://code.claude.com/docs/en/cli-reference) - Official Claude Code documentation on `--input-format stream-json`
- [Stream-JSON Chaining - ruvnet/claude-flow Wiki](https://github.com/ruvnet/claude-flow/wiki/Stream-Chaining) - Community documentation on stream-json format and chaining
- [Working with Files and Media - python-telegram-bot Wiki](https://github.com/python-telegram-bot/python-telegram-bot/wiki/Working-with-Files-and-Media) - Official PTB file handling guide
- [Formatting Messages: MarkdownV2 & HTML - Postly Telegram Guides](https://postly.ai/telegram/telegram-markdown-formatting) - Comprehensive MarkdownV2 reference
### Secondary (MEDIUM confidence)
- [ChatAction TYPING - python-telegram-bot Issue #2869](https://github.com/python-telegram-bot/python-telegram-bot/issues/2869) - Typing indicator patterns from PTB maintainers
- [Telegram Limits - Telegram Info](https://limits.tginfo.me/en) - Official Telegram limits reference
- [aio-batching - GitHub](https://github.com/LiraNuna/aio-batching) - Asyncio batching library with examples
- [7 AsyncIO Patterns - Medium](https://medium.com/@connect.hashblock/7-asyncio-patterns-for-concurrency-friendly-python-685abeb2a534) - Practical asyncio patterns for services
### Tertiary (LOW confidence)
- WebSearch results on message splitting and debounce patterns - Multiple sources, cross-referenced but not deeply verified
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - All components verified in use, versions confirmed
- Architecture: HIGH - Patterns sourced from official docs and proven libraries
- Pitfalls: MEDIUM-HIGH - Common issues documented across community sources, verified against official warnings
**Research date:** 2026-02-04
**Valid until:** 2026-03-04 (30 days - Python asyncio and Telegram API are stable, Claude Code CLI is evolving but backwards-compatible)

View file

@ -1,133 +0,0 @@
---
phase: 03-lifecycle-management
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- telegram/idle_timer.py
- telegram/session_manager.py
- telegram/claude_subprocess.py
autonomous: true
must_haves:
truths:
- "Per-session idle timer fires callback after configurable timeout seconds"
- "Timer resets on activity (cancel + restart)"
- "Session metadata includes idle_timeout field (default 600s)"
- "ClaudeSubprocess exposes its PID for metadata tracking"
artifacts:
- path: "telegram/idle_timer.py"
provides: "SessionIdleTimer class with asyncio-based per-session idle timers"
min_lines: 60
- path: "telegram/session_manager.py"
provides: "Session metadata with idle_timeout field, PID tracking"
contains: "idle_timeout"
- path: "telegram/claude_subprocess.py"
provides: "PID property for external access"
contains: "def pid"
key_links:
- from: "telegram/idle_timer.py"
to: "asyncio.create_task"
via: "Background sleep task with cancellation"
pattern: "asyncio\\.create_task.*_wait_for_timeout"
- from: "telegram/session_manager.py"
to: "metadata.json"
via: "idle_timeout stored in session metadata"
pattern: "idle_timeout"
---
<objective>
Create the idle timer module and extend session metadata for lifecycle management.
Purpose: Foundation components needed before wiring suspend/resume into the bot. The idle timer provides per-session timeout detection, and metadata extensions store timeout configuration and subprocess PIDs.
Output: New `idle_timer.py` module, updated `session_manager.py` and `claude_subprocess.py`
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-lifecycle-management/03-CONTEXT.md
@.planning/phases/03-lifecycle-management/03-RESEARCH.md
@telegram/idle_timer.py (will be created)
@telegram/session_manager.py
@telegram/claude_subprocess.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Create SessionIdleTimer module</name>
<files>telegram/idle_timer.py</files>
<action>
Create `telegram/idle_timer.py` with a `SessionIdleTimer` class that manages per-session idle timeouts using asyncio.
Class design:
- `__init__(self, session_name: str, timeout_seconds: int, on_timeout: Callable[[str], Awaitable[None]])` -- stores config, initializes _timer_task to None, _last_activity to now (UTC)
- `reset(self)` -- updates _last_activity to now, cancels existing _timer_task if running, creates new asyncio.create_task(_wait_for_timeout())
- `async _wait_for_timeout(self)` -- awaits asyncio.sleep(self.timeout_seconds), then calls `await self.on_timeout(self.session_name)`. Catches asyncio.CancelledError silently (timer was reset).
- `cancel(self)` -- cancels _timer_task if running (used on shutdown/archive)
- `@property seconds_since_activity` -- returns float seconds since _last_activity
- `@property last_activity` -- returns the datetime of last activity (for /sessions display)
Use `datetime.now(timezone.utc)` for timestamps. Import typing for Callable, Optional, Awaitable.
Add module docstring explaining this is the idle timeout manager for session lifecycle. Log timer start/cancel/fire events at DEBUG level, timeout firing at INFO level.
</action>
<verify>
`python3 -c "from idle_timer import SessionIdleTimer; print('import OK')"` run from telegram/ directory succeeds.
</verify>
<done>SessionIdleTimer class exists with reset(), cancel(), _wait_for_timeout(), seconds_since_activity, and last_activity. Imports cleanly.</done>
</task>
<task type="auto">
<name>Task 2: Extend session metadata and subprocess PID tracking</name>
<files>telegram/session_manager.py, telegram/claude_subprocess.py</files>
<action>
**session_manager.py changes:**
1. In `create_session()`, add `"idle_timeout": 600` (10 minutes default) to the initial metadata dict (alongside existing fields like name, created, last_active, persona, pid, status).
2. Add a helper method `get_session_timeout(self, name: str) -> int` that reads metadata and returns `metadata.get('idle_timeout', 600)`. This provides a clean interface for the bot to query timeout values.
3. No changes to list_sessions() -- it already returns full metadata which will now include idle_timeout.
**claude_subprocess.py changes:**
1. Add a `@property pid(self) -> Optional[int]` that returns `self._process.pid if self._process and self._process.returncode is None else None`. This lets the bot store the PID in session metadata for orphan cleanup on restart.
2. In `start()`, after successful subprocess spawn, store the PID in a `self._pid` attribute as well (for access even after process terminates, useful for logging). Keep the property returning live PID only.
These are minimal, targeted changes. Do NOT refactor existing code. Do NOT change the terminate() method or any existing logic.
</action>
<verify>
`python3 -c "from session_manager import SessionManager; sm = SessionManager(); print('SM OK')"` and `python3 -c "from claude_subprocess import ClaudeSubprocess; print('CS OK')"` both succeed from telegram/ directory.
</verify>
<done>Session metadata includes idle_timeout (default 600s). SessionManager has get_session_timeout() method. ClaudeSubprocess has pid property returning live process PID.</done>
</task>
</tasks>
<verification>
- `cd ~/homelab/telegram && python3 -c "from idle_timer import SessionIdleTimer; from session_manager import SessionManager; from claude_subprocess import ClaudeSubprocess; print('All imports OK')"`
- SessionIdleTimer has reset(), cancel(), seconds_since_activity, last_activity
- SessionManager.get_session_timeout() returns int
- ClaudeSubprocess.pid returns Optional[int]
</verification>
<success_criteria>
- idle_timer.py exists with SessionIdleTimer class implementing asyncio-based per-session idle timeout
- session_manager.py creates sessions with idle_timeout=600 in metadata and has get_session_timeout() helper
- claude_subprocess.py exposes pid property for PID tracking
- All three modules import without errors
</success_criteria>
<output>
After completion, create `.planning/phases/03-lifecycle-management/03-01-SUMMARY.md`
</output>

View file

@ -1,102 +0,0 @@
---
phase: 03-lifecycle-management
plan: 01
subsystem: infra
tags: [asyncio, python, session-management, lifecycle]
# Dependency graph
requires:
- phase: 02-telegram-integration
provides: Session management and persistent subprocess architecture
provides:
- SessionIdleTimer class for per-session timeout detection
- Session metadata with idle_timeout field for lifecycle configuration
- ClaudeSubprocess.pid property for process tracking
affects: [03-lifecycle-management]
# Tech tracking
tech-stack:
added: []
patterns:
- "Asyncio-based timer with reset() cancellation pattern"
- "Session metadata defaults for configurable behavior"
key-files:
created:
- telegram/idle_timer.py
modified:
- telegram/session_manager.py
- telegram/claude_subprocess.py
key-decisions:
- "Default 600s (10 min) idle timeout per session"
- "Timer reset via task cancellation + new task creation"
- "PID property returns live process ID only (None if terminated)"
patterns-established:
- "Timer pattern: cancel existing task, create new background sleep task"
- "Metadata defaults: provide sensible values in create_session()"
# Metrics
duration: 2min
completed: 2026-02-04
---
# Phase 03 Plan 01: Idle Timer Foundation Summary
**Asyncio-based per-session idle timer with configurable timeout metadata and subprocess PID tracking**
## Performance
- **Duration:** 2 min
- **Started:** 2026-02-04T23:27:29Z
- **Completed:** 2026-02-04T23:29:00Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- Created SessionIdleTimer class with asyncio timer management
- Extended session metadata to include idle_timeout field (default 600s)
- Added PID property to ClaudeSubprocess for process tracking
- Foundation ready for suspend/resume lifecycle implementation
## Task Commits
Each task was committed atomically:
1. **Task 1: Create SessionIdleTimer module** - `488d94e` (feat)
2. **Task 2: Extend session metadata and subprocess PID tracking** - `74f12a1` (feat)
## Files Created/Modified
- `telegram/idle_timer.py` - SessionIdleTimer class with reset(), cancel(), and activity tracking properties
- `telegram/session_manager.py` - Added idle_timeout to metadata, get_session_timeout() helper method
- `telegram/claude_subprocess.py` - Added pid property returning live process ID
## Decisions Made
- Default idle timeout: 600 seconds (10 minutes) - balances responsiveness with resource conservation
- Timer reset pattern: Cancel existing asyncio task and create new one (clean slate approach)
- PID property returns None for terminated processes - prevents stale PID references
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## User Setup Required
None - no external service configuration required.
## Next Phase Readiness
Ready for next plan (03-02: Suspend/Resume Implementation):
- Idle timer module complete and tested
- Session metadata supports timeout configuration
- Subprocess exposes PID for lifecycle tracking
- All imports verified, no blockers
---
*Phase: 03-lifecycle-management*
*Completed: 2026-02-04*

View file

@ -1,311 +0,0 @@
---
phase: 03-lifecycle-management
plan: 02
type: execute
wave: 2
depends_on: ["03-01"]
files_modified:
- telegram/bot.py
autonomous: true
must_haves:
truths:
- "Session suspends automatically after idle timeout (subprocess terminated, status set to suspended)"
- "User message to suspended session resumes it with --continue and shows 'Resuming session...' status"
- "Resume failure sends error to user and does not auto-create fresh session"
- "Race between timeout-fire and user-message is prevented by asyncio.Lock"
- "Bot startup kills orphaned subprocess PIDs and sets all sessions to suspended"
- "Bot shutdown terminates all subprocesses gracefully (SIGTERM + 5s timeout + SIGKILL)"
- "/timeout <minutes> sets per-session idle timeout (1-120 range)"
- "/sessions lists all sessions with status indicator, persona, and last active time"
artifacts:
- path: "telegram/bot.py"
provides: "Suspend/resume wiring, idle timers, /timeout, /sessions, startup cleanup, graceful shutdown"
contains: "idle_timers"
key_links:
- from: "telegram/bot.py"
to: "telegram/idle_timer.py"
via: "import and instantiate SessionIdleTimer per session"
pattern: "from idle_timer import SessionIdleTimer"
- from: "telegram/bot.py on_complete callback"
to: "idle_timer.reset()"
via: "Timer starts after Claude finishes processing"
pattern: "idle_timers.*reset"
- from: "telegram/bot.py handle_message"
to: "resume logic"
via: "Detect suspended session, spawn with --continue, send status"
pattern: "Resuming session"
- from: "telegram/bot.py suspend_session"
to: "ClaudeSubprocess.terminate()"
via: "Idle timer fires, terminates subprocess"
pattern: "await.*terminate"
---
<objective>
Wire suspend/resume lifecycle, idle timers, new commands, and cleanup into the bot.
Purpose: This is the core integration plan that makes sessions automatically suspend after idle timeout, resume transparently on user message, and provides /timeout + /sessions commands. Also adds startup orphan cleanup and graceful shutdown signal handling.
Output: Updated `bot.py` with full lifecycle management
</objective>
<execution_context>
@/home/mikkel/.claude/get-shit-done/workflows/execute-plan.md
@/home/mikkel/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/03-lifecycle-management/03-CONTEXT.md
@.planning/phases/03-lifecycle-management/03-RESEARCH.md
@.planning/phases/03-lifecycle-management/03-01-SUMMARY.md
@telegram/bot.py
@telegram/idle_timer.py
@telegram/session_manager.py
@telegram/claude_subprocess.py
</context>
<tasks>
<task type="auto">
<name>Task 1: Suspend/resume wiring with race locks, startup cleanup, and graceful shutdown</name>
<files>telegram/bot.py</files>
<action>
This is the core lifecycle wiring in bot.py. Make these changes:
**New imports and globals:**
- `import signal, os` (for shutdown handlers and PID checks)
- `from idle_timer import SessionIdleTimer`
- Add global dict: `idle_timers: dict[str, SessionIdleTimer] = {}`
- Add global dict: `subprocess_locks: dict[str, asyncio.Lock] = {}` (one lock per session, prevents races between timeout-fire and user-message)
**Helper: get_subprocess_lock(session_name)**
- Returns existing lock or creates new one for session. Pattern: `subprocess_locks.setdefault(session_name, asyncio.Lock())`
**Suspend function: `async def suspend_session(session_name: str)`**
- This is the idle timer's on_timeout callback.
- Acquire the session's subprocess lock.
- Check if subprocess exists and is_alive. If not alive, just update metadata and return.
- Check `subprocesses[session_name].is_busy` -- if busy, DON'T suspend (Claude is mid-processing). Instead, reset the idle timer to try again later. Log this. Return.
- Store the subprocess PID for logging.
- Call `await subprocesses[session_name].terminate()` (existing method with SIGTERM + timeout + SIGKILL).
- Remove from `subprocesses` dict.
- Flush and remove batcher if exists: `if session_name in batchers: await batchers[session_name].flush_immediately(); del batchers[session_name]`
- Update session metadata: `session_manager.update_session(session_name, status='suspended', pid=None)`
- Cancel and remove idle timer: `if session_name in idle_timers: idle_timers[session_name].cancel(); del idle_timers[session_name]`
- Log: `logger.info(f"Session '{session_name}' suspended after idle timeout")`
- DECISION (from CONTEXT.md): Silent suspension -- do NOT send any Telegram message.
**Modify make_callbacks() -- add on_complete idle timer integration:**
- The `on_complete` callback already exists. Wrap it: after existing logic (stop typing), add idle timer reset:
```python
# Reset idle timer (only start counting AFTER Claude finishes)
if session_name in idle_timers:
idle_timers[session_name].reset()
```
- This ensures timer only starts when Claude is truly idle, never during processing.
**Modify handle_message() -- add resume logic:**
- After checking for active session, BEFORE the subprocess check, add:
```python
# Acquire lock to prevent race with suspend_session
lock = get_subprocess_lock(active_session)
async with lock:
```
Wrap the subprocess get-or-create and message send in this lock.
- Inside the lock, when subprocess is not alive:
1. Check if session has `.claude/` dir (has history). If yes, this is a resume.
2. If resuming: send status message to user: `"Resuming session..."` (include idle duration if >1 min from metadata last_active). Example: `"Resuming session (idle for 15 min)..."`
3. Spawn subprocess normally (the existing ClaudeSubprocess constructor + start() already handles --continue when .claude/ exists).
4. Store PID in metadata: `session_manager.update_session(active_session, status='active', last_active=now_iso, pid=subprocesses[active_session].pid)`
- After sending message (outside lock), create/reset idle timer for the session:
```python
timeout_secs = session_manager.get_session_timeout(active_session)
if active_session not in idle_timers:
idle_timers[active_session] = SessionIdleTimer(active_session, timeout_secs, on_timeout=suspend_session)
# Don't reset here -- timer resets in on_complete when Claude finishes
```
- IMPORTANT: Also reset the idle timer when user sends a message (user activity should reset timer too, per CONTEXT.md):
```python
if active_session in idle_timers:
idle_timers[active_session].reset()
```
Put this BEFORE sending to subprocess (so timer is reset even if message queues).
**Similarly update handle_photo() and handle_document():**
- Add the same lock acquisition, resume detection, and idle timer reset as handle_message().
- Keep the existing photo/document save and notification logic.
**Modify new_session() -- initialize idle timer after creation:**
- After subprocess creation, add:
```python
timeout_secs = session_manager.get_session_timeout(name)
idle_timers[name] = SessionIdleTimer(name, timeout_secs, on_timeout=suspend_session)
```
- Store PID in metadata: after subprocess is created/started, `session_manager.update_session(name, pid=subprocesses[name].pid)` (only after start()).
Note: The existing code creates ClaudeSubprocess but does NOT call start() -- start happens lazily on first send_message. So PID tracking happens in handle_message when subprocess auto-starts.
**Modify switch_session_cmd():**
- Per CONTEXT.md LOCKED decision: switching sessions leaves previous subprocess running (it suspends on its own timer). Do NOT cancel old session's idle timer.
- When auto-spawning subprocess for new session, set up idle timer as above.
**Modify archive_session_cmd():**
- Cancel idle timer if exists: `if name in idle_timers: idle_timers[name].cancel(); del idle_timers[name]`
- Remove subprocess lock if exists: `subprocess_locks.pop(name, None)`
**Modify model_cmd():**
- After terminating subprocess for model change, cancel idle timer: `if active_session in idle_timers: idle_timers[active_session].cancel(); del idle_timers[active_session]`
**Startup cleanup function: `async def cleanup_orphaned_subprocesses()`**
- Called once at bot startup (before polling starts).
- Iterate all sessions via `session_manager.list_sessions()`.
- For each session with a non-None `pid`:
1. Check if PID process exists: `os.kill(pid, 0)` wrapped in try/except ProcessLookupError.
2. If process exists, verify it's a claude process: read `/proc/{pid}/cmdline`, check if "claude" is in it. If not claude, skip killing.
3. If it IS a claude process: `os.kill(pid, signal.SIGTERM)`, sleep 2s, then try `os.kill(pid, signal.SIGKILL)` (catch ProcessLookupError if already dead).
4. Update metadata: `session_manager.update_session(session['name'], pid=None, status='suspended')`
- For sessions with status != 'suspended' and no pid, also set status to 'suspended'.
- Log summary: "Cleaned up N orphaned subprocesses"
**Graceful shutdown:**
- python-telegram-bot's `Application.run_polling()` handles signal installation internally. Instead of overriding signal handlers (which conflicts with the library), use the `post_shutdown` callback:
```python
async def post_shutdown(application):
"""Clean up subprocesses and timers on bot shutdown."""
logger.info("Bot shutting down, cleaning up...")
# Cancel all idle timers
for name, timer in idle_timers.items():
timer.cancel()
# Terminate all subprocesses
for name, proc in list(subprocesses.items()):
if proc.is_alive:
logger.info(f"Terminating subprocess for '{name}'")
await proc.terminate()
logger.info("Cleanup complete")
```
- Register in main(): `app.post_shutdown = post_shutdown`
- Also add a `post_init` callback for startup cleanup:
```python
async def post_init(application):
"""Run startup cleanup."""
await cleanup_orphaned_subprocesses()
```
Register: `app = Application.builder().token(TOKEN).post_init(post_init).build()`
**Update help text:**
- Add `/timeout <minutes>` and `/sessions` to the help_command text under "Claude Sessions" section.
</action>
<verify>
`python3 -c "import bot"` from telegram/ directory should not error (syntax check). Look for: idle_timers dict, subprocess_locks dict, suspend_session function, cleanup_orphaned_subprocesses function, post_shutdown callback.
</verify>
<done>
- suspend_session() terminates subprocess on idle timeout, updates metadata to suspended, silent (no Telegram notification)
- handle_message() detects suspended session, sends "Resuming session..." status, spawns with --continue
- Race lock prevents concurrent suspend + resume on same session
- Startup cleanup kills orphaned PIDs verified via /proc/cmdline
- Graceful shutdown terminates all subprocesses and cancels all timers
- handle_photo/handle_document also support resume from suspended state
</done>
</task>
<task type="auto">
<name>Task 2: /timeout and /sessions commands</name>
<files>telegram/bot.py</files>
<action>
Add two new command handlers to bot.py:
**/timeout command: `async def timeout_cmd(update, context)`**
- Auth check (same pattern as other commands).
- If no active session: reply "No active session. Use /new <name> to start one."
- If no args: show current timeout.
```python
timeout_secs = session_manager.get_session_timeout(active_session)
minutes = timeout_secs // 60
await update.message.reply_text(f"Idle timeout: {minutes} minutes\n\nUsage: /timeout <minutes> (1-120)")
```
- If args: parse first arg as int.
- Validate range 1-120. If out of range: `"Timeout must be between 1 and 120 minutes"`
- If not a valid int: `"Invalid number. Usage: /timeout <minutes>"`
- Convert to seconds: `timeout_seconds = minutes * 60`
- Update session metadata: `session_manager.update_session(active_session, idle_timeout=timeout_seconds)`
- If idle timer exists for this session, update its timeout_seconds attribute and reset: `idle_timers[active_session].timeout_seconds = timeout_seconds; idle_timers[active_session].reset()`
- Reply: `f"Idle timeout set to {minutes} minutes for session '{active_session}'."`
**/sessions command: `async def sessions_cmd(update, context)`**
- Auth check.
- Get all sessions: `session_manager.list_sessions()` (already sorted by last_active desc).
- If empty: reply "No sessions. Use /new <name> to create one."
- Build formatted list. For each session:
- Status indicator: active subprocess running -> "LIVE", status == "active" (in metadata) -> "ACTIVE", status == "suspended" -> "IDLE", else -> status
- Actually, check real subprocess state: `name in subprocesses and subprocesses[name].is_alive` -> "LIVE"
- Format last_active as relative time (e.g., "2m ago", "1h ago", "3d ago") using a small helper function:
```python
def format_relative_time(iso_str):
dt = datetime.fromisoformat(iso_str)
delta = datetime.now(timezone.utc) - dt
secs = delta.total_seconds()
if secs < 60: return "just now"
if secs < 3600: return f"{int(secs/60)}m ago"
if secs < 86400: return f"{int(secs/3600)}h ago"
return f"{int(secs/86400)}d ago"
```
- Mark current active session with arrow prefix.
- Format line: `"{marker}{status_emoji} {name} ({persona}) - {relative_time}"`
- Status emojis: LIVE -> green circle, IDLE/suspended -> white circle
- Join lines, reply with parse_mode='Markdown'. Use backticks around session names for monospace.
**Register handlers in main():**
- `app.add_handler(CommandHandler("timeout", timeout_cmd))` -- after the model handler
- `app.add_handler(CommandHandler("sessions", sessions_cmd))` -- after the session handler
**Update help text in help_command():**
- Under "Claude Sessions" section, add:
- `/sessions` - List all sessions with status
- `/timeout <minutes>` - Set idle timeout (1-120)
</action>
<verify>
`python3 -c "import bot; print('OK')"` succeeds. Grep for "timeout_cmd" and "sessions_cmd" in bot.py to confirm both exist. Grep for "CommandHandler.*timeout" and "CommandHandler.*sessions" to confirm registration.
</verify>
<done>
- /timeout shows current timeout when called without args, sets timeout (1-120 min range) when called with arg
- /sessions lists all sessions sorted by last active, showing live/idle status, persona, relative time
- Both commands registered as handlers in main()
- Help text updated with new commands
</done>
</task>
</tasks>
<verification>
1. `cd ~/homelab/telegram && python3 -c "import bot; print('All OK')"` -- no import errors
2. Grep for key integration points:
- `grep -n "suspend_session" telegram/bot.py` -- suspend function exists
- `grep -n "idle_timers" telegram/bot.py` -- idle timer dict used
- `grep -n "subprocess_locks" telegram/bot.py` -- race locks exist
- `grep -n "cleanup_orphaned" telegram/bot.py` -- startup cleanup exists
- `grep -n "post_shutdown" telegram/bot.py` -- graceful shutdown exists
- `grep -n "Resuming session" telegram/bot.py` -- resume status message exists
- `grep -n "timeout_cmd\|sessions_cmd" telegram/bot.py` -- new commands exist
3. Restart bot service: `systemctl --user restart telegram-bot.service && sleep 2 && systemctl --user status telegram-bot.service` -- should show active
</verification>
<success_criteria>
- Session auto-suspends after idle timeout (subprocess terminated, metadata status=suspended, no Telegram notification)
- Message to suspended session shows "Resuming session..." then Claude responds with full history
- If resume fails, error message sent (no auto-fresh-start)
- asyncio.Lock prevents race between timeout-fire and incoming message
- Bot startup kills orphaned subprocess PIDs (verified via /proc/cmdline)
- Bot shutdown terminates all subprocesses gracefully
- /timeout <minutes> sets per-session idle timeout (1-120 range), shows current value without args
- /sessions lists all sessions with LIVE/IDLE status, persona, and relative last-active time
- Help text includes new commands
- Bot service restarts cleanly
</success_criteria>
<output>
After completion, create `.planning/phases/03-lifecycle-management/03-02-SUMMARY.md`
</output>

View file

@ -1,124 +0,0 @@
---
phase: 03-lifecycle-management
plan: 02
subsystem: bot-lifecycle
tags: [asyncio, telegram, subprocess-management, idle-timeout, graceful-shutdown]
# Dependency graph
requires:
- phase: 03-01
provides: SessionIdleTimer class with reset/cancel and PID tracking in ClaudeSubprocess
provides:
- Automatic session suspension after idle timeout
- Transparent session resume with history preservation
- Race-free suspend/resume via asyncio.Lock per session
- Orphaned subprocess cleanup at bot startup
- Graceful shutdown with subprocess termination
- /timeout command for per-session idle configuration
- /sessions command for session status overview
affects: [03-03-output-modes]
# Tech tracking
tech-stack:
added: []
patterns:
- "Race prevention via per-session asyncio.Lock for concurrent suspend/resume"
- "Silent suspension (no Telegram notification) per CONTEXT.md decision"
- "Resume detection via .claude/ directory existence check"
- "Idle timer reset in on_complete callback (timer only counts after Claude finishes)"
key-files:
created: []
modified:
- telegram/bot.py
key-decisions:
- "Silent suspension (no Telegram notification) per CONTEXT.md LOCKED decision"
- "Race prevention via subprocess_locks dict: one asyncio.Lock per session"
- "Resume shows idle duration if >1 min (e.g., 'Resuming session (idle for 15 min)...')"
- "Orphaned PID verification via /proc/cmdline check (only kill claude processes)"
- "Bot shutdown uses post_shutdown callback (python-telegram-bot handles signals)"
patterns-established:
- "Per-session locking: subprocess_locks.setdefault(session_name, asyncio.Lock())"
- "Idle timer lifecycle: create on session spawn, reset in on_complete, cancel on archive"
- "Resume status message format: 'Resuming session (idle for Xm)...'"
# Metrics
duration: 4min
completed: 2026-02-04
---
# Phase 3 Plan 2: Suspend/Resume Implementation Summary
**Automatic session suspension after 10min idle, transparent resume with full history, race-free with asyncio.Lock per session**
## Performance
- **Duration:** 4 min
- **Started:** 2026-02-04T23:33:30Z
- **Completed:** 2026-02-04T23:37:56Z
- **Tasks:** 2
- **Files modified:** 1
## Accomplishments
- Sessions automatically suspend after idle timeout (subprocess terminated, metadata updated, silent)
- User messages to suspended sessions transparently resume with full history
- Race condition between timeout-fire and user-message prevented via asyncio.Lock per session
- Bot startup kills orphaned subprocess PIDs (verified via /proc/cmdline)
- Bot shutdown terminates all subprocesses gracefully (SIGTERM + timeout + SIGKILL)
- /timeout command sets per-session idle timeout (1-120 min range)
- /sessions command lists all sessions with LIVE/IDLE status, persona, and relative last-active time
## Task Commits
Each task was committed atomically:
1. **Task 1: Suspend/resume wiring with race locks, startup cleanup, and graceful shutdown** - `6ebdb4a` (feat)
- suspend_session() callback for idle timer
- get_subprocess_lock() helper to prevent races
- Resume logic in handle_message/handle_photo/handle_document
- Idle timer reset in on_complete and on user activity
- cleanup_orphaned_subprocesses() with /proc/cmdline verification
- post_init() and post_shutdown() lifecycle callbacks
- Updated new_session, switch_session_cmd, archive_session_cmd, model_cmd
2. **Task 2: /timeout and /sessions commands** - `06c5246` (feat)
- timeout_cmd() to set/show per-session idle timeout
- sessions_cmd() to list all sessions with status
- Registered both commands in main()
## Files Created/Modified
- `telegram/bot.py` - Added suspend/resume lifecycle, idle timers, race locks, startup cleanup, graceful shutdown, /timeout and /sessions commands
## Decisions Made
**From plan execution:**
- Resume status message shows idle duration if >1 min: "Resuming session (idle for 15 min)..."
- Orphaned subprocess cleanup verifies PID is a claude process via /proc/cmdline before killing
- Bot shutdown uses post_shutdown callback (python-telegram-bot Application handles signal installation internally)
**Already documented in STATE.md:**
- Silent suspension (no Telegram notification) - from CONTEXT.md LOCKED decision
- Switching sessions leaves previous subprocess running (suspends on its own timer) - from CONTEXT.md LOCKED decision
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None
## Next Phase Readiness
**Ready for Phase 3 Plan 3 (Output Modes):**
- Session lifecycle fully implemented (suspend/resume, timeout configuration, status commands)
- Subprocess management robust (startup cleanup, graceful shutdown)
- Race conditions handled via per-session locks
**No blockers or concerns**
---
*Phase: 03-lifecycle-management*
*Completed: 2026-02-04*

View file

@ -1,64 +0,0 @@
# Phase 3: Lifecycle Management - Context
**Gathered:** 2026-02-04
**Status:** Ready for planning
<domain>
## Phase Boundary
Sessions suspend automatically after configurable idle timeout and resume transparently with full conversation history. Includes `/timeout` and `/sessions` commands. Graceful cleanup on bot restart with no zombie processes.
</domain>
<decisions>
## Implementation Decisions
### Suspend/resume feedback
- Silent suspension — no notification sent when a session auto-suspends
- On resume: send brief status message ("Resuming session...") before Claude's actual response
- Claude's Discretion: whether to include idle duration in the resume message
### Resume mechanism
- Use Claude Code's `--resume` flag to restore full conversation state from before suspension
- If resume fails: send error message to user and wait for their decision (don't auto-start fresh)
### Idle detection rules
- Activity = both user messages AND Claude responses/tool use reset the idle timer
- Never suspend while Claude is actively processing (timer only starts after response completes)
- Per-session idle timeout, stored in session metadata
- Default: 10 minutes
- Claude's Discretion: allowed range for `/timeout` command
### Multi-session behavior
- Multiple sessions can have live subprocesses simultaneously, each with independent idle timers
- Switching sessions with `/session` leaves the previous session's subprocess running (it'll suspend on its own idle timer)
- Claude's Discretion: `/sessions` list format and detail level
- Claude's Discretion: message routing approach (always-active-session vs session prefix)
### Startup & cleanup
- On bot restart: all sessions start in suspended state — no subprocesses launched until user messages
- Unlimited total sessions — no cap, sessions accumulate until manually archived
- Claude's Discretion: suspend method (SIGTERM + --resume vs SIGSTOP/SIGCONT) based on resource trade-offs
</decisions>
<specifics>
## Specific Ideas
- Resume should feel fast and transparent — user messages a suspended session and it "just works" with a brief status indicator
- The system should be invisible when working — no notifications on suspend, minimal friction on resume
- Per-session timeouts allow long-running task sessions to stay alive longer while quick sessions clean up fast
</specifics>
<deferred>
## Deferred Ideas
None — discussion stayed within phase scope
</deferred>
---
*Phase: 03-lifecycle-management*
*Context gathered: 2026-02-04*

View file

@ -1,951 +0,0 @@
# Phase 3: Lifecycle Management - Research
**Researched:** 2026-02-04
**Domain:** Process lifecycle (suspend/resume), asyncio idle timeout detection, graceful shutdown patterns, Claude Code --resume flag
**Confidence:** HIGH
## Summary
Phase 3 implements automatic session suspension after configurable idle timeout and transparent resumption with full conversation history. The core technical challenges are: (1) detecting true idle state (no user messages AND no Claude activity), (2) choosing between SIGSTOP/SIGCONT (pause in-place) vs SIGTERM + --resume (terminate and restart), and (3) graceful cleanup on bot restart to prevent zombie processes.
Research confirms that asyncio provides robust timeout primitives (`asyncio.Event`, `asyncio.wait_for`, `asyncio.create_task`) for per-session idle timers. Claude Code's `--continue` flag already handles session resumption from `.claude/` state in the session directory — no separate `--resume` flag is needed when using persistent subprocesses in one directory. The critical decision is suspension method: SIGSTOP/SIGCONT saves spawn overhead but keeps memory allocated, while SIGTERM + restart trades memory for CPU overhead.
Key findings: (1) Idle detection requires tracking both user message time AND Claude completion time to avoid suspending mid-processing, (2) SIGSTOP/SIGCONT keeps process memory allocated but saves ~1s restart overhead, (3) SIGTERM + --continue is safer for long idle periods (releases memory, prevents stale state), (4) Graceful shutdown requires signal handlers to cancel idle timer tasks and terminate subprocesses with timeout + SIGKILL fallback.
**Primary recommendation:** Use SIGTERM + restart approach for suspension. Track last activity timestamp per session. After idle timeout, terminate subprocess gracefully (SIGTERM with 5s timeout, SIGKILL fallback). On next user message, spawn fresh subprocess with `--continue` to restore context. This balances memory efficiency (released during idle) with reasonable restart cost (~1s). Store timeout value in session metadata for per-session configuration.
## Standard Stack
The established libraries/tools for this domain:
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| asyncio | stdlib (3.12+) | Timeout detection, task scheduling, signal handling | Native async primitives for idle timers, event-based cancellation |
| Claude Code CLI | 2.1.31+ | Session resumption via --continue | Built-in session state persistence to `.claude/` directory |
| signal (stdlib) | stdlib | SIGTERM/SIGKILL for graceful shutdown | Standard Unix signal handling for process termination |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| datetime (stdlib) | stdlib | Last activity timestamps | Track idle periods per session |
| json (stdlib) | stdlib | Session metadata updates | Store timeout configuration per session |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| SIGTERM + restart | SIGSTOP/SIGCONT | Pause keeps memory but saves 1s restart; terminate releases memory but costs CPU |
| Per-session timers | Global timeout for all sessions | Per-session allows custom timeouts (long for task sessions, short for chat) |
| asyncio.Event cancellation | Thread-based timers | asyncio integrates cleanly with subprocess management, threads add complexity |
**Installation:**
```bash
# All components are stdlib or already installed
python3 --version # 3.12+ required for modern asyncio
claude --version # 2.1.31 (already installed)
```
## Architecture Patterns
### Recommended Lifecycle State Machine
```
Session States:
├── Created (no subprocess) → User message → Active
├── Active (subprocess running, processing) → Completion → Idle
├── Idle (subprocess running, waiting) → Timeout → Suspended
├── Suspended (no subprocess) → User message → Active (restart)
└── Any state → Bot restart → Suspended (cleanup)
Idle Timer:
- Starts: After Claude completion event (subprocess.on_complete)
- Resets: On user message OR Claude starts processing
- Fires: After idle_timeout seconds of inactivity
- Action: Terminate subprocess (SIGTERM, 5s timeout, SIGKILL fallback)
```
### Pattern 1: Per-Session Idle Timer with asyncio
**What:** Track last activity timestamp, spawn background task to check timeout, cancel on activity
**When to use:** After each message completion, restart on new message
**Example:**
```python
# Source: https://docs.python.org/3/library/asyncio-task.html
import asyncio
from datetime import datetime, timezone
class SessionIdleTimer:
"""Manages idle timeout for a session."""
def __init__(self, session_name: str, timeout_seconds: int, on_timeout: callable):
self.session_name = session_name
self.timeout_seconds = timeout_seconds
self.on_timeout = on_timeout
self._timer_task: Optional[asyncio.Task] = None
self._last_activity = datetime.now(timezone.utc)
def reset(self):
"""Reset idle timer on activity."""
self._last_activity = datetime.now(timezone.utc)
# Cancel existing timer
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
# Start new timer
self._timer_task = asyncio.create_task(self._wait_for_timeout())
async def _wait_for_timeout(self):
"""Wait for timeout duration, then fire callback."""
try:
await asyncio.sleep(self.timeout_seconds)
# Timeout reached - fire callback
await self.on_timeout(self.session_name)
except asyncio.CancelledError:
# Timer was reset by activity
pass
def cancel(self):
"""Cancel idle timer on session shutdown."""
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
# Usage in bot
idle_timers: dict[str, SessionIdleTimer] = {}
async def on_message_complete(session_name: str):
"""Called when Claude finishes processing."""
# Start idle timer after completion
if session_name not in idle_timers:
timeout = get_session_timeout(session_name) # From metadata
idle_timers[session_name] = SessionIdleTimer(
session_name,
timeout,
on_timeout=suspend_session
)
idle_timers[session_name].reset()
async def on_user_message(session_name: str, message: str):
"""Called when user sends message."""
# Reset timer on activity
if session_name in idle_timers:
idle_timers[session_name].reset()
# Send to Claude...
```
### Pattern 2: Graceful Subprocess Termination
**What:** Send SIGTERM, wait for clean exit with timeout, SIGKILL if needed
**When to use:** Suspending session, bot shutdown, session archival
**Example:**
```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/
import asyncio
import signal
async def terminate_subprocess_gracefully(
process: asyncio.subprocess.Process,
timeout: int = 5
) -> None:
"""
Terminate subprocess with graceful shutdown.
1. Close stdin to signal end of input
2. Send SIGTERM for graceful shutdown
3. Wait up to timeout seconds
4. SIGKILL if still running
5. Always reap process to prevent zombie
"""
if not process or process.returncode is not None:
return # Already terminated
try:
# Close stdin to signal no more input
if process.stdin:
process.stdin.close()
await process.stdin.wait_closed()
# Send SIGTERM for graceful shutdown
process.terminate()
# Wait for clean exit
try:
await asyncio.wait_for(process.wait(), timeout=timeout)
logger.info(f"Process {process.pid} terminated gracefully")
except asyncio.TimeoutError:
# Timeout - force kill
logger.warning(f"Process {process.pid} did not terminate, sending SIGKILL")
process.kill()
await process.wait() # CRITICAL: Always reap to prevent zombie
logger.info(f"Process {process.pid} killed")
except Exception as e:
logger.error(f"Error terminating process: {e}")
# Force kill as last resort
try:
process.kill()
await process.wait()
except:
pass
```
### Pattern 3: Session Resume with --continue
**What:** Spawn subprocess with `--continue` flag to restore conversation from `.claude/` state
**When to use:** First message after suspension, bot restart resuming active session
**Example:**
```python
# Source: https://code.claude.com/docs/en/cli-reference
async def resume_session(session_name: str) -> ClaudeSubprocess:
"""
Resume suspended session by spawning subprocess with --continue.
Claude Code automatically loads conversation history from .claude/
directory in session folder.
"""
session_dir = get_session_dir(session_name)
persona = load_persona_for_session(session_name)
# Check if .claude directory exists (has prior conversation)
has_history = (session_dir / ".claude").exists()
cmd = [
'claude',
'-p',
'--input-format', 'stream-json',
'--output-format', 'stream-json',
'--verbose',
'--dangerously-skip-permissions',
]
# Add --continue if session has history
if has_history:
cmd.append('--continue')
logger.info(f"Resuming session '{session_name}' with --continue")
else:
logger.info(f"Starting fresh session '{session_name}'")
# Add persona settings (model, system prompt, etc)
if persona:
settings = persona.get('settings', {})
if 'model' in settings:
cmd.extend(['--model', settings['model']])
if 'system_prompt' in persona:
cmd.extend(['--append-system-prompt', persona['system_prompt']])
# Spawn subprocess
subprocess = ClaudeSubprocess(
session_dir=session_dir,
persona=persona,
on_output=...,
on_error=...,
on_complete=lambda: on_message_complete(session_name),
on_status=...,
on_tool_use=...,
)
await subprocess.start()
return subprocess
```
### Pattern 4: Bot Shutdown with Subprocess Cleanup
**What:** Signal handler to cancel all idle timers and terminate all subprocesses on SIGTERM/SIGINT
**When to use:** Bot stop, systemctl stop, Ctrl+C
**Example:**
```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/ +
# https://github.com/wbenny/python-graceful-shutdown
import signal
import asyncio
async def shutdown(sig: signal.Signals, loop: asyncio.AbstractEventLoop):
"""
Graceful shutdown handler for bot.
1. Log signal received
2. Cancel all idle timers
3. Terminate all subprocesses gracefully
4. Cancel all outstanding tasks
5. Stop event loop
"""
logger.info(f"Received exit signal {sig.name}")
# Cancel all idle timers
for timer in idle_timers.values():
timer.cancel()
# Terminate all active subprocesses
termination_tasks = []
for session_name, subprocess in subprocesses.items():
if subprocess.is_alive:
logger.info(f"Terminating subprocess for session '{session_name}'")
termination_tasks.append(
terminate_subprocess_gracefully(subprocess._process, timeout=5)
)
# Wait for all terminations to complete
if termination_tasks:
await asyncio.gather(*termination_tasks, return_exceptions=True)
# Cancel all other tasks
tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
for task in tasks:
task.cancel()
# Wait for cancellation, ignore exceptions
await asyncio.gather(*tasks, return_exceptions=True)
# Stop the loop
loop.stop()
# Install signal handlers on startup
def main():
app = Application.builder().token(TOKEN).build()
# Add signal handlers
loop = asyncio.get_event_loop()
signals = (signal.SIGTERM, signal.SIGINT)
for sig in signals:
loop.add_signal_handler(
sig,
lambda s=sig: asyncio.create_task(shutdown(s, loop))
)
# Start bot
app.run_polling()
```
### Pattern 5: Session Metadata for Timeout Configuration
**What:** Store idle_timeout in session metadata, allow per-session customization via /timeout command
**When to use:** Session creation, /timeout command handler
**Example:**
```python
# Session metadata structure
{
"name": "task-session",
"created": "2026-02-04T12:00:00+00:00",
"last_active": "2026-02-04T12:30:00+00:00",
"persona": "default",
"pid": null,
"status": "suspended",
"idle_timeout": 600 # seconds (10 minutes)
}
# /timeout command handler
async def timeout_cmd(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""Set idle timeout for active session."""
if not context.args:
# Show current timeout
active = session_manager.get_active_session()
if not active:
await update.message.reply_text("No active session")
return
metadata = session_manager.get_session(active)
timeout = metadata.get('idle_timeout', 600)
await update.message.reply_text(
f"Current idle timeout: {timeout // 60} minutes\n\n"
f"Usage: /timeout <minutes>"
)
return
# Parse timeout value
try:
minutes = int(context.args[0])
if minutes < 1 or minutes > 120:
await update.message.reply_text("Timeout must be between 1 and 120 minutes")
return
timeout_seconds = minutes * 60
except ValueError:
await update.message.reply_text("Invalid number. Usage: /timeout <minutes>")
return
# Update session metadata
active = session_manager.get_active_session()
session_manager.update_session(active, idle_timeout=timeout_seconds)
# Restart idle timer with new timeout
if active in idle_timers:
idle_timers[active].timeout_seconds = timeout_seconds
idle_timers[active].reset()
await update.message.reply_text(f"Idle timeout set to {minutes} minutes")
```
### Pattern 6: /sessions Command with Status Display
**What:** List all sessions with name, status, persona, last active time, sorted by activity
**When to use:** User wants to see session overview
**Example:**
```python
async def sessions_cmd(update: Update, context: ContextTypes.DEFAULT_TYPE):
"""List all sessions sorted by last activity."""
sessions = session_manager.list_sessions()
if not sessions:
await update.message.reply_text("No sessions found. Use /new <name> to create one.")
return
active_session = session_manager.get_active_session()
# Build formatted list
lines = ["*Sessions:*\n"]
for session in sessions: # Already sorted by last_active
name = session['name']
status = session['status']
persona = session.get('persona', 'default')
last_active = session.get('last_active', 'unknown')
# Format timestamp
try:
dt = datetime.fromisoformat(last_active)
time_str = dt.strftime('%Y-%m-%d %H:%M')
except:
time_str = 'unknown'
# Mark active session
marker = "→ " if name == active_session else " "
# Status emoji
emoji = "🟢" if status == "active" else "🔵" if status == "idle" else "⚪"
lines.append(
f"{marker}{emoji} `{name}` ({persona})\n"
f" {time_str}"
)
await update.message.reply_text("\n".join(lines), parse_mode='Markdown')
```
### Anti-Patterns to Avoid
- **Suspending during processing:** Never suspend while `subprocess.is_busy` is True — will lose in-progress work
- **Not resetting timer on user message:** If idle timer only resets on completion, user's message during timeout window gets ignored
- **Zombie processes on bot crash:** Without signal handlers, subprocess outlives bot and becomes zombie (orphaned)
- **SIGSTOP without resource consideration:** Paused processes hold memory, file handles, network sockets — unsafe for long idle periods
- **Shared idle timer for all sessions:** Different sessions have different needs (task vs chat), per-session timeout is more flexible
## Don't Hand-Roll
Problems that look simple but have existing solutions:
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Idle timeout detection | Manual timestamp checks in loop | asyncio.Event + asyncio.sleep() | Event-based cancellation is cleaner, no polling overhead |
| Graceful shutdown | Just process.terminate() | SIGTERM + timeout + SIGKILL pattern | Prevents zombie processes, handles hung processes |
| Per-object timers | Single global timeout thread | asyncio.create_task per session | Native async integration, automatic cleanup |
| Resume conversation | Manual state serialization | Claude Code --continue flag | Built-in, tested, handles all edge cases |
**Key insight:** Process lifecycle management has subtle races (subprocess dies mid-shutdown, signal arrives during cleanup, timer fires after cancellation). Using battle-tested patterns (signal handlers, timeout with fallback, event-based cancellation) prevents these races. Don't reinvent async subprocess management.
## Common Pitfalls
### Pitfall 1: Race Between Timer Fire and User Message
**What goes wrong:** Idle timer fires (subprocess terminated), user message arrives during termination, new subprocess spawns, old one still dying — two subprocesses running
**Why it happens:** Timer callback and message handler run concurrently. No synchronization between timer firing and subprocess state change.
**How to avoid:** Use asyncio.Lock around subprocess state transitions (terminate, spawn). Timer callback acquires lock before terminating, message handler acquires lock before spawning.
**Warning signs:** Duplicate responses, sessions becoming unresponsive, "subprocess already running" errors
```python
# WRONG - No synchronization
async def on_timeout(session_name):
await terminate_subprocess(session_name)
async def on_message(session_name, message):
subprocess = await spawn_subprocess(session_name)
await subprocess.send_message(message)
# RIGHT - Lock around transitions
subprocess_locks: dict[str, asyncio.Lock] = {}
async def on_timeout(session_name):
async with subprocess_locks[session_name]:
await terminate_subprocess(session_name)
async def on_message(session_name, message):
async with subprocess_locks[session_name]:
if not subprocess_exists(session_name):
await spawn_subprocess(session_name)
await subprocess.send_message(message)
```
### Pitfall 2: Terminating Subprocess During Tool Execution
**What goes wrong:** Claude is running a long tool (git clone, npm install), idle timer fires, subprocess terminated mid-operation, corrupted state
**Why it happens:** Idle timer only checks elapsed time since last message, doesn't check if subprocess is actively executing tools.
**How to avoid:** Track subprocess busy state (`is_busy` flag set during processing). Only start idle timer after `on_complete` callback fires (subprocess is truly idle).
**Warning signs:** Corrupted git repos, partial file writes, timeout errors from tools
```python
# WRONG - Timer starts immediately after message send
await subprocess.send_message(message)
idle_timers[session_name].reset() # Bad: Claude still processing
# RIGHT - Timer starts after completion
await subprocess.send_message(message)
# ... subprocess processes, calls tools, emits result event ...
# on_complete callback fires
async def on_complete():
idle_timers[session_name].reset() # Good: Claude is truly idle
```
### Pitfall 3: Not Canceling Idle Timer on Session Switch
**What goes wrong:** Switch from session A to session B, session A's timer fires 5 minutes later, terminates session A subprocess (which might have been switched back to)
**Why it happens:** Session switch doesn't cancel old session's timer, timer continues running independently
**How to avoid:** When switching sessions, don't cancel old timer — let it run. Old subprocess suspends on its own timer. This allows multiple concurrent sessions with independent lifetimes.
**Warning signs:** Sessions suspend unexpectedly after switching away and back
```python
# CORRECT - Don't cancel old timer on switch
async def switch_session(new_session_name):
old_session = get_active_session()
# Don't touch old session's timer - let it suspend naturally
# if old_session in idle_timers:
# idle_timers[old_session].cancel() # NO
set_active_session(new_session_name)
# Start new session's timer if needed
if new_session_name not in idle_timers:
# Create timer for new session
pass
```
### Pitfall 4: Subprocess Outlives Bot on Crash
**What goes wrong:** Bot crashes or is killed with SIGKILL, signal handlers never run, subprocesses become orphans, eat memory/CPU
**Why it happens:** SIGKILL can't be caught (by design), no cleanup code runs
**How to avoid:** Can't prevent SIGKILL zombies, but minimize with: (1) Store PID in session metadata, check on bot restart, (2) Use systemd with KillMode=control-group to kill all child processes, (3) Bot startup cleanup: scan for orphaned pids from metadata
**Warning signs:** Multiple claude processes running after bot restart, memory usage grows over time
```python
# Startup cleanup - kill orphaned subprocesses
async def cleanup_orphaned_subprocesses():
"""Kill any subprocesses that outlived previous bot run."""
sessions = session_manager.list_sessions()
for session in sessions:
pid = session.get('pid')
if pid:
# Check if process still exists
try:
os.kill(pid, 0) # Signal 0 = check existence
# Process exists - kill it
logger.warning(f"Killing orphaned subprocess: PID {pid}")
os.kill(pid, signal.SIGTERM)
await asyncio.sleep(2)
try:
os.kill(pid, signal.SIGKILL)
except ProcessLookupError:
pass # Already dead
except ProcessLookupError:
pass # Already dead
# Clear PID from metadata
session_manager.update_session(session['name'], pid=None, status='suspended')
```
### Pitfall 5: Storing Stale PIDs in Metadata
**What goes wrong:** Session metadata shows pid=12345, but subprocess already terminated. On bot restart, try to kill PID 12345 which is now a different process.
**Why it happens:** Subprocess crashes or is manually killed, metadata not updated
**How to avoid:** Clear PID from metadata when subprocess terminates (exit code detected). Before killing PID from metadata, verify it's a claude process (check /proc/{pid}/cmdline on Linux).
**Warning signs:** Bot kills wrong processes on restart, random crashes
```python
# Safe PID cleanup with verification
async def kill_subprocess_by_pid(pid: int):
"""Kill subprocess with PID verification."""
try:
# Verify it's a claude process (Linux-specific)
cmdline_path = f"/proc/{pid}/cmdline"
if os.path.exists(cmdline_path):
with open(cmdline_path) as f:
cmdline = f.read()
if 'claude' not in cmdline:
logger.warning(f"PID {pid} is not a claude process: {cmdline}")
return # Don't kill
# Kill the process
os.kill(pid, signal.SIGTERM)
await asyncio.sleep(2)
try:
os.kill(pid, signal.SIGKILL)
except ProcessLookupError:
pass
except ProcessLookupError:
pass # Already dead
except Exception as e:
logger.error(f"Error killing PID {pid}: {e}")
```
## Code Examples
Verified patterns from official sources:
### Complete Idle Timer Implementation
```python
# Source: https://docs.python.org/3/library/asyncio-task.html
import asyncio
from datetime import datetime, timezone
from typing import Callable, Optional
class SessionIdleTimer:
"""
Per-session idle timeout manager.
Tracks last activity, spawns background task to fire after timeout.
Cancels and restarts timer on activity (reset).
"""
def __init__(
self,
session_name: str,
timeout_seconds: int,
on_timeout: Callable[[str], None]
):
"""
Args:
session_name: Session identifier
timeout_seconds: Idle seconds before firing
on_timeout: Async callback(session_name) to invoke on timeout
"""
self.session_name = session_name
self.timeout_seconds = timeout_seconds
self.on_timeout = on_timeout
self._timer_task: Optional[asyncio.Task] = None
self._last_activity = datetime.now(timezone.utc)
def reset(self):
"""Reset timer on activity (user message or completion)."""
self._last_activity = datetime.now(timezone.utc)
# Cancel existing timer
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
# Start fresh timer
self._timer_task = asyncio.create_task(self._wait_for_timeout())
async def _wait_for_timeout(self):
"""Background task that waits for timeout duration."""
try:
await asyncio.sleep(self.timeout_seconds)
# Timeout reached - invoke callback
await self.on_timeout(self.session_name)
except asyncio.CancelledError:
# Timer was reset by activity
pass
def cancel(self):
"""Cancel timer on session shutdown."""
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
@property
def seconds_since_activity(self) -> float:
"""Get seconds elapsed since last activity."""
delta = datetime.now(timezone.utc) - self._last_activity
return delta.total_seconds()
```
### Graceful Subprocess Termination with Timeout
```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/
import asyncio
import signal
import logging
logger = logging.getLogger(__name__)
async def terminate_subprocess_gracefully(
process: asyncio.subprocess.Process,
timeout: int = 5
) -> None:
"""
Terminate subprocess with graceful shutdown sequence.
1. Close stdin (signal end of input)
2. Send SIGTERM (request graceful shutdown)
3. Wait up to timeout seconds
4. Send SIGKILL if still running (force kill)
5. Always reap process (prevent zombie)
Args:
process: asyncio subprocess to terminate
timeout: Seconds to wait before SIGKILL
"""
if not process or process.returncode is not None:
logger.debug("Process already terminated")
return
pid = process.pid
logger.info(f"Terminating subprocess PID {pid}")
try:
# Close stdin to signal no more input
if process.stdin and not process.stdin.is_closing():
process.stdin.close()
await process.stdin.wait_closed()
# Send SIGTERM for graceful exit
process.terminate()
# Wait for clean exit with timeout
try:
await asyncio.wait_for(process.wait(), timeout=timeout)
logger.info(f"Process {pid} terminated gracefully")
except asyncio.TimeoutError:
# Timeout - force kill
logger.warning(f"Process {pid} did not exit within {timeout}s, sending SIGKILL")
process.kill()
await process.wait() # CRITICAL: Reap to prevent zombie
logger.info(f"Process {pid} killed")
except Exception as e:
logger.error(f"Error terminating process {pid}: {e}")
# Last resort force kill
try:
process.kill()
await process.wait()
except:
pass
```
### Bot Shutdown Signal Handler
```python
# Source: https://roguelynn.com/words/asyncio-graceful-shutdowns/ +
# https://github.com/wbenny/python-graceful-shutdown
import signal
import asyncio
import logging
logger = logging.getLogger(__name__)
async def shutdown_handler(
sig: signal.Signals,
loop: asyncio.AbstractEventLoop,
idle_timers: dict,
subprocesses: dict
):
"""
Graceful shutdown handler for bot.
Invoked on SIGTERM/SIGINT to clean up before exit.
Steps:
1. Log signal received
2. Cancel all idle timers
3. Terminate all subprocesses with timeout
4. Cancel all other asyncio tasks
5. Stop event loop
Args:
sig: Signal that triggered shutdown
loop: Event loop to stop
idle_timers: Dict of SessionIdleTimer objects
subprocesses: Dict of ClaudeSubprocess objects
"""
logger.info(f"Received exit signal {sig.name}, initiating graceful shutdown")
# Step 1: Cancel all idle timers
logger.info("Canceling idle timers...")
for session_name, timer in idle_timers.items():
timer.cancel()
# Step 2: Terminate all active subprocesses
logger.info("Terminating subprocesses...")
termination_tasks = []
for session_name, subprocess in subprocesses.items():
if subprocess.is_alive:
logger.info(f"Terminating subprocess for '{session_name}'")
termination_tasks.append(
terminate_subprocess_gracefully(subprocess._process, timeout=5)
)
# Wait for all terminations (with exceptions handled)
if termination_tasks:
await asyncio.gather(*termination_tasks, return_exceptions=True)
# Step 3: Cancel all other asyncio tasks
logger.info("Canceling remaining tasks...")
tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
for task in tasks:
task.cancel()
# Wait for cancellations, ignore exceptions
await asyncio.gather(*tasks, return_exceptions=True)
# Step 4: Stop event loop
logger.info("Stopping event loop")
loop.stop()
# Install signal handlers in main()
def main():
"""Bot entry point with signal handler installation."""
app = Application.builder().token(TOKEN).build()
# Get event loop
loop = asyncio.get_event_loop()
# Install signal handlers for graceful shutdown
signals_to_handle = (signal.SIGTERM, signal.SIGINT)
for sig in signals_to_handle:
loop.add_signal_handler(
sig,
lambda s=sig: asyncio.create_task(
shutdown_handler(s, loop, idle_timers, subprocesses)
)
)
logger.info("Signal handlers installed")
# Start bot
app.run_polling()
```
### Session Resume with Status Message
```python
# Source: https://code.claude.com/docs/en/cli-reference
from datetime import datetime, timezone
async def resume_suspended_session(
bot,
chat_id: int,
session_name: str,
message: str
) -> None:
"""
Resume suspended session and send message.
Sends brief status message to user, spawns subprocess with --continue,
sends user's message to Claude.
Args:
bot: Telegram bot instance
chat_id: Telegram chat ID
session_name: Session to resume
message: User message to send after resume
"""
metadata = session_manager.get_session(session_name)
# Calculate idle duration
last_active = datetime.fromisoformat(metadata['last_active'])
now = datetime.now(timezone.utc)
idle_minutes = (now - last_active).total_seconds() / 60
# Send status message
if idle_minutes > 1:
status_text = f"Resuming session (idle for {int(idle_minutes)} min)..."
else:
status_text = "Resuming session..."
await bot.send_message(chat_id=chat_id, text=status_text)
# Spawn subprocess with --continue
session_dir = session_manager.get_session_dir(session_name)
persona = load_persona_for_session(session_name)
callbacks = make_callbacks(bot, chat_id, session_name)
subprocess = ClaudeSubprocess(
session_dir=session_dir,
persona=persona,
on_output=callbacks['on_output'],
on_error=callbacks['on_error'],
on_complete=lambda: on_completion(session_name),
on_status=callbacks['on_status'],
on_tool_use=callbacks['on_tool_use'],
)
await subprocess.start()
subprocesses[session_name] = subprocess
# Update metadata
session_manager.update_session(
session_name,
status='active',
last_active=now.isoformat(),
pid=subprocess._process.pid
)
# Send user's message
await subprocess.send_message(message)
# Start idle timer
timeout = metadata.get('idle_timeout', 600)
idle_timers[session_name] = SessionIdleTimer(
session_name,
timeout,
on_timeout=suspend_session
)
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| Manual timestamp polling | asyncio.Event + asyncio.sleep() | asyncio maturity (2020+) | Cleaner cancellation, no polling overhead |
| SIGKILL only | SIGTERM + timeout + SIGKILL fallback | Best practice evolution (2018+) | Prevents zombie processes, allows cleanup |
| Global timeout thread | Per-object asyncio tasks | Modern asyncio patterns (2022+) | Per-session configuration, native async integration |
| Manual state files | Claude Code --continue with .claude/ | Claude Code 2.0+ (2024) | Built-in, tested, handles edge cases |
| SIGSTOP/SIGCONT | SIGTERM + restart | Resource efficiency awareness (ongoing) | Releases memory during idle, safer for long periods |
**Deprecated/outdated:**
- **Thread-based timers for async code:** Mixing threading with asyncio adds complexity, use asyncio.create_task
- **Blocking time.sleep() in async context:** Use asyncio.sleep() instead
- **Not reaping terminated subprocesses:** Always call process.wait() to prevent zombies
## Open Questions
Things that couldn't be fully resolved:
1. **Optimal default idle timeout**
- What we know: Common ranges are 5-15 minutes for chat bots, longer for task automation
- What's unclear: What's the sweet spot for balancing memory usage vs restart friction?
- Recommendation: Start with 10 minutes default. Allow per-session override via /timeout. Monitor actual usage patterns and adjust.
2. **SIGSTOP/SIGCONT vs SIGTERM tradeoff**
- What we know: SIGSTOP keeps memory but saves restart cost (~1s), SIGTERM releases memory but costs CPU
- What's unclear: At what idle duration does memory savings outweigh restart cost?
- Recommendation: Use SIGTERM approach. Memory release is more important than 1s restart cost. Claude processes can grow large (100-500MB) with long conversations. SIGSTOP is only beneficial for <5min idle periods.
3. **Resume status message verbosity**
- What we know: User decision says "brief status message on resume"
- What's unclear: Should it show idle duration? Session name? Model?
- Recommendation: Show idle duration if >1 minute ("Resuming session (idle for 15 min)..."). Don't show session name (user knows what session they messaged). Keep brief.
4. **Multi-session concurrent subprocess limit**
- What we know: Multiple sessions can have live subprocesses simultaneously
- What's unclear: Should there be a cap? What if user has 20 sessions all active?
- Recommendation: No hard cap initially. Each subprocess uses ~100-500MB. On an 8GB system, 10-20 concurrent sessions is reasonable. Add warning in /sessions if >10 active. Add global concurrent limit (e.g., 15) in Phase 4 if needed.
5. **Session switch behavior for previous subprocess**
- What we know: User decision says "switching leaves previous subprocess running"
- What's unclear: Should switching reset the previous session's idle timer?
- Recommendation: Don't reset on switch. Previous session's timer continues from last activity. If it was idle for 8 minutes when you switched away, it will suspend in 2 more minutes. This is intuitive — switching doesn't "touch" the old session.
## Sources
### Primary (HIGH confidence)
- [Coroutines and Tasks - Python 3.14.3 Documentation](https://docs.python.org/3/library/asyncio-task.html) - Official asyncio timeout and task management
- [CLI reference - Claude Code Docs](https://code.claude.com/docs/en/cli-reference) - Official Claude Code --continue flag documentation
- [Graceful Shutdowns with asyncio - roguelynn](https://roguelynn.com/words/asyncio-graceful-shutdowns/) - Signal handlers and shutdown orchestration
- [python-graceful-shutdown - GitHub](https://github.com/wbenny/python-graceful-shutdown) - Complete example of shutdown patterns
- [Stopping and Resuming Processes with SIGSTOP and SIGCONT - TheLinuxCode](https://thelinuxcode.com/stop-process-using-sigstop-signal-linux/) - SIGSTOP/SIGCONT behavior and resource tradeoffs
### Secondary (MEDIUM confidence)
- [Session Management - Claude API Docs](https://platform.claude.com/docs/en/agent-sdk/sessions) - Session persistence patterns
- [SIGTERM, SIGKILL & SIGSTOP Signals - Medium](https://medium.com/@4techusage/sigterm-sigkill-sigstop-signals-63cb919431e8) - Signal comparison
- [A Complete Guide to Timeouts in Python - Better Stack](https://betterstack.com/community/guides/scaling-python/python-timeouts/) - Timeout mechanisms in Python
### Tertiary (LOW confidence)
- WebSearch results on asyncio subprocess management and idle detection patterns - Multiple sources, cross-referenced
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH - All stdlib components, Claude Code CLI verified
- Architecture: HIGH - Patterns based on official asyncio docs and battle-tested libraries
- Pitfalls: MEDIUM-HIGH - Common races and edge cases documented, some based on general async patterns rather than lifecycle-specific sources
**Research date:** 2026-02-04
**Valid until:** 2026-03-04 (30 days - asyncio stdlib is stable, Claude Code --continue is established)

View file

@ -1,801 +0,0 @@
# Architecture Research
**Domain:** Telegram Bot with Claude Code CLI Session Management
**Researched:** 2026-02-04
**Confidence:** HIGH
## Standard Architecture
### System Overview
```
┌─────────────────────────────────────────────────────────────────────┐
│ Telegram API (External) │
└────────────────────────────────┬────────────────────────────────────┘
│ (webhooks or polling)
┌─────────────────────────────────────────────────────────────────────┐
│ Bot Event Loop (asyncio) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Message │ │ Photo │ │ Document │ │
│ │ Handler │ │ Handler │ │ Handler │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ ↓ │
│ ┌─────────────────┐ │
│ │ Route to │ │
│ │ Session │ │
│ │ (path-based) │ │
│ └────────┬────────┘ │
└────────────────────────────┼─────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Session Manager │
│ │
│ ~/telegram/sessions/<session_name>/ │
│ ├── metadata.json (state, timestamps, config) │
│ ├── conversation.jsonl (message history) │
│ ├── images/ (attachments) │
│ ├── files/ (documents) │
│ └── .claude_session_id (Claude session ID for --resume) │
│ │
│ Session States: │
│ [IDLE] → [SPAWNING] → [ACTIVE] → [IDLE] → [SUSPENDED] │
│ │
│ Idle Timeout: 10 minutes of inactivity → graceful suspend │
│ │
└────────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Process Manager (per session) │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Claude Code CLI Process (subprocess) │ │
│ │ │ │
│ │ Command: claude --resume <session_id> \ │ │
│ │ --model haiku \ │ │
│ │ --output-format stream-json \ │ │
│ │ --input-format stream-json \ │ │
│ │ --no-interactive \ │ │
│ │ --dangerously-skip-permissions │ │
│ │ │ │
│ │ stdin ←─────── Message Queue (async) │ │
│ │ stdout ─────→ Response Buffer (async readline) │ │
│ │ stderr ─────→ Error Logger │ │
│ │ │ │
│ │ State: RUNNING | PROCESSING | IDLE | TERMINATED │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ Process lifecycle: │
│ 1. create_subprocess_exec() with PIPE streams │
│ 2. asyncio tasks for stdout reader + stderr reader │
│ 3. Message queue feeds stdin writer │
│ 4. Idle timeout monitor (background task) │
│ 5. Graceful shutdown: close stdin, await process.wait() │
│ │
└────────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Response Router │
│ │
│ Parses Claude Code --output-format stream-json: │
│ {"type": "text", "content": "..."} │
│ {"type": "tool_use", "name": "Read", "input": {...}} │
│ {"type": "tool_result", "tool_use_id": "...", "content": "..."} │
│ │
│ Routes output back to Telegram: │
│ - Buffers text chunks until complete message │
│ - Formats code blocks with Markdown │
│ - Splits long messages (4096 char Telegram limit) │
│ - Sends images via bot.send_photo() if Claude generates files │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
### Component Responsibilities
| Component | Responsibility | Typical Implementation |
|-----------|----------------|------------------------|
| **Bot Event Loop** | Receives Telegram updates (messages, photos, documents), dispatches to handlers | `python-telegram-bot` Application with async handlers |
| **Message Router** | Maps Telegram chat_id to session path, creates session if needed, loads/saves metadata | Path-based directory structure: `~/telegram/sessions/<name>/` |
| **Session Manager** | Owns session lifecycle: create, load, update metadata, check idle timeout, suspend/resume | Python class with async methods, uses file locks for concurrency safety |
| **Process Manager** | Spawns/manages Claude Code CLI subprocess per session, handles stdin/stdout/stderr streams | `asyncio.create_subprocess_exec()` with PIPE streams, background reader tasks |
| **Message Queue** | Buffers incoming messages from Telegram, feeds to Claude stdin as stream-json | `asyncio.Queue` per session, async writer task |
| **Response Buffer** | Reads stdout line-by-line, parses stream-json, accumulates text chunks | Async reader task with `process.stdout.readline()`, JSON parsing |
| **Response Router** | Formats Claude output for Telegram (Markdown, code blocks, chunking), sends via bot API | Telegram formatting helpers, message splitting logic |
| **Idle Monitor** | Tracks last activity timestamp per session, triggers graceful shutdown after timeout | Background `asyncio.Task` checking timestamps, calls suspend on timeout |
| **Cost Monitor** | Routes to Haiku for monitoring commands (/status, /pbs), switches to Opus for conversational messages | Model selection logic based on message type (command vs. text) |
## Recommended Project Structure
```
telegram/
├── bot.py # Main entry point (systemd service)
├── credentials # Bot token (existing)
├── authorized_users # Allowed chat IDs (existing)
├── inbox # Old single-session inbox (deprecated, remove after migration)
├── images/ # Old images dir (deprecated)
├── files/ # Old files dir (deprecated)
├── sessions/ # NEW: Multi-session storage
│ ├── main/ # Default session
│ │ ├── metadata.json
│ │ ├── conversation.jsonl
│ │ ├── images/
│ │ ├── files/
│ │ └── .claude_session_id
│ │
│ ├── homelab/ # Path-based session example
│ │ └── ...
│ │
│ └── dev/ # Another session
│ └── ...
└── lib/ # NEW: Modularized code
├── __init__.py
├── router.py # Message routing logic (chat_id → session)
├── session.py # Session class (metadata, state, paths)
├── process_manager.py # ProcessManager class (spawn, communicate, monitor)
├── stream_parser.py # Claude stream-json parser
├── telegram_formatter.py # Telegram response formatting
├── idle_monitor.py # Idle timeout background task
└── cost_optimizer.py # Model selection (Haiku vs Opus)
```
### Structure Rationale
- **sessions/ directory:** Path-based isolation, one directory per conversation context. Allows multiple simultaneous sessions without state bleeding. Each session directory is self-contained for easy inspection, backup, and debugging.
- **lib/ modularization:** Current bot.py is 375 lines with single-session logic. Multi-session with subprocess management will easily exceed 1000+ lines. Breaking into modules improves testability, readability, and allows incremental development.
- **Metadata files:** `metadata.json` stores session state (IDLE/ACTIVE/SUSPENDED), last activity timestamp, Claude session ID, and configuration (model choice, custom prompts). `conversation.jsonl` is append-only message log (one JSON object per line) for audit trail and potential Claude context replay.
- **Separation of concerns:** Each module has one job. Router doesn't know about processes. ProcessManager doesn't know about Telegram. Session class is pure data structure. This enables testing each component in isolation.
## Architectural Patterns
### Pattern 1: Path-Based Session Routing
**What:** Map Telegram chat_id to filesystem path `~/telegram/sessions/<name>/` to isolate conversation contexts. Session name derived from explicit user command (`/session <name>`) or defaults to "main".
**When to use:** When a single bot needs to maintain multiple independent conversation contexts for the same user (e.g., "homelab" for infrastructure work, "dev" for coding, "personal" for notes).
**Trade-offs:**
- **Pro:** Filesystem provides natural isolation, easy to inspect/backup/delete sessions, no database needed
- **Pro:** Path-based routing is conceptually simple and debuggable
- **Con:** File locks needed for concurrent access (though Telegram updates are sequential per chat_id)
- **Con:** Large number of sessions (1000+) could strain filesystem if poorly managed
**Example:**
```python
# router.py
class SessionRouter:
def __init__(self, base_path: Path):
self.base_path = base_path
self.chat_sessions = {} # chat_id → current session_name
def get_session_path(self, chat_id: int) -> Path:
"""Get current session path for chat_id."""
session_name = self.chat_sessions.get(chat_id, "main")
path = self.base_path / session_name
path.mkdir(parents=True, exist_ok=True)
return path
def switch_session(self, chat_id: int, session_name: str):
"""Switch chat_id to a different session."""
self.chat_sessions[chat_id] = session_name
```
### Pattern 2: Async Subprocess with Bidirectional Streams
**What:** Use `asyncio.create_subprocess_exec()` with PIPE streams for stdin/stdout/stderr. Launch separate async tasks for reading stdout and stderr to avoid deadlocks. Feed stdin via async queue.
**When to use:** When you need to interact with a long-running interactive CLI tool (like Claude Code) that reads from stdin and writes to stdout continuously.
**Trade-offs:**
- **Pro:** Python's asyncio subprocess module handles complex stream management
- **Pro:** Non-blocking I/O allows bot to remain responsive while Claude processes
- **Pro:** Separate reader tasks prevent buffer-full deadlocks
- **Con:** More complex than simple `subprocess.run()` or `communicate()`
- **Con:** Must manually manage process lifecycle (startup, shutdown, crashes)
**Example:**
```python
# process_manager.py
class ProcessManager:
async def spawn_claude(self, session_id: str, model: str = "haiku"):
"""Spawn Claude Code CLI subprocess."""
self.process = await asyncio.create_subprocess_exec(
"claude",
"--resume", session_id,
"--model", model,
"--output-format", "stream-json",
"--input-format", "stream-json",
"--no-interactive",
"--dangerously-skip-permissions",
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
# Launch reader tasks
self.stdout_task = asyncio.create_task(self._read_stdout())
self.stderr_task = asyncio.create_task(self._read_stderr())
self.state = "RUNNING"
async def _read_stdout(self):
"""Read stdout line-by-line, parse stream-json."""
while True:
line = await self.process.stdout.readline()
if not line:
break # EOF
try:
event = json.loads(line.decode())
await self.output_queue.put(event)
except json.JSONDecodeError as e:
logger.error(f"Failed to parse Claude output: {e}")
async def _read_stderr(self):
"""Log stderr output."""
while True:
line = await self.process.stderr.readline()
if not line:
break
logger.warning(f"Claude stderr: {line.decode().strip()}")
async def send_message(self, message: str):
"""Send message to Claude stdin as stream-json."""
event = {"type": "message", "content": message}
json_line = json.dumps(event) + "\n"
self.process.stdin.write(json_line.encode())
await self.process.stdin.drain()
```
### Pattern 3: State Machine for Session Lifecycle
**What:** Define explicit states for each session (IDLE, SPAWNING, ACTIVE, PROCESSING, SUSPENDED) with transitions based on events (message_received, response_sent, timeout_reached, user_command).
**When to use:** When managing complex lifecycle with timeouts, retries, and graceful shutdowns. State machine makes transitions explicit and debuggable.
**Trade-offs:**
- **Pro:** Clear semantics for what can happen in each state
- **Pro:** Easier to add new states (e.g., PAUSED, ERROR) without breaking existing logic
- **Pro:** Testable: can unit test state transitions independently
- **Con:** Overhead for simple cases (but this is not a simple case)
- **Con:** Requires discipline to update state consistently
**Example:**
```python
# session.py
from enum import Enum
class SessionState(Enum):
IDLE = "idle" # No process running, session directory exists
SPAWNING = "spawning" # Process being created
ACTIVE = "active" # Process running, waiting for input
PROCESSING = "processing" # Process running, handling a message
SUSPENDED = "suspended" # Timed out, process terminated, state saved
class Session:
def __init__(self, path: Path):
self.path = path
self.state = SessionState.IDLE
self.last_activity = datetime.now()
self.process_manager = None
self.claude_session_id = self._load_claude_session_id()
async def transition(self, new_state: SessionState):
"""Transition to new state with logging."""
logger.info(f"Session {self.path.name}: {self.state.value} → {new_state.value}")
self.state = new_state
self._save_metadata()
async def handle_message(self, message: str):
"""Main message handling logic."""
self.last_activity = datetime.now()
if self.state == SessionState.IDLE:
await self.transition(SessionState.SPAWNING)
await self._spawn_process()
await self.transition(SessionState.ACTIVE)
if self.state == SessionState.ACTIVE:
await self.transition(SessionState.PROCESSING)
await self.process_manager.send_message(message)
# Wait for response, transition back to ACTIVE when done
async def check_idle_timeout(self, timeout_seconds: int = 600):
"""Check if session should be suspended."""
if self.state in [SessionState.ACTIVE, SessionState.PROCESSING]:
idle_time = (datetime.now() - self.last_activity).total_seconds()
if idle_time > timeout_seconds:
await self.suspend()
async def suspend(self):
"""Gracefully shut down process, save state."""
if self.process_manager:
await self.process_manager.shutdown()
await self.transition(SessionState.SUSPENDED)
```
### Pattern 4: Cost Optimization with Model Switching
**What:** Use Haiku (cheap, fast) for monitoring commands that invoke helper scripts (`/status`, `/pbs`, `/beszel`). Switch to Opus (expensive, smart) for open-ended conversational messages.
**When to use:** When cost is a concern and some tasks don't need the most capable model.
**Trade-offs:**
- **Pro:** Significant cost savings (Haiku is 100x cheaper than Opus per million tokens)
- **Pro:** Faster responses for simple monitoring queries
- **Con:** Need to maintain routing logic for which messages use which model
- **Con:** Risk of using wrong model if classification is incorrect
**Example:**
```python
# cost_optimizer.py
class ModelSelector:
MONITORING_COMMANDS = {"/status", "/pbs", "/backups", "/beszel", "/kuma", "/ping"}
@staticmethod
def select_model(message: str) -> str:
"""Choose model based on message type."""
# Command messages use Haiku
if message.strip().startswith("/") and message.split()[0] in ModelSelector.MONITORING_COMMANDS:
return "haiku"
# Conversational messages use Opus
return "opus"
@staticmethod
async def spawn_with_model(session: Session, message: str):
"""Spawn Claude process with appropriate model."""
model = ModelSelector.select_model(message)
logger.info(f"Spawning Claude with model: {model}")
await session.process_manager.spawn_claude(
session_id=session.claude_session_id,
model=model
)
```
## Data Flow
### Request Flow
```
[User sends message in Telegram]
[Bot receives Update via polling]
[MessageHandler extracts text, chat_id]
[SessionRouter maps chat_id → session_path]
[Load Session from filesystem (metadata.json)]
[Check session state]
┌───────────────────────────────────────┐
│ State: IDLE or SUSPENDED │
│ ↓ │
│ ModelSelector chooses Haiku or Opus │
│ ↓ │
│ ProcessManager spawns Claude CLI: │
│ claude --resume <session_id> \ │
│ --model <haiku|opus> \ │
│ --output-format stream-json │
│ ↓ │
│ Session transitions to ACTIVE │
└───────────────────────────────────────┘
[Format message as stream-json]
[Write to process.stdin, drain buffer]
[Session transitions to PROCESSING]
[Claude processes request...]
```
### Response Flow
```
[Claude writes to stdout (stream-json events)]
[AsyncIO reader task reads line-by-line]
[Parse JSON: {"type": "text", "content": "..."}]
[StreamParser accumulates text chunks]
[Detect end-of-response marker]
[ResponseFormatter applies Markdown, splits long messages]
[Send to Telegram via bot.send_message()]
[Session transitions to ACTIVE]
[Update last_activity timestamp]
[IdleMonitor background task checks timeout]
┌───────────────────────────────────────┐
│ If idle > 10 minutes: │
│ ↓ │
│ Session.suspend() │
│ ↓ │
│ ProcessManager.shutdown(): │
│ - close stdin │
│ - await process.wait(timeout=5s) │
│ - force kill if still running │
│ ↓ │
│ Session transitions to SUSPENDED │
│ ↓ │
│ Save metadata (state, timestamp) │
└───────────────────────────────────────┘
```
### Key Data Flows
1. **Message ingestion:** Telegram Update → Handler → Router → Session → ProcessManager → Claude stdin
- Async all the way, no blocking calls
- Each session has independent queue to avoid cross-session interference
2. **Response streaming:** Claude stdout → Reader task → StreamParser → Formatter → Telegram API
- Line-by-line reading prevents memory issues with large responses
- Chunking respects Telegram's 4096 character limit per message
3. **File attachments:** Telegram photo/document → Download to `sessions/<name>/images/` or `files/` → Log to conversation.jsonl → Available for Claude via file path
- When user sends photo, log path to conversation so next message can reference it
- Claude can read images via Read tool if path is mentioned
4. **Idle timeout:** Background task checks `last_activity` every 60 seconds → If >10 min idle → Trigger graceful shutdown
- Prevents zombie processes accumulating and consuming resources
- Session state saved to disk, resumes transparently when user returns
## Scaling Considerations
| Scale | Architecture Adjustments |
|-------|--------------------------|
| 1-5 users (current) | Single LXC container, filesystem-based sessions, no database needed. Idle timeout prevents resource exhaustion. |
| 5-20 users | Add session cleanup job (delete sessions inactive >30 days). Monitor disk space for sessions/ directory. Consider Redis for chat_id → session_name mapping if restarting bot frequently. |
| 20-100 users | Move session storage to separate ZFS dataset with quota. Add metrics (Prometheus) for session count, process count, API cost. Implement rate limiting per user. Consider dedicated container for bot. |
| 100+ users | Multi-bot deployment (shard by chat_id). Centralized session storage (S3/MinIO). Queue-based architecture (RabbitMQ) to decouple Telegram polling from processing. Separate Claude API keys per bot instance to avoid rate limits. |
### Scaling Priorities
1. **First bottleneck:** Disk I/O from many sessions writing conversation logs concurrently
- **Fix:** Use ZFS with compression, optimize writes (batch metadata updates, async file I/O)
2. **Second bottleneck:** Claude API rate limits (multiple users sending messages simultaneously)
- **Fix:** Queue messages per user, implement retry with exponential backoff, surface "API busy" message to user
3. **Third bottleneck:** Memory usage from many concurrent Claude processes (each process ~100-200MB)
- **Fix:** Aggressive idle timeout (reduce from 10min to 5min), limit max concurrent sessions, queue requests if too many processes
## Anti-Patterns
### Anti-Pattern 1: Blocking I/O in Async Context
**What people do:** Call blocking `subprocess.run()` or `open().read()` directly in async handlers, blocking the entire event loop.
**Why it's wrong:** Telegram bot uses async event loop. Blocking call freezes all handlers until it completes, making bot unresponsive to other users.
**Do this instead:** Use `asyncio.create_subprocess_exec()` for subprocess, `aiofiles` for file I/O, or wrap blocking calls in `asyncio.to_thread()` (Python 3.9+).
```python
# ❌ BAD: Blocks event loop
async def handle_message(update, context):
result = subprocess.run(["long-command"], capture_output=True) # Blocks!
await update.message.reply_text(result.stdout)
# ✅ GOOD: Non-blocking async subprocess
async def handle_message(update, context):
process = await asyncio.create_subprocess_exec(
"long-command",
stdout=asyncio.subprocess.PIPE
)
stdout, _ = await process.communicate()
await update.message.reply_text(stdout.decode())
```
### Anti-Pattern 2: Using communicate() for Interactive Processes
**What people do:** Spawn subprocess and call `await process.communicate(input=message)` for every message, expecting bidirectional interaction.
**Why it's wrong:** `communicate()` sends input, closes stdin, and waits for process to exit. It's designed for one-shot commands, not interactive sessions. Process exits after first response.
**Do this instead:** Keep process alive, manually manage stdin/stdout streams with separate reader/writer tasks. Never call `communicate()` on long-running processes.
```python
# ❌ BAD: Process exits after first message
async def send_message(self, message):
stdout, stderr = await self.process.communicate(input=message.encode())
# Process is now dead, must spawn again for next message
# ✅ GOOD: Keep process alive
async def send_message(self, message):
self.process.stdin.write(message.encode() + b"\n")
await self.process.stdin.drain()
# Process still running, can send more messages
```
### Anti-Pattern 3: Ignoring Idle Processes
**What people do:** Spawn subprocess when user sends message, never clean up when user goes idle. Accumulate processes indefinitely.
**Why it's wrong:** Each Claude process consumes memory (~100-200MB). With 20 users, that's 4GB of RAM wasted on idle sessions. Container OOM kills bot.
**Do this instead:** Implement idle timeout monitor. Track `last_activity` per session. Background task checks every 60s, suspends sessions idle >10min.
```python
# ✅ GOOD: Idle monitoring
class IdleMonitor:
async def monitor_loop(self, sessions: dict[str, Session]):
"""Background task to check idle timeouts."""
while True:
await asyncio.sleep(60) # Check every minute
for session in sessions.values():
if session.state in [SessionState.ACTIVE, SessionState.PROCESSING]:
idle_time = (datetime.now() - session.last_activity).total_seconds()
if idle_time > 600: # 10 minutes
logger.info(f"Suspending idle session: {session.path.name}")
await session.suspend()
```
### Anti-Pattern 4: Mixing Session State Across Chats
**What people do:** Use single global conversation history for all chats, or use chat_id as session identifier without allowing multiple sessions per user.
**Why it's wrong:** User can't maintain separate contexts (e.g., "homelab" session for infra, "dev" session for coding). All conversations bleed together, Claude gets confused by mixed context.
**Do this instead:** Implement path-based routing with explicit session names. Allow user to switch sessions with `/session <name>` command. Each session has independent filesystem directory and Claude session ID.
```python
# ✅ GOOD: Path-based session isolation
class SessionRouter:
def get_or_create_session(self, chat_id: int, session_name: str = "main") -> Session:
"""Get session by chat_id and name."""
key = f"{chat_id}:{session_name}"
if key not in self.active_sessions:
path = self.base_path / str(chat_id) / session_name
self.active_sessions[key] = Session(path)
return self.active_sessions[key]
```
## Integration Points
### External Services
| Service | Integration Pattern | Notes |
|---------|---------------------|-------|
| **Telegram Bot API** | Polling via `Application.run_polling()`, async handlers receive `Update` objects | Rate limit: 30 messages/second per bot. Use `python-telegram-bot` v21.8+ for native asyncio support. |
| **Claude Code CLI** | Subprocess invocation with `--output-format stream-json`, bidirectional stdin/stdout communication | Must use `--no-interactive` flag for programmatic usage. `--dangerously-skip-permissions` required to avoid prompts blocking stdin. |
| **Homelab Helper Scripts** | Called via subprocess by Claude when responding to monitoring commands (`/status``~/bin/pbs status`) | Claude has access via Bash tool. Output captured in stdout, returned to user. |
| **Filesystem (Sessions)** | Direct file I/O for metadata, conversation logs, attachments. Use `aiofiles` for async file operations | Append-only `conversation.jsonl` provides audit trail and potential replay capability. |
### Internal Boundaries
| Boundary | Communication | Notes |
|----------|---------------|-------|
| **Bot ↔ SessionRouter** | Function calls: `router.get_session(chat_id)` returns `Session` object | Router owns mapping of chat_id to session. Stateless, can be rebuilt from filesystem. |
| **SessionRouter ↔ Session** | Function calls: `session.handle_message(text)` async method | Session encapsulates state machine, owns ProcessManager. |
| **Session ↔ ProcessManager** | Function calls: `process_manager.spawn_claude()`, `send_message()`, `shutdown()` async methods | ProcessManager owns subprocess lifecycle. Session doesn't know about asyncio streams. |
| **ProcessManager ↔ Claude CLI** | OS pipes: stdin (write), stdout (read), stderr (read) | Never use `communicate()` for interactive processes. Manual stream management required. |
| **StreamParser ↔ ResponseFormatter** | Function calls: `parser.accumulate(event)` returns buffered text, `formatter.format_for_telegram(text)` returns list of message chunks | Parser handles stream-json protocol, Formatter handles Telegram-specific quirks (Markdown escaping, 4096 char limit). |
| **IdleMonitor ↔ Session** | Background task calls `session.check_idle_timeout()` periodically | Monitor is global background task, iterates over all active sessions. |
## Build Order and Dependencies
Based on the architecture, here's the suggested build order with dependency reasoning:
### Phase 1: Foundation (Sessions & Routing)
**Goal:** Establish multi-session filesystem structure without subprocess management yet.
1. **Session class** (`lib/session.py`)
- Implement metadata file format (JSON schema for state, timestamps, config)
- Implement path-based directory creation
- Add state enum and state machine skeleton (transitions without actions)
- Add conversation.jsonl append logging
- **No dependencies** - pure data structure
2. **SessionRouter** (`lib/router.py`)
- Implement chat_id → session_name mapping
- Implement session creation/loading
- Add command parsing for `/session <name>` to switch sessions
- **Depends on:** Session class
3. **Update bot.py**
- Integrate SessionRouter into existing handlers
- Route all messages through router to session
- Add `/session` command handler
- **Depends on:** SessionRouter
- **Testing:** Can test routing without Claude integration by just logging messages to conversation.jsonl
### Phase 2: Process Management (Claude CLI Integration)
**Goal:** Spawn and communicate with Claude Code subprocess.
4. **StreamParser** (`lib/stream_parser.py`)
- Implement stream-json parsing (line-by-line JSON objects)
- Handle {"type": "text", "content": "..."} events
- Accumulate text chunks into complete messages
- **No dependencies** - pure parser
5. **ProcessManager** (`lib/process_manager.py`)
- Implement `spawn_claude()` with `asyncio.create_subprocess_exec()`
- Implement async stdout reader task using StreamParser
- Implement async stderr reader task for logging
- Implement `send_message()` to write stdin
- Implement graceful `shutdown()` (close stdin, wait, force kill if hung)
- **Depends on:** StreamParser
6. **Integrate ProcessManager into Session**
- Update state machine to spawn process on first message (IDLE → SPAWNING → ACTIVE)
- Implement `handle_message()` to pipe to ProcessManager
- Add response buffering and state transitions (PROCESSING → ACTIVE)
- **Depends on:** ProcessManager
- **Testing:** Send message to session, verify Claude responds, check process terminates on shutdown
### Phase 3: Response Formatting & Telegram Integration
**Goal:** Format Claude output for Telegram and handle attachments.
7. **TelegramFormatter** (`lib/telegram_formatter.py`)
- Implement Markdown escaping for Telegram Bot API
- Implement message chunking (4096 char limit)
- Implement code block detection and formatting
- **No dependencies** - pure formatter
8. **Update Session to use formatter**
- Pipe ProcessManager output through TelegramFormatter
- Send formatted chunks to Telegram via bot API
- **Depends on:** TelegramFormatter
9. **File attachment handling**
- Update photo/document handlers to save to session-specific paths
- Log file paths to conversation.jsonl
- Mention file path in next message to Claude stdin (so Claude can read it)
- **Depends on:** Session with path structure
### Phase 4: Cost Optimization & Monitoring
**Goal:** Implement model selection and idle timeout.
10. **ModelSelector** (`lib/cost_optimizer.py`)
- Implement command detection logic
- Implement model selection (Haiku for commands, Opus for conversation)
- **No dependencies** - pure routing logic
11. **Update Session to use ModelSelector**
- Call ModelSelector before spawning process
- Pass selected model to `spawn_claude(model=...)`
- **Depends on:** ModelSelector
12. **IdleMonitor** (`lib/idle_monitor.py`)
- Implement background task to check last_activity timestamps
- Call `session.suspend()` on timeout
- **Depends on:** Session with suspend() method
13. **Integrate IdleMonitor into bot.py**
- Launch monitor as background task on bot startup
- Pass sessions dict to monitor
- **Depends on:** IdleMonitor
- **Testing:** Send message, wait >10min (or reduce timeout for testing), verify process terminates
### Phase 5: Production Hardening
**Goal:** Error handling, logging, recovery.
14. **Error handling**
- Add try/except around all async operations
- Implement retry logic for Claude spawn failures
- Handle Claude process crashes (respawn on next message)
- Log all errors to structured format (JSON logs for parsing)
15. **Session recovery**
- On bot startup, scan sessions/ directory
- Load all ACTIVE sessions, transition to SUSPENDED (processes are dead)
- User's next message will respawn process transparently
16. **Monitoring & Metrics**
- Add `/sessions` command to list active sessions
- Add `/session_stats` to show process count, memory usage
- Log session lifecycle events (spawn, suspend, terminate) for analysis
### Dependencies Summary
```
Phase 1 (Foundation):
Session (no deps)
SessionRouter (→ Session)
bot.py integration (→ SessionRouter)
Phase 2 (Process Management):
StreamParser (no deps)
ProcessManager (→ StreamParser)
Session integration (→ ProcessManager)
Phase 3 (Formatting):
TelegramFormatter (no deps)
Session integration (→ TelegramFormatter)
File handling (→ Session paths)
Phase 4 (Optimization):
ModelSelector (no deps) → Session integration
IdleMonitor (→ Session) → bot.py integration
Phase 5 (Hardening):
Error handling (all components)
Session recovery (→ Session, SessionRouter)
Monitoring (→ all components)
```
## Critical Design Decisions
### 1. Why Not Use `communicate()` for Interactive Sessions?
`asyncio` documentation is clear: `communicate()` is designed for one-shot commands. It sends input, **closes stdin**, reads output, and waits for process exit. For interactive sessions where we need to send multiple messages without restarting the process, we must manually manage streams with separate reader/writer tasks.
**Source:** [Python asyncio subprocess documentation](https://docs.python.org/3/library/asyncio-subprocess.html)
### 2. Why Path-Based Sessions Instead of Database?
For this scale (1-20 users), filesystem is simpler:
- **Inspection:** `ls sessions/` shows all sessions, `cat sessions/main/metadata.json` shows state
- **Backup:** `tar -czf sessions.tar.gz sessions/` is trivial
- **Debugging:** Files are human-readable JSON/JSONL
- **No dependencies:** No database server to run/maintain
At 100+ users, reconsider. But for homelab use case, filesystem wins on simplicity.
### 3. Why Separate Sessions Instead of Single Conversation?
User explicitly requested "path-based session management" in project context. Use case: separate "homelab" context from "dev" context. Single conversation would mix contexts and confuse Claude. Sessions provide clean isolation.
### 4. Why Idle Timeout Instead of Keeping Processes Forever?
Each Claude process consumes ~100-200MB RAM. On LXC container with limited resources, 10 idle processes = 1-2GB wasted. Idle timeout ensures resources freed when not in use, process transparently respawns on next message.
### 5. Why Haiku for Monitoring Commands?
Monitoring commands (`/status`, `/pbs`) invoke helper scripts that return structured data. Claude's role is minimal (format output, maybe add explanation). Haiku is sufficient and 100x cheaper. Save Opus for complex analysis and conversation.
**Cost reference:** As of 2026, Claude 4.5 Haiku costs $0.80/$4.00 per million tokens (input/output), while Opus costs $15/$75 per million tokens.
## Sources
### High Confidence (Official Documentation)
- [Python asyncio subprocess documentation](https://docs.python.org/3/library/asyncio-subprocess.html) - Process class methods, create_subprocess_exec, deadlock warnings
- [Claude Code CLI reference](https://code.claude.com/docs/en/cli-reference) - All CLI flags, --resume usage, --output-format stream-json, --no-interactive mode
- [python-telegram-bot documentation](https://docs.python-telegram-bot.org/) - Application class, async handlers, ConversationHandler for state management
### Medium Confidence (Implementation Guides & Community)
- [Python subprocess bidirectional communication patterns](https://pymotw.com/3/asyncio/subprocesses.html) - Practical examples of PIPE usage
- [Streaming subprocess stdin/stdout with asyncio](https://kevinmccarthy.org/2016/07/25/streaming-subprocess-stdin-and-stdout-with-asyncio-in-python/) - Async stream management patterns
- [Session management in Telegram bots](https://macaron.im/blog/openclaw-telegram-bot-setup) - Path-based routing, session key patterns
- [Claude Code session management guide](https://stevekinney.com/courses/ai-development/claude-code-session-management) - --resume usage, session continuity
- [Python multiprocessing best practices 2026](https://copyprogramming.com/howto/python-python-multiprocessing-process-terminate-code-example) - Process lifecycle, graceful shutdown
### Key Takeaways from Research
1. **Asyncio subprocess requires manual stream management** - Never use `communicate()` for interactive processes, must read stdout/stderr in separate tasks to avoid deadlocks
2. **Claude Code CLI supports programmatic usage** - `--output-format stream-json` + `--input-format stream-json` + `--no-interactive` enables subprocess integration
3. **Session isolation is standard pattern** - Path-based or key-based routing prevents context bleeding across conversations
4. **Idle timeout is essential** - Without cleanup, processes accumulate indefinitely, exhausting resources
5. **State machines make lifecycle explicit** - IDLE → SPAWNING → ACTIVE → PROCESSING → SUSPENDED transitions prevent race conditions and clarify behavior
---
*Architecture research for: Telegram-to-Claude Code Bridge*
*Researched: 2026-02-04*

View file

@ -1,379 +0,0 @@
# Feature Research: Telegram-to-Claude Code Bridge
**Domain:** AI chatbot bridge / Remote code assistant interface
**Researched:** 2026-02-04
**Confidence:** HIGH
## Feature Landscape
### Table Stakes (Users Expect These)
Features users assume exist. Missing these = product feels incomplete.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Basic message send/receive | Core functionality of any chat interface | LOW | Python-telegram-bot or grammY provide this out-of-box |
| Session persistence | Users expect conversations to continue where they left off | MEDIUM | Store session state to disk/DB; must survive bot restarts |
| Command interface | Standard way to control bot behavior (`/help`, `/new`, `/status`) | LOW | Built-in to telegram bot frameworks |
| Typing indicator | Shows bot is processing (expected for AI bots with 10-60s response times) | LOW | Use `sendChatAction` every 5s during processing |
| Error messages | Clear feedback when something goes wrong | LOW | Graceful error handling with user-friendly messages |
| File upload support | Send files/images to Claude for analysis | MEDIUM | Telegram supports up to 50MB files; larger requires self-hosted Bot API |
| File download | Receive files Claude generates (scripts, configs, reports) | MEDIUM | Bot sends files back; organize in user-specific folders |
| Authentication | Only authorized users can access the bot | LOW | User ID whitelist in config (for single-user: just one ID) |
| Multi-message handling | Long responses split intelligently across multiple messages | MEDIUM | Telegram has 4096 char limit; need smart splitting at code block/paragraph boundaries |
### Differentiators (Competitive Advantage)
Features that set the product apart. Not required, but valuable.
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Named session management | Switch between multiple projects/contexts (`/session work`, `/session personal`) | MEDIUM | Session key = user:session_name; list/switch/delete sessions |
| Idle timeout with graceful suspension | Auto-suspend idle sessions to save costs, easy resume with context preserved | MEDIUM | Timer-based monitoring; serialize session state; clear resume UX with `/resume <session>` |
| Smart output modes | Choose verbosity: final answer only / verbose with tool calls / auto-smart truncation | HIGH | Requires parsing Claude Code output stream and making intelligent display decisions |
| Tool call progress notifications | Real-time updates as Claude uses tools ("Reading file X", "Running command Y") | HIGH | Stream parsing + progressive message editing; balance info vs notification spam |
| Cost tracking per session | Show token usage and $ cost for each conversation | MEDIUM | Track input/output tokens; calculate using Anthropic pricing; display in `/stats` |
| Session-specific folders | Each session gets isolated file workspace (~/stuff/sessions/<name>/) | LOW | Create directory per session; pass as cwd to Claude Code |
| Inline keyboard menus | Button-based navigation (session list, quick commands) instead of typing | MEDIUM | Telegram InlineKeyboardMarkup for cleaner UX |
| Voice message support | Send voice, bot transcribes and processes | HIGH | Requires Whisper API or similar; adds complexity but strong UX boost |
| Photo/image analysis | Send photos, Claude analyzes with vision | MEDIUM | Claude supports vision natively; just pass image data |
| Proactive heartbeat | Bot checks in periodically ("Task done?", "Anything broken?") | HIGH | Cron-based with intelligent prompting; OpenClaw-style feature |
| Multi-model routing | Use Haiku for simple tasks, Sonnet for complex, Opus for critical | HIGH | Analyze message complexity; route intelligently; 80% cost savings potential |
| Session export | Export full conversation history as markdown/JSON | LOW | Serialize messages to file, send via Telegram |
| Undo/rollback | Revert to previous message in conversation | HIGH | Requires conversation tree management; complex but powerful |
### Anti-Features (Commonly Requested, Often Problematic)
Features that seem good but create problems.
| Feature | Why Requested | Why Problematic | Alternative |
|---------|---------------|-----------------|-------------|
| Multi-user support (v1) | Seems like natural evolution | Adds auth complexity, resource contention, security surface, and user isolation requirements before core experience is validated | Build single-user first; prove value; then add multi-user with proper tenant isolation |
| Real-time streaming text | Shows AI thinking character-by-character | Telegram message editing has rate limits; causes flickering; annoying for code blocks | Use typing indicator + tool call progress updates + send complete responses |
| Inline bot mode (@mention in any chat) | Convenience of using bot anywhere | Security nightmare (exposes bot to all chats, leaks context); hard to maintain session isolation | Keep bot in dedicated chat; use `/share` to export results elsewhere |
| Voice response (TTS) | "Complete voice assistant" feel | Adds latency, quality issues, limited Telegram voice note support, user often reading anyway | Text-first; voice input OK but output stays text |
| Auto-response to all messages | Bot always active, no explicit commands needed | Burns tokens on noise; user loses control; hard to have side conversations | Require explicit command or @mention; clear when bot is listening |
| Unlimited session history | "Never forget anything" | Memory bloat, context window waste, cost explosion | Implement sliding window (last N messages) + summarization; store full history off-context |
| Advanced NLP for command parsing | "Natural language commands" | Adds unreliability; burns tokens; users prefer explicit commands for tools | Use standard `/command` syntax; save NLP tokens for actual Claude conversations |
| Rich formatting (bold, italic, links) in bot messages | Prettier output | Telegram markdown syntax fragile; breaks on code blocks; debugging nightmare | Use plain text with clear structure; minimal formatting for critical info only |
## Feature Dependencies
```
Authentication (whitelist)
└──requires──> Session Management
├──requires──> Message Handling
│ └──requires──> Claude Code Integration
└──requires──> File Handling
└──requires──> Session Folders
Smart Output Modes
└──requires──> Output Stream Parsing
└──requires──> Message Splitting
Tool Call Progress
└──requires──> Output Stream Parsing
└──requires──> Typing Indicator
Idle Timeout
└──requires──> Session Persistence
└──requires──> Session Management
Cost Tracking
└──requires──> Token Counting
└──requires──> Claude Code Integration
Multi-Model Routing
└──requires──> Message Complexity Analysis
└──enhances──> Cost Tracking
```
### Dependency Notes
- **Session Management is foundational**: Nearly everything depends on solid session management. This must be robust before adding advanced features.
- **Output Stream Parsing enables differentiators**: Many high-value features (smart output modes, tool progress, cost tracking) require parsing Claude Code's output stream. Build this infrastructure early.
- **File Handling is isolated**: Can be built in parallel with core message flow; minimal dependencies.
- **Authentication gates everything**: Single-user whitelist is simplest; must be in place before any other features.
## MVP Definition
### Launch With (v0.1 - Prove Value)
Minimum viable product — what's needed to validate the concept.
- [ ] **User whitelist authentication** — Only owner can use bot (security baseline)
- [ ] **Basic message send/receive** — Chat with Claude Code via Telegram
- [ ] **Session persistence** — Conversations survive bot restarts
- [ ] **Simple session management**`/new`, `/continue`, `/list` commands
- [ ] **Typing indicator** — Shows bot is thinking during long AI responses
- [ ] **File upload** — Send files to Claude (PDFs, screenshots, code)
- [ ] **File download** — Receive files Claude creates
- [ ] **Error handling** — Clear messages when things break
- [ ] **Message splitting** — Long responses broken into readable chunks
- [ ] **Session folders** — Each session has isolated file workspace
**MVP Success Criteria**: Can manage homelab from phone during commute. Can send screenshot of error, Claude analyzes and suggests fix, can review and apply.
### Add After Validation (v0.2-0.5 - Polish Core Experience)
Features to add once core is working and usage patterns emerge.
- [ ] **Named sessions** — Switch between projects (`/session ansible`, `/session docker`)
- [ ] **Idle timeout with suspend/resume** — Save costs on unused sessions
- [ ] **Basic output modes** — Toggle verbose (`/verbose on`) for debugging
- [ ] **Cost tracking** — See token usage per session (`/stats`)
- [ ] **Inline keyboard menus** — Button-based session picker
- [ ] **Session export** — Download conversation as markdown (`/export`)
- [ ] **Image analysis** — Send photos, Claude describes/debugs
**Trigger for adding**: Using bot daily, patterns clear, requesting these features organically.
### Future Consideration (v1.0+ - Differentiating Power Features)
Features to defer until product-market fit is established.
- [ ] **Smart output modes** — AI decides what to show based on context
- [ ] **Tool call progress notifications** — Real-time updates on Claude's actions
- [ ] **Multi-model routing** — Haiku for simple, Sonnet for complex (cost optimization)
- [ ] **Voice message support** — Voice input with Whisper transcription
- [ ] **Proactive heartbeat** — Bot checks in on long-running tasks
- [ ] **Undo/rollback** — Revert conversation to previous state
- [ ] **Multi-user support** — Share bot with team (requires tenant isolation)
**Why defer**: These are complex, require significant engineering, and value unclear until core experience proven. Some (like multi-model routing) need usage data to optimize.
## Feature Prioritization Matrix
| Feature | User Value | Implementation Cost | Priority | Phase |
|---------|------------|---------------------|----------|-------|
| Message send/receive | HIGH | LOW | P1 | MVP |
| Session persistence | HIGH | MEDIUM | P1 | MVP |
| File upload/download | HIGH | MEDIUM | P1 | MVP |
| Typing indicator | HIGH | LOW | P1 | MVP |
| User authentication | HIGH | LOW | P1 | MVP |
| Message splitting | HIGH | MEDIUM | P1 | MVP |
| Error handling | HIGH | LOW | P1 | MVP |
| Session folders | MEDIUM | LOW | P1 | MVP |
| Basic commands | HIGH | LOW | P1 | MVP |
| Named sessions | HIGH | MEDIUM | P2 | Post-MVP |
| Idle timeout | MEDIUM | MEDIUM | P2 | Post-MVP |
| Cost tracking | MEDIUM | MEDIUM | P2 | Post-MVP |
| Inline keyboards | MEDIUM | MEDIUM | P2 | Post-MVP |
| Session export | LOW | LOW | P2 | Post-MVP |
| Image analysis | MEDIUM | MEDIUM | P2 | Post-MVP |
| Smart output modes | HIGH | HIGH | P3 | Future |
| Tool progress | MEDIUM | HIGH | P3 | Future |
| Multi-model routing | HIGH | HIGH | P3 | Future |
| Voice messages | LOW | HIGH | P3 | Future |
| Proactive heartbeat | LOW | HIGH | P3 | Future |
**Priority key:**
- P1: Must have for launch (MVP)
- P2: Should have, add when core working (Post-MVP)
- P3: Nice to have, future consideration (v1.0+)
## Competitor Feature Analysis
| Feature | OpenClaw | claude-code-telegram | Claude-Code-Remote | Our Approach |
|---------|----------|----------------------|--------------------|--------------|
| Session Management | Multi-agent sessions with isolation | Session persistence, project switching | Smart session detection (24h tokens) | Named sessions with manual switch |
| Authentication | Pairing allowlist, mention gating | User ID whitelist + optional token | User ID whitelist | Single-user whitelist (simplest) |
| File Handling | Full file operations | Directory navigation (cd/ls/pwd) | File transfers | Upload to session folders, download results |
| Progress Updates | Proactive heartbeat | Command output shown | Real-time notifications | Tool call progress (stretch goal) |
| Multi-Platform | Telegram, Discord, Slack, WhatsApp, iMessage | Telegram only | Telegram, Email, Discord, LINE | Telegram only (focused) |
| Output Management | Native streaming | Full responses | Smart content handling | Smart truncation + output modes |
| Cost Optimization | Not mentioned | Rate limiting | Cost tracking | Multi-model routing (future) |
| Voice Support | Not mentioned | Not mentioned | Not mentioned | Future consideration |
| Proactive Features | Heartbeat + cron jobs | Not mentioned | Not mentioned | Defer to v1+ |
**Our Differentiation Strategy**:
- **Simpler than OpenClaw**: No multi-platform complexity, focus on Telegram-Claude Code excellence
- **Smarter than claude-code-telegram**: Output modes, cost tracking, idle management (post-MVP)
- **More focused than Claude-Code-Remote**: Single platform, deep integration, better UX
- **Unique angle**: Cost-conscious design with multi-model routing and idle timeout (future)
## Implementation Complexity Assessment
### Low Complexity (1-2 days)
- User whitelist authentication
- Basic message send/receive
- Typing indicator
- Simple command interface
- Error messages
- Session folders
- Session export
### Medium Complexity (3-5 days)
- Session persistence (state serialization)
- File upload/download (Telegram file API)
- Message splitting (intelligent chunking)
- Named session management
- Idle timeout implementation
- Cost tracking
- Inline keyboards
- Image analysis (using Claude vision)
### High Complexity (1-2 weeks)
- Smart output modes (AI-driven truncation)
- Tool call progress parsing
- Multi-model routing (complexity analysis)
- Voice message support (Whisper integration)
- Proactive heartbeat (cron + intelligent prompting)
- Undo/rollback (conversation tree)
## Technical Considerations
### Telegram Bot Framework Options
**python-telegram-bot (Recommended)**
- Mature, well-documented (v21.8 as of 2026)
- ConversationHandler for state management
- Built-in file handling
- Already familiar to user (Python preference noted)
**Alternative: grammY (TypeScript/Node)**
- Used by OpenClaw
- Excellent session plugin
- Not aligned with user's Python preference
**Decision**: Use python-telegram-bot for consistency with existing homelab Python scripts.
### Session Storage Options
**SQLite (Recommended for MVP)**
- Simple, file-based, no server needed
- Built into Python
- Easy to backup (single file)
**Alternative: JSON files**
- Even simpler but no transaction safety
- Good for prototyping, migrate to SQLite quickly
**Decision**: Start with JSON for rapid prototyping, migrate to SQLite by v0.2.
### Claude Code Integration
**Subprocess Approach (Recommended)**
- Spawn `claude-code` CLI as subprocess
- Capture stdout/stderr
- Parse output for tool calls, costs, errors
- Clean isolation, no SDK dependency
**Challenge**: claude-code CLI doesn't expose token counts in output yet. Will need to:
1. Parse prompts/responses to estimate tokens
2. Or wait for CLI feature addition
3. Or use Anthropic API directly (breaks "use Claude Code" requirement)
### File Handling Architecture
```
~/stuff/telegram-sessions/
├── <session_name_1>/
│ ├── uploads/ # User-sent files
│ ├── downloads/ # Claude-generated files
│ └── metadata.json # Session info
└── <session_name_2>/
└── ...
```
Each session gets isolated folder on shared ZFS storage (~/stuff). Pass session folder as cwd to Claude Code.
### Cost Optimization Strategy
**Haiku vs Sonnet Pricing (2026):**
- Haiku 4.5: $1 input / $5 output per MTok
- Sonnet 4.5: $3 input / $15 output per MTok
**Haiku is 1/3 the cost of Sonnet**, performs within 5% on many tasks.
**Polling Pattern (Future Optimization)**:
- Use Haiku for idle checking: "Any new messages? Reply WAIT or process request"
- If WAIT: sleep and poll again (cheap)
- If action needed: Hand off to Sonnet for actual work
- Potential 70-80% cost reduction for always-on bot
**Not MVP**: Requires significant engineering, usage patterns unclear.
## Security & Privacy Notes
**Single-User Design Benefits:**
- No multi-tenant isolation complexity
- No user data privacy concerns (owner = user)
- Simple whitelist auth sufficient
- Can run with full system access (owner trusts self)
**Risks to Mitigate:**
- Telegram token leakage (store in config, never commit)
- User ID spoofing (validate against hardcoded whitelist)
- File upload exploits (validate file types, scan for malware if paranoid)
- Command injection via filenames (sanitize all user input)
**Session Security:**
- Sessions stored on local disk (~/stuff)
- Accessed only by bot user (mikkel)
- No encryption needed (single-user, trusted environment)
## Performance Considerations
**Telegram API Limits:**
- Bot messages: 30/sec across all chats
- Message edits: 1/sec per chat
- File uploads: 50MB default, 2000MB with self-hosted Bot API
**Implications:**
- Typing indicator: Max 1 update per 5-6 seconds (rate limit safe)
- Tool progress: Batch updates, don't spam on every tool call
- File handling: 50MB sufficient for most use cases (PDFs, screenshots, scripts)
**Claude Code Response Times:**
- Simple queries: 2-5 seconds
- Complex with tools: 10-60 seconds
- Very long responses: 60+ seconds
**Implications:**
- Typing indicator critical (users wait 10-60s regularly)
- Consider "Still working..." message at 30s mark
- Tool progress updates help perception of progress
## Sources
**Telegram Bot Features & Best Practices:**
- [Best Telegram Bots in 2026](https://chatimize.com/best-telegram-bots/)
- [Telegram AI Chatbots Best Practices](https://botpress.com/blog/top-telegram-chatbots)
- [Create Telegram Bot 2026](https://evacodes.com/blog/create-telegram-bot)
**Session Management:**
- [OpenClaw Telegram Bot Sessions](https://macaron.im/blog/openclaw-telegram-bot-setup)
- [grammY Session Plugin](https://grammy.dev/plugins/session.html)
- [python-telegram-bot ConversationHandler](https://docs.python-telegram-bot.org/en/v21.8/telegram.ext.conversationhandler.html)
**Claude Code Implementations:**
- [claude-code-telegram GitHub](https://github.com/RichardAtCT/claude-code-telegram)
- [Claude-Code-Remote GitHub](https://github.com/JessyTsui/Claude-Code-Remote)
- [OpenClaw Telegram Docs](https://docs.openclaw.ai/channels/telegram)
**Cost Optimization:**
- [Claude API Pricing 2026](https://www.metacto.com/blogs/anthropic-api-pricing-a-full-breakdown-of-costs-and-integration)
- [Claude API Pricing Guide](https://www.aifreeapi.com/en/posts/claude-api-pricing-per-million-tokens)
- [Anthropic Cost Optimization](https://www.finout.io/blog/anthropic-api-pricing)
**File Handling:**
- [Telegram File Handling](https://grammy.dev/guide/files)
- [Telegram Bot File Upload](https://telegrambots.github.io/book/3/files/upload.html)
**UX & Progress Updates:**
- [AI Assistant Streaming Responses](https://avestalabs.ai/aspire-ai-academy/gen-ai-engineering/streaming-responses)
- [Telegram Typing Indicator](https://community.latenode.com/t/how-can-a-telegram-bot-simulate-a-typing-indicator/5602)
**Timeout & Session Management:**
- [Chatbot Session Timeout Best Practices](https://quidget.ai/blog/ai-automation/chatbot-session-timeout-settings-best-practices/)
- [AI Chatbot Session Management](https://optiblack.com/insights/ai-chatbot-session-management-best-practices)
**Telegram Interface:**
- [Telegram Bot Buttons](https://core.telegram.org/bots/features)
- [Inline Keyboards](https://grammy.dev/plugins/keyboard)
---
*Feature research for: Telegram-to-Claude Code Bridge*
*Researched: 2026-02-04*
*Confidence: HIGH - All findings verified with official documentation and multiple current sources*

View file

@ -1,350 +0,0 @@
# Pitfalls Research
**Domain:** Telegram Bot + Long-Running CLI Subprocess Management
**Researched:** 2026-02-04
**Confidence:** HIGH
## Critical Pitfalls
### Pitfall 1: Asyncio Subprocess PIPE Deadlock
**What goes wrong:**
Using `asyncio.create_subprocess_exec` with `stdout=PIPE` and `stderr=PIPE` causes the subprocess to hang indefinitely when output buffers fill. The parent process awaits `proc.wait()` while the child blocks writing to the full pipe buffer, creating a classic deadlock. This is especially critical with Claude Code CLI which produces continuous streaming output.
**Why it happens:**
OS pipe buffers are finite (typically 64KB on Linux). When the child process generates more output than the buffer can hold, it blocks on write(). If the parent isn't actively draining the pipe via `proc.stdout.read()`, the pipe fills and both processes wait forever - child waits for buffer space, parent waits for process exit.
**How to avoid:**
- Use `asyncio.create_task()` to drain stdout/stderr concurrently while waiting for process
- Or use `proc.communicate()` which handles draining automatically
- Or redirect to files instead: `stdout=open('log.txt', 'w')` to bypass pipe limits
- Never call `proc.wait()` when using PIPE without concurrent reading
**Warning signs:**
- Bot hangs on specific commands that produce verbose output
- Process remains in "S" state (sleeping) indefinitely
- `strace` shows both processes blocked on read/write syscalls
- Works with short output, hangs with verbose Claude responses
**Phase to address:**
Phase 1: Core Subprocess Management - implement proper async draining patterns before any Claude integration.
---
### Pitfall 2: Telegram API Rate Limit Cascade Failures
**What goes wrong:**
When Claude Code generates output faster than Telegram allows sending (30 messages/second, 20/minute in groups), messages queue up. Without proper backpressure handling, the bot triggers `429 Too Many Requests` errors, gets rate-limited for increasing durations (exponential backoff), and eventually the entire message queue fails. Users see partial responses or total silence.
**Why it happens:**
Claude's streaming responses don't know or care about Telegram's rate limits. A single Claude interaction can produce hundreds of lines of output. Naive implementations send each chunk immediately, overwhelming Telegram's API and triggering automatic rate limiting that cascades to ALL bot operations, not just Claude responses.
**How to avoid:**
- Implement message batching: accumulate output for 1-2 seconds before sending
- Use `telegram.ext.Application`'s built-in rate limiter (v20.x+)
- Add exponential backoff with `asyncio.sleep()` on 429 errors
- Track messages/second and throttle proactively before hitting limits
- Consider chunking very long output and offering "download full log" instead
**Warning signs:**
- HTTP 429 errors in logs
- Messages arrive in bursts after long delays
- Bot becomes unresponsive to ALL commands during Claude sessions
- Telegram sends "FloodWait" exceptions with increasing wait times
**Phase to address:**
Phase 2: Telegram Integration - must be solved before exposing Claude streaming output to users.
---
### Pitfall 3: Zombie Process Accumulation
**What goes wrong:**
When the bot crashes, restarts, or processes are killed improperly, Claude Code subprocesses become zombies - still running, consuming resources, but detached from parent. On a 4GB LXC container, a few zombie processes can exhaust memory. After days/weeks, dozens of orphaned Claude processes pile up.
**Why it happens:**
Python's asyncio doesn't automatically clean up child processes on exception or when event loop closes. Calling `proc.kill()` without `await proc.wait()` leaves process in zombie state. systemd restarts don't adopt orphaned children. The Telegram bot's event loop may close while subprocesses are mid-execution.
**How to avoid:**
- Always `await proc.wait()` after termination signals
- Use `try/finally` to ensure cleanup even on exceptions
- Configure systemd `KillMode=control-group` to kill entire process tree on restart
- Implement graceful shutdown handler that waits for all subprocesses
- Use process tracking: maintain dict of active PIDs, verify cleanup on startup
**Warning signs:**
- `ps aux | grep claude` shows processes with different PPIDs or PPID=1
- Memory usage creeps up over days without corresponding active sessions
- Process count increases but active users count doesn't
- `defunct` or `<zombie>` processes in process table
**Phase to address:**
Phase 1: Core Subprocess Management - proper lifecycle management must be foundational.
---
### Pitfall 4: Session State Corruption via Race Conditions
**What goes wrong:**
When a user sends multiple Telegram messages rapidly while Claude is processing, concurrent writes to the session state file corrupt data. Session JSON becomes malformed, context is lost, Claude forgets conversation history mid-interaction. In worst case, file locking fails and two processes write simultaneously, producing invalid JSON that crashes the bot.
**Why it happens:**
Telegram's async handlers run concurrently. Message 1 starts Claude subprocess, Message 2 arrives before Message 1 finishes, both try to update `sessions/{user_id}.json`. Python's file I/O isn't atomic - one write can partially overwrite another. `json.dump()` + `f.write()` is not atomic across asyncio tasks.
**How to avoid:**
- Use `asyncio.Lock` per user: `user_locks[user_id]` ensures serial access to session state
- Or use `filelock` library for cross-process file locking
- Implement atomic writes: write to temp file, then `os.rename()` (atomic on POSIX)
- Queue user messages: new message while Claude active goes to pending queue, processed after current finishes
- Detect corruption: catch `json.JSONDecodeError` on read, backup corrupted file, start fresh session
**Warning signs:**
- `json.JSONDecodeError` in logs
- Users report "bot forgot our conversation"
- Sporadic failures only when users type quickly
- Session files contain partial/mixed JSON from multiple writes
- File size is unexpectedly small (truncation during write)
**Phase to address:**
Phase 3: Session Management - after basic subprocess handling works, before multi-user testing.
---
### Pitfall 5: Claude Code CLI --resume Footgun
**What goes wrong:**
Using `--resume` flag naively to continue sessions seems ideal, but leads to state divergence. The CLI's internal state (transcript, tool outputs, context window) drifts from what the bot thinks happened. Bot displays response A to user, but Claude's transcript shows response B due to regeneration during resume. Messages appear out of order or duplicated.
**Why it happens:**
`--resume` replays the transcript from disk and may regenerate responses if conditions changed (model version updated, non-deterministic sampling). The bot's session state stores "what we showed the user", but Claude's resumed state reflects "what actually happened in the transcript". These diverge over time, especially with tool use where results may differ on replay.
**How to avoid:**
- Avoid `--resume` entirely: start fresh subprocess per interaction, pass conversation history via stdin
- Or implement "resume detection": compare Claude's first message after resume with expected cached response, warn on mismatch
- Or treat --resume as read-only: use it to show transcript to user, but always start fresh for new input
- Store transcript path in session state, verify hash/checksum before resume to detect corruption
**Warning signs:**
- Users see repeated messages they already received
- Bot shows different response than what Claude transcript contains
- Tool use executes twice with different results
- Resume succeeds but conversation context is wrong
**Phase to address:**
Phase 4: Resume/Persistence - only after basic interaction flow is solid, requires deep understanding of transcript format.
---
### Pitfall 6: Idle Timeout Race Condition
**What goes wrong:**
Implementing "kill Claude after N minutes idle" creates a race: user sends message at T+599s, timeout fires at T+600s, both try to access the subprocess. Timeout calls `proc.kill()` while message handler calls `proc.stdin.write()`. Result: `BrokenPipeError`, message lost, user sees error instead of Claude response. In worse case, timeout cleanup runs mid-response, truncating output.
**Why it happens:**
Asyncio's `asyncio.wait_for()` and timeout tasks don't coordinate with message arrival. The timeout coroutine has no knowledge that a new message just started processing. Both coroutines operate on shared subprocess state without synchronization. Telegram's async handlers run immediately on message arrival, possibly overlapping with timeout logic.
**How to avoid:**
- Cancel timeout task BEFORE starting message processing: `timeout_task.cancel()` in message handler
- Use `asyncio.Lock` to prevent timeout cleanup during active message handling
- Implement "last activity" timestamp: timeout checks timestamp and skips cleanup if recent
- Set timeout generously (10min+) to reduce race window
- Log timeout decisions: "Killing process for user X due to idle since Y" helps debug races
**Warning signs:**
- Intermittent `BrokenPipeError` or `ValueError: I/O operation on closed file`
- Happens more often exactly at timeout threshold (e.g., always near 5min mark)
- Users report "bot randomly stops responding" mid-conversation
- Logs show process killed, then immediately new message arrives
- Error rate correlates with idle timeout duration
**Phase to address:**
Phase 5: Idle Management - only add after core interaction loop is bulletproof, requires careful async coordination.
---
### Pitfall 7: Cost Runaway from Failed Haiku Handoff
**What goes wrong:**
The plan is to use Haiku for light tasks, escalate to Opus for complex reasoning. But if escalation logic fails (Haiku doesn't recognize complexity, or handoff mechanism breaks), every request goes to Opus. A user asks 100 simple questions ("what's the weather?") and you burn through $25 in token costs instead of $1. Monthly bill explodes from $50 to $500.
**Why it happens:**
Model routing is fragile: Haiku's job is to decide "do I need Opus?" but it may be too dumb to know when it's too dumb. Complexity heuristics (token count, tool use, keywords) have false negatives. Bugs in handoff code (wrong model parameter, API error) cause fallback to default model (often the expensive one). No budget enforcement means runaway costs go unnoticed until the bill arrives.
**How to avoid:**
- Implement per-user daily/monthly cost caps: track tokens used, reject requests over limit
- Log every model decision: "User X, message Y: using Haiku because Z" for audit trail
- Monitor cost metrics in real-time: alert if hourly spend exceeds threshold
- Start with Haiku-only, add Opus escalation LATER once metrics show handoff works
- Use prompt engineering: system prompt tells Haiku "If you're uncertain, say 'I need help' instead of trying"
- Test escalation logic extensively with edge cases before production
**Warning signs:**
- Anthropic usage dashboard shows 90%+ Opus when expecting 80%+ Haiku
- Daily spend consistently above projected average
- Logs show no/few Haiku->Opus escalation events (suggests routing broken)
- Users report slow responses (Opus is slower) when they expected fast replies
- Cost-per-interaction metric increases over time without feature changes
**Phase to address:**
Phase 6: Cost Optimization - start Haiku-only in Phase 2, defer Opus handoff until usage patterns are understood.
---
## Technical Debt Patterns
Shortcuts that seem reasonable but create long-term problems.
| Shortcut | Immediate Benefit | Long-term Cost | When Acceptable |
|----------|-------------------|----------------|-----------------|
| Using `subprocess.run()` instead of asyncio subprocess | Simpler code, no async complexity | Blocks event loop, bot unresponsive during Claude calls, Telegram timeouts | Never - breaks async bot entirely |
| Storing session state in memory only (no persistence) | Fast, no file I/O, no corruption risk | Sessions lost on restart, can't implement --resume, no audit trail | MVP only - add persistence by Phase 3 |
| Single global Claude subprocess for all users | Simple: one process to manage, no spawn overhead | Security nightmare (cross-user context leak), single point of failure, no isolation | Never - violates basic security |
| No cost tracking, assume Haiku is cheap enough | Faster development, less code | Budget surprises, no visibility into usage patterns, can't optimize | Early testing only - add tracking by Phase 2 GA |
| Sending full stdout line-by-line to Telegram | Simple: `for line in stdout`, looks responsive | Rate limiting, message spam, user annoyance, API costs | Never - batch messages or stream differently |
| Killing process with `SIGKILL` instead of graceful shutdown | Reliable: process always dies immediately | No cleanup, zombie risk, corrupted state, tool operations interrupted | Emergency fallback only - use `SIGTERM` first |
## Integration Gotchas
Common mistakes when connecting to external services.
| Integration | Common Mistake | Correct Approach |
|-------------|----------------|------------------|
| Claude Code CLI | Assuming stdout contains only assistant messages | Parse JSON-lines protocol: distinguish between message types (assistant, tool, control), filter accordingly |
| Claude Code CLI | Using interactive mode (no --stdin) | Always use `--stdin` flag for programmatic control, never rely on terminal interaction |
| Telegram python-telegram-bot | Calling blocking functions in async handlers | Use `asyncio.to_thread()` for sync code, or use async subprocess APIs |
| Telegram API | Assuming message sends succeed | Handle `telegram.error.RetryAfter` (rate limit), `NetworkError` (connectivity), retry with exponential backoff |
| systemd service | Relying on `Type=simple` with asyncio | Use `Type=exec` or `Type=notify` to ensure systemd knows when service is ready, prevents premature "active" status |
| File system (inbox, sessions) | Concurrent read/write without locking | Use `filelock` library or `asyncio.Lock` for critical sections, ensure atomic operations |
## Performance Traps
Patterns that work at small scale but fail as usage grows.
| Trap | Symptoms | Prevention | When It Breaks |
|------|----------|------------|----------------|
| One subprocess per message (spawn overhead) | High CPU during bursts, slow response time | Reuse subprocess across messages in same session, only spawn once per user interaction thread | >10 messages/minute per user |
| Loading full transcript on every message | Increasing latency as conversation grows | Implement transcript pagination, only load recent context + summary | >100 messages per session (~50KB transcript) |
| Synchronous file writes to session state | Bot lag spikes during saves, Telegram timeouts | Use async file I/O (`aiofiles`) or offload to background task | >5 concurrent users writing state |
| Unbounded message queue per user | Memory grows without limit if Claude is slow | Implement queue size limit (e.g., 10 pending messages), reject new messages when full | User sends >20 messages while waiting |
| Regex parsing of Claude output line-by-line | CPU spikes with verbose responses | Parse once per message chunk, not per line; use JSON protocol when possible | Claude outputs >1000 lines |
| Keeping all session objects in memory | Works fine... until OOM | Implement LRU cache with max size, evict inactive sessions after timeout | >50 concurrent sessions on 4GB RAM |
## Security Mistakes
Domain-specific security issues beyond general web security.
| Mistake | Risk | Prevention |
|---------|------|------------|
| Trusting Telegram user_id without verification | Malicious user spoofs authorized user ID via API | Check `authorized_users` file on EVERY command, validate against Telegram's cryptographic signatures |
| Passing user input directly to subprocess args | Command injection: user sends `/ping; rm -rf /` | Strict input validation, use shlex.quote(), never use shell=True |
| Exposing Claude Code's file system access to users | User asks Claude "read /etc/shadow", Claude complies | Implement tool use filtering, whitelist allowed paths, run Claude subprocess in restricted namespace |
| Storing Telegram bot token in code or world-readable file | Token leak allows full bot takeover | Store in `credentials` file with 600 permissions, never commit to git |
| No rate limiting on expensive operations | DoS: user spams bot with Claude requests until OOM/cost limit | Per-user rate limit (e.g., 10 messages/hour), queue depth limit, kill runaway processes |
| Logging sensitive data (messages, API keys) | Log leakage exposes private conversations | Redact message content in logs, only log metadata (user_id, timestamp, status) |
## UX Pitfalls
Common user experience mistakes in this domain.
| Pitfall | User Impact | Better Approach |
|---------|-------------|-----------------|
| No feedback while Claude thinks | User waits in silence, assumes bot is broken | Send "Claude is thinking..." immediately, update with "..." every 5s, show typing indicator |
| Dumping full Claude output as single 4000-char message | Wall of text, hard to read, loses context | Split into logical chunks (by paragraph/section), send as multiple messages with slight delay |
| No way to stop runaway Claude response | User watches helplessly as bot spams hundreds of lines | Implement `/stop` command, show progress "Sending response X/Y", allow cancellation |
| Silent failures | Message disappears into void, no error message | Always confirm receipt: "Got it, processing..." or "Error: rate limit, try again" |
| No context on what Claude knows | User confused why bot remembers/forgets things | Show session state: "Session started 10 min ago, 5 messages" or "New session (use /resume to continue)" |
| Cryptic error messages from Claude subprocess | "Error: exit code 1" means nothing to user | Parse Claude's stderr, translate to user-friendly: "Claude encountered an error: [specific reason]" |
## "Looks Done But Isn't" Checklist
Things that appear complete but are missing critical pieces.
- [ ] **Subprocess cleanup:** Often missing `await proc.wait()` after kill - verify all code paths call wait()
- [ ] **Error handling on Telegram API:** Often missing retry logic on 429/5xx - verify every `await bot.send_*()` has try/except
- [ ] **File locking for session state:** Often missing locks on concurrent read/modify/write - verify atomicity with `filelock` tests
- [ ] **Graceful shutdown:** Often missing SIGTERM handler - verify systemd restart doesn't leave zombies via `ps aux` check
- [ ] **Cost tracking:** Often logs tokens but doesn't enforce limits - verify limit exceeded actually rejects requests
- [ ] **Idle timeout cancellation:** Often sets timeout but forgets to cancel on new activity - test rapid message burst at T+timeout-1s
- [ ] **Output buffering/draining:** Often uses PIPE but forgets to drain - test with verbose Claude output (>100KB)
- [ ] **Model selection logging:** Often switches models but doesn't log decision - verify audit trail shows which model was used and why
## Recovery Strategies
When pitfalls occur despite prevention, how to recover.
| Pitfall | Recovery Cost | Recovery Steps |
|---------|---------------|----------------|
| Deadlocked subprocess | LOW | Detect via timeout on `proc.wait()`, send SIGKILL, cleanup session state, notify user "session crashed, please retry" |
| Zombie process accumulation | LOW | Scan for zombies on startup (`ps -eo pid,ppid,stat,cmd`), kill all matching Claude processes, clear stale session files |
| Corrupted session state | LOW | Catch `json.JSONDecodeError`, backup corrupted file to `sessions/corrupted/{user_id}_{timestamp}.json`, start fresh session |
| Rate limit cascade | MEDIUM | Pause all message processing for backoff duration (from 429 response), queue incoming messages, resume when limit resets |
| Cost runaway | MEDIUM | Detect via Anthropic API usage endpoint, auto-disable bot, send alert, manual review before re-enable |
| State divergence (--resume) | HIGH | Compare expected vs actual transcript hash on resume, reject resume if mismatch, fallback to fresh session with context summary |
| Race condition on timeout | LOW | Log all process lifecycle events, correlate timestamps to identify race, fix with locking, restart affected user sessions |
## Pitfall-to-Phase Mapping
How roadmap phases should address these pitfalls.
| Pitfall | Prevention Phase | Verification |
|---------|------------------|--------------|
| Asyncio subprocess PIPE deadlock | Phase 1: Core Subprocess | Test with synthetic >64KB output, verify no hang |
| Telegram rate limit cascade | Phase 2: Telegram Integration | Stress test: send 100 rapid messages, verify batching/throttling works |
| Zombie process accumulation | Phase 1: Core Subprocess | Kill bot during active Claude call, restart, verify no zombies via `ps aux` |
| Session state corruption | Phase 3: Session Management | Test concurrent message bombardment (10 messages in 1s), verify state integrity |
| Claude Code --resume footgun | Phase 4: Resume/Persistence | Resume session, compare transcript hash, verify no divergence |
| Idle timeout race condition | Phase 5: Idle Management | Send message at T+timeout-1s, verify no BrokenPipeError |
| Cost runaway from failed Haiku handoff | Phase 6: Cost Optimization | Simulate 100 requests, verify model distribution matches expectations (80% Haiku) |
## Sources
**Telegram Bot + Subprocess Management:**
- [Building Robust Telegram Bots](https://henrywithu.com/building-robust-telegram-bots/)
- [Common Mistakes When Building Telegram Bots with Node.js](https://infinitejs.com/posts/common-mistakes-telegram-bots-nodejs/)
- [python-telegram-bot Concurrency Wiki](https://github.com/python-telegram-bot/python-telegram-bot/wiki/Concurrency)
- [GitHub Issue #3887: PTB Hangs with Large Update Volume](https://github.com/python-telegram-bot/python-telegram-bot/issues/3887)
**Asyncio Subprocess Pitfalls:**
- [Python CPython Issue #115787: Deadlock in create_subprocess_exec with Semaphore and PIPE](https://github.com/python/cpython/issues/115787)
- [Python.org Discussion: Details of process.wait() Deadlock](https://discuss.python.org/t/details-of-process-wait-deadlock/69481)
- [Python Official Docs: Asyncio Subprocesses](https://docs.python.org/3/library/asyncio-subprocess.html)
**Zombie Processes:**
- [Python asyncio Issue #281: Zombies with set_event_loop(None)](https://github.com/python/asyncio/issues/281)
- [Python CPython Issue #95899: Runner+PidfdChildWatcher Leaves Zombies](https://github.com/python/cpython/issues/95899)
- [Sling Academy: Python asyncio - How to Stop/Kill a Child Process](https://www.slingacademy.com/article/python-asyncio-how-to-stop-kill-a-child-process/)
**Telegram API Limits:**
- [Telegram Bots FAQ: Rate Limits](https://core.telegram.org/bots/faq)
- [Telegram Limits Reference](https://limits.tginfo.me/en)
- [BigMike.help: Local Telegram Bot API Advantages](https://bigmike.help/en/case/local-telegram-bot-api-advantages-limitations-of-the-standard-api-and-set-eb4a3b/)
**Claude Code CLI Protocol:**
- [Inside the Claude Agent SDK: stdin/stdout Communication](https://buildwithaws.substack.com/p/inside-the-claude-agent-sdk-from)
- [Claude Code CLI Reference](https://code.claude.com/docs/en/cli-reference)
- [Building an MCP Server for Claude Code](https://veelenga.github.io/building-mcp-server-for-claude/)
**systemd Process Management:**
- [systemd Advanced Guide for 2026](https://medium.com/@springmusk/systemd-advanced-guide-for-2026-b2fe79af3e78)
- [Arch Linux Forums: Restart systemd Service Without Killing Children](https://bbs.archlinux.org/viewtopic.php?id=212380)
- [systemd.service Manual](https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html)
**Python Asyncio Memory Leaks:**
- [Python CPython Issue #85865: Memory Leak with asyncio and run_in_executor](https://github.com/python/cpython/issues/85865)
- [Victor Stinner: asyncio WSASend() Memory Leak](https://vstinner.github.io/asyncio-proactor-wsasend-memory-leak.html)
**Race Conditions & Concurrency:**
- [Medium: Avoiding File Conflicts in Multithreaded Python](https://medium.com/@aman.deep291098/avoiding-file-conflicts-in-multithreaded-python-programs-34f2888f4521)
- [Super Fast Python: Multiprocessing Race Conditions](https://superfastpython.com/multiprocessing-race-condition-python/)
- [Python CPython Issue #92824: asyncio.wait_for() Race Conditions](https://github.com/python/cpython/issues/92824)
- [Nicholas: Race Conditions with asyncio in Python](https://nicholaslyz.com/blog/2024/03/22/race-conditions-with-asyncio-in-python/)
**Claude API Cost Optimization:**
- [Claude API Pricing Guide 2026](https://www.aifreeapi.com/en/posts/claude-api-pricing-per-million-tokens)
- [MetaCTO: Anthropic Claude API Pricing 2026](https://www.metacto.com/blogs/anthropic-api-pricing-a-full-breakdown-of-costs-and-integration)
- [Finout: Anthropic API Pricing & Cost Optimization Strategies](https://www.finout.io/blog/anthropic-api-pricing)
- [GitHub Issue #17772: Programmatic Model Switching for Autonomous Agents](https://github.com/anthropics/claude-code/issues/17772)
---
*Pitfalls research for: Telegram-to-Claude Code bridge (brownfield Python bot extension)*
*Researched: 2026-02-04*

View file

@ -1,272 +0,0 @@
# Stack Research
**Domain:** Telegram bot with Claude Code CLI subprocess management
**Researched:** 2026-02-04
**Confidence:** HIGH
## Recommended Stack
### Core Technologies
| Technology | Version | Purpose | Why Recommended |
|------------|---------|---------|-----------------|
| Python | 3.12+ | Runtime environment | Already deployed (3.12.3), excellent asyncio support, required by python-telegram-bot 22.6 (needs 3.10+) |
| python-telegram-bot | 22.6 | Telegram Bot API wrapper | Latest stable (Jan 2026), native async/await, httpx-based (modern), active maintenance, supports Bot API 9.3 |
| asyncio | stdlib | Async/await runtime | Native subprocess management with create_subprocess_exec, non-blocking I/O for multiple concurrent sessions |
| httpx | 0.27-0.28 | HTTP client | Required dependency of python-telegram-bot 22.6, modern async HTTP library |
### Supporting Libraries
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| aiofiles | 25.1.0 | Async file I/O | Reading/writing session files, inbox processing, file uploads without blocking event loop |
| APScheduler | 3.11.2 | Job scheduling | Idle timeout timers, periodic polling checks, session cleanup; AsyncIOScheduler supports native coroutines |
| ptyprocess | 0.7.0 | PTY management | If Claude Code requires interactive terminal (TTY detection); NOT needed if --resume works with pipes |
### Development Tools
| Tool | Purpose | Notes |
|------|---------|-------|
| systemd | Service management | Existing telegram-bot.service, user service with proper delegation |
| Python venv | Dependency isolation | Already deployed at ~/venv, keeps system Python clean |
## Installation
```bash
# Activate existing venv
source ~/venv/bin/activate
# Core dependencies (if not already installed)
pip install python-telegram-bot==22.6
# Supporting libraries
pip install aiofiles==25.1.0
pip install APScheduler==3.11.2
# Optional: PTY support (only if needed for Claude Code)
pip install ptyprocess==0.7.0
```
## Alternatives Considered
| Recommended | Alternative | When to Use Alternative |
|-------------|-------------|-------------------------|
| asyncio subprocess | threading + subprocess.Popen | Never for this use case; asyncio is superior for I/O-bound operations with multiple sessions |
| python-telegram-bot | pyTelegramBotAPI (telebot) | If starting from scratch and wanting simpler API, but python-telegram-bot offers better async integration |
| APScheduler | asyncio.create_task + sleep loop | Simple timeout logic only; APScheduler overkill if just tracking last activity timestamp |
| aiofiles | asyncio thread executor + sync I/O | Small files only; for session logs and file handling, aiofiles cleaner |
| asyncio.create_subprocess_exec | ptyprocess | If Claude Code needs TTY/color output; start with pipes first, add PTY if needed |
## What NOT to Use
| Avoid | Why | Use Instead |
|-------|-----|-------------|
| Batch API for polling | Polling needs instant response, batch has 24hr latency | Real-time API calls with Haiku |
| Synchronous subprocess.Popen | Blocks event loop, kills concurrency | asyncio.create_subprocess_exec |
| Global timeout on subprocess | Claude Code may take variable time per task | Per-session idle timeout tracking |
| telegram.Bot (sync) | python-telegram-bot 20+ is async-first | telegram.ext.Application (async) |
| flask/django for webhooks | Overkill for single-user bot | python-telegram-bot's built-in polling |
## Stack Patterns by Variant
**Session Management Pattern:**
- Use `asyncio.create_subprocess_exec(['claude', '--resume'], cwd=session_path, stdout=PIPE, stderr=PIPE)`
- Set `cwd` to session directory: `~/telegram/sessions/<name>/`
- Claude Code creates `.claude/` in working directory for session state
- Each session isolated by filesystem path
**Idle Timeout Pattern:**
- APScheduler's AsyncIOScheduler with IntervalTrigger checks every 30-60s
- Track `last_activity_time` per session in memory (dict)
- On timeout: call `process.terminate()`, wait for graceful exit, mark session as suspended
- On new message: if suspended, spawn new process with `--resume` in same directory
**Cost-Optimized Polling Pattern:**
- Main polling loop: python-telegram-bot's `run_polling()` with Haiku context
- Haiku evaluates: "Does this need a response?" (simple commands vs conversation)
- If yes: spawn/resume Opus session, pass message, capture output
- If no: handle with built-in command handlers (/status, /pbs, etc.)
**Output Streaming Pattern:**
- `await process.stdout.readline()` in async loop until EOF
- Send incremental Telegram messages for tool-call notifications
- Use `asyncio.Queue` to buffer output between read loop and Telegram send loop
- Avoid deadlock: use `communicate()` for simple cases, `readline()` for streaming
**File Handling Pattern:**
- Telegram bot saves files to `sessions/<name>/files/`
- Claude Code automatically sees files in working directory
- Use aiofiles for async downloads: `async with aiofiles.open(path, 'wb') as f: await f.write(data)`
## Version Compatibility
| Package A | Compatible With | Notes |
|-----------|-----------------|-------|
| python-telegram-bot 22.6 | httpx 0.27-0.28 | Required dependency, auto-installed |
| python-telegram-bot 22.6 | Python 3.10-3.14 | Official support range, tested on 3.12 |
| APScheduler 3.11.2 | asyncio stdlib | AsyncIOScheduler native coroutine support |
| aiofiles 25.1.0 | Python 3.9-3.14 | Thread pool delegation, works with asyncio |
| ptyprocess 0.7.0 | Unix only | LXC container on Linux, no Windows needed |
## Process Management Deep Dive
### Why asyncio.create_subprocess_exec (not shell, not Popen)
**Correct approach:**
```python
process = await asyncio.create_subprocess_exec(
'claude', '--resume',
cwd=session_path,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
stdin=asyncio.subprocess.PIPE
)
```
**Why this over create_subprocess_shell:**
- Direct exec avoids shell injection risks (even with single user, good hygiene)
- More control over arguments and environment
- Slightly faster (no shell intermediary)
**Why this over threading + subprocess.Popen:**
- Non-blocking: multiple Claude sessions can run concurrently
- Event loop integration: natural with python-telegram-bot's async handlers
- Resource efficient: no thread overhead per session
### Claude Code CLI Integration Approach
**Discovery needed:**
1. Test if `claude --resume` works with stdin/stdout pipes (likely yes)
2. If Claude Code detects non-TTY and disables features, try ptyprocess
3. Verify --resume preserves conversation history across process restarts
**Stdin handling:**
- Write prompt to stdin: `process.stdin.write(message.encode() + b'\n')`
- Close stdin to signal end: `process.stdin.close()`
- Or use `communicate()` for simple request-response
**Stdout/stderr handling:**
- Tool calls likely go to stderr (or special markers in stdout)
- Parse output for progress indicators vs final answer
- Buffer partial lines, split on `\n` for structured output
### Session Lifecycle
```
State machine:
IDLE → (message arrives) → SPAWNING → RUNNING → (response sent) → IDLE
(timeout) → SUSPENDED
(new message) → RESUMING → RUNNING
```
**Implementation:**
- IDLE: No process running, session directory exists
- SPAWNING: `await create_subprocess_exec()` in progress
- RUNNING: Process alive, `process.returncode is None`
- SUSPENDED: Process terminated, ready for --resume
- RESUMING: Re-spawning with --resume flag
**Graceful shutdown:**
- Send SIGTERM: `process.terminate()`
- Wait with timeout: `await asyncio.wait_for(process.wait(), timeout=10)`
- Force kill if needed: `process.kill()`
- Claude Code should flush conversation state on SIGTERM
## Haiku Polling Strategy
**Architecture:**
```
[Telegram Message] → [Haiku Triage] → Simple? → [Execute Command]
↓ Complex? ↓
[Spawn Opus Session]
```
**Haiku's role:**
- Read message content
- Classify: command, question, or conversation
- For commands: map to existing handlers (/status → status())
- For conversation: trigger Opus session
**Implementation options:**
**Option A: Anthropic API directly**
- Separate Haiku API call per message
- Lightweight prompt: "Classify this message: [message]. Output: COMMAND, QUESTION, or CHAT"
- Pro: Fast, cheap ($1/MTok input, $5/MTok output)
- Con: Extra API integration beyond Claude Code
**Option B: Haiku via Claude Code CLI**
- `claude --model haiku "Is this a command or conversation: [message]"`
- Pro: Reuses Claude Code setup, consistent interface
- Con: Spawns extra process per triage
**Recommendation: Option A for production, Option B for MVP**
- MVP: Skip Haiku triage, spawn Opus for all messages (simpler)
- Production: Add Haiku API triage once Opus costs become noticeable
**Batch API consideration:**
- NOT suitable for polling: 24hr latency unacceptable
- MAYBE suitable for session cleanup: "Summarize and compress old sessions" overnight
## Resource Constraints (4GB RAM, 4 CPU)
**Memory budget:**
- python-telegram-bot: ~50MB base
- Each Claude Code subprocess: estimate 100-300MB
- Safe concurrent sessions: 3-4 active, 10+ suspended
- File uploads: stream to disk with aiofiles, don't buffer in RAM
**CPU considerations:**
- I/O bound workload (Telegram API, Claude API, disk)
- asyncio perfect fit: single-threaded event loop handles concurrency
- Claude Code subprocess CPU usage unknown: monitor with `process.cpu_percent()`
**Disk constraints:**
- Session directories grow with conversation history
- Periodic cleanup: delete sessions inactive >30 days
- File uploads: cap at 100MB per file (Telegram bot API limit is 50MB)
## Security Considerations
**Single-user simplification:**
- No auth beyond existing Telegram bot authorization
- Session isolation not security boundary (all same Unix user)
- BUT: still isolate by path for organization, not security
**Command injection prevention:**
- Use `create_subprocess_exec()` with argument list (not shell)
- Validate session names: `[a-z0-9_-]+` only
- Don't pass user input directly to shell commands
**File handling:**
- Save files with sanitized names: `timestamp_originalname`
- Check file extensions: allow common types, reject executables
- Limit file size: 100MB hard cap
## Sources
### High Confidence (Official Documentation)
- [python-telegram-bot PyPI](https://pypi.org/project/python-telegram-bot/) — Version 22.6, dependencies
- [python-telegram-bot Documentation](https://docs.python-telegram-bot.org/) — v22.6 API reference
- [Python asyncio Subprocess](https://docs.python.org/3/library/asyncio-subprocess.html) — Official stdlib docs (Feb 2026)
- [aiofiles PyPI](https://pypi.org/project/aiofiles/) — Version 25.1.0
- [APScheduler PyPI](https://pypi.org/project/APScheduler/) — Version 3.11.2
- [ptyprocess PyPI](https://pypi.org/project/ptyprocess/) — Version 0.7.0
- [Claude Code CLI Reference](https://code.claude.com/docs/en/cli-reference) — Official documentation
### Medium Confidence (Verified Community Sources)
- [Async IO in Python: Subprocesses (Medium)](https://medium.com/@kalmlake/async-io-in-python-subprocesses-af2171d1ff31) — Subprocess patterns
- [Better Stack: Timeouts in Python](https://betterstack.com/community/guides/scaling-python/python-timeouts/) — Timeout best practices
- [APScheduler Guide (Better Stack)](https://betterstack.com/community/guides/scaling-python/apscheduler-scheduled-tasks/) — Job scheduling patterns
- [Anthropic API Pricing (Multiple)](https://www.finout.io/blog/anthropic-api-pricing) — Haiku costs, batch API
### Low Confidence (Needs Validation)
- Claude Code --resume behavior with pipes vs PTY — Not documented, needs testing
- Claude Code output format for tool calls — Needs empirical observation
- Claude Code resource usage per session — Unknown, monitor in practice
---
*Stack research for: Telegram Claude Code Bridge*
*Researched: 2026-02-04*

View file

@ -1,220 +0,0 @@
# Project Research Summary
**Project:** Telegram-to-Claude Code Bridge
**Domain:** AI chatbot integration / Long-running subprocess management
**Researched:** 2026-02-04
**Confidence:** HIGH
## Executive Summary
This project extends an existing single-user Telegram bot to spawn and manage Claude Code CLI subprocesses, enabling conversational AI assistance via Telegram with persistent sessions. The core challenge is managing long-running interactive CLI processes through asyncio while avoiding common pitfalls like pipe deadlocks, zombie processes, and rate limiting cascades.
The recommended approach uses Python 3.12+ with python-telegram-bot 22.6 for the Telegram interface, asyncio subprocess management for Claude Code CLI integration, and path-based session routing with isolated filesystem directories. Each session maps to a directory containing metadata, conversation history, and file attachments. The architecture implements a state machine (IDLE → SPAWNING → ACTIVE → PROCESSING → SUSPENDED) with idle timeout monitors to prevent resource exhaustion on the 4GB container.
The critical risks are asyncio subprocess PIPE deadlocks (mitigated by concurrent stdout/stderr draining), zombie process accumulation (mitigated by proper lifecycle management with try/finally cleanup), and Telegram API rate limiting (mitigated by message batching and backpressure handling). Cost optimization through Haiku/Opus model routing should be deferred until core functionality is proven, as routing complexity introduces significant failure modes.
## Key Findings
### Recommended Stack
The stack leverages existing infrastructure (Python 3.12.3, systemd user service) and adds modern async libraries optimized for I/O-bound subprocess management. All dependencies are available in recent stable versions with good asyncio integration.
**Core technologies:**
- **Python 3.12+**: Already deployed, excellent asyncio support, required by python-telegram-bot 22.6
- **python-telegram-bot 22.6**: Latest stable (Jan 2026), native async/await, httpx-based, supports Bot API 9.3
- **asyncio (stdlib)**: Native subprocess management with create_subprocess_exec, non-blocking I/O for concurrent sessions
- **aiofiles 25.1.0**: Async file I/O for session logs and file uploads without blocking event loop
- **APScheduler 3.11.2**: Job scheduling for idle timeout timers, AsyncIOScheduler supports native coroutines
**Key pattern:**
Use `asyncio.create_subprocess_exec(['claude', '--resume'], cwd=session_path, stdout=PIPE, stderr=PIPE)` with separate async reader tasks to avoid deadlocks. Never use `communicate()` for interactive processes. Session isolation achieved through filesystem paths, not security boundaries.
### Expected Features
Research indicates a clear MVP path with features that users expect from chat-based AI assistants, plus differentiators that leverage Claude Code's unique capabilities.
**Must have (table stakes):**
- Basic message send/receive — core functionality
- Session persistence — conversations survive bot restarts
- Typing indicator — expected for 10-60s AI response times
- File upload/download — send files to Claude, receive generated outputs
- Error messages — clear feedback when things break
- Multi-message handling — split long responses at 4096 char Telegram limit
- Authentication — user whitelist (single-user: one ID)
**Should have (competitive):**
- Named session management — switch between projects (/session homelab, /session dev)
- Idle timeout with suspend/resume — auto-suspend after 10min idle, save costs
- Session-specific folders — isolated file workspace per session
- Cost tracking per session — show token usage and $ cost
- Inline keyboard menus — button-based navigation
- Image analysis — send photos, Claude analyzes with vision
**Defer (v2+):**
- Smart output modes — AI decides verbosity based on context (HIGH complexity)
- Tool call progress notifications — real-time updates (HIGH complexity, rate limit risk)
- Multi-model routing — Haiku for simple, Opus for complex (HIGH complexity, cost runaway risk)
- Voice message support — transcription via Whisper (HIGH complexity)
- Multi-user support — requires tenant isolation, auth complexity
### Architecture Approach
The architecture implements a layered design with clear separation of concerns: Telegram event handling, session routing, process lifecycle management, and output formatting. Each layer has a single responsibility and communicates through well-defined interfaces.
**Major components:**
1. **Bot Event Loop** — Receives Telegram updates, dispatches to handlers via python-telegram-bot Application
2. **SessionRouter** — Maps chat_id to session path, creates directories, loads/saves metadata
3. **Session (state machine)** — Owns lifecycle transitions (IDLE → SPAWNING → ACTIVE → PROCESSING → SUSPENDED), tracks last activity
4. **ProcessManager** — Spawns Claude CLI subprocess with asyncio.create_subprocess_exec, manages stdin/stdout/stderr streams with separate reader tasks
5. **StreamParser** — Parses Claude output (assumes stream-json or line-by-line text), accumulates chunks
6. **ResponseFormatter** — Applies Telegram Markdown, splits at 4096 chars, handles code blocks
7. **IdleMonitor** — Background task checks last_activity timestamps every 60s, suspends idle sessions
**Data flow:** Telegram Update → Handler → Router → Session → ProcessManager → Claude stdin. Claude stdout → Reader task → Parser → Formatter → Telegram API. Files saved to sessions/<name>/images/ or files/, logged to conversation.jsonl.
### Critical Pitfalls
Research identified seven major failure modes, prioritized by impact and likelihood.
1. **Asyncio Subprocess PIPE Deadlock** — OS pipe buffers fill (64KB) when Claude produces verbose output, child blocks on write(), parent waits for exit, both hang forever. **Avoidance:** Use asyncio.create_task() to drain stdout/stderr concurrently, never call proc.wait() when using PIPE without concurrent reading.
2. **Telegram API Rate Limit Cascade** — Claude streams output faster than Telegram allows (30 msg/sec), triggers 429 errors, cascades to ALL bot operations. **Avoidance:** Implement message batching (accumulate 1-2s before sending), use python-telegram-bot's built-in rate limiter, add exponential backoff on 429.
3. **Zombie Process Accumulation** — Bot crashes/restarts leave orphaned Claude processes consuming memory, exhaust 4GB container. **Avoidance:** Always await proc.wait() after termination, use try/finally cleanup, configure systemd KillMode=control-group, verify cleanup on startup.
4. **Session State Corruption via Race Conditions** — Concurrent writes to metadata.json corrupt data when user sends rapid messages. **Avoidance:** Use asyncio.Lock per user, atomic writes (write to temp, os.rename()), queue messages (new message while active goes to pending queue).
5. **Idle Timeout Race Condition** — User sends message at T+599s, timeout fires at T+600s, both access subprocess, BrokenPipeError. **Avoidance:** Cancel timeout task BEFORE message processing, use asyncio.Lock, check last_activity timestamp before cleanup.
## Implications for Roadmap
Based on research, the project should be built in 5-6 phases with strict ordering to ensure foundational patterns are correct before adding complexity.
### Phase 1: Session Foundation
**Rationale:** Must establish multi-session filesystem structure and routing BEFORE subprocess complexity. Path-based isolation is foundational — nearly everything depends on solid session management per FEATURES.md dependency analysis.
**Delivers:** Session class with metadata.json schema, SessionRouter with chat_id → session_name mapping, /session <name> command, conversation.jsonl append logging.
**Addresses:** Session persistence (table stakes), named session management (differentiator), session folders (differentiator).
**Avoids:** Session state corruption pitfall by establishing atomic write patterns and locking semantics early.
### Phase 2: Process Management
**Rationale:** Core subprocess integration must be bulletproof before adding Telegram integration. ARCHITECTURE.md build order explicitly sequences StreamParser → ProcessManager → Session integration. Must avoid PIPE deadlock pitfall from day one.
**Delivers:** ProcessManager with asyncio.create_subprocess_exec, separate stdout/stderr reader tasks, graceful shutdown with try/finally cleanup, StreamParser for Claude output.
**Uses:** asyncio stdlib subprocess, proper draining patterns from STACK.md.
**Implements:** ProcessManager and StreamParser components from ARCHITECTURE.md.
**Avoids:** Asyncio PIPE deadlock (#1 critical pitfall) and zombie process accumulation (#3 critical pitfall) through proper lifecycle management.
### Phase 3: Telegram Integration
**Rationale:** With subprocess management working, integrate with Telegram API and handle rate limiting. ARCHITECTURE.md sequences formatter → session integration → file handling.
**Delivers:** TelegramFormatter for message chunking and Markdown, integration with bot handlers, file upload/download to session directories, typing indicator, error messages.
**Addresses:** Message splitting, typing indicator, file upload/download, error handling (all table stakes from FEATURES.md).
**Avoids:** Telegram rate limit cascade (#2 critical pitfall) through message batching and backpressure.
### Phase 4: Idle Management
**Rationale:** Only add idle timeout AFTER core interaction loop is proven. PITFALLS.md explicitly warns "only add after core interaction loop is bulletproof, requires careful async coordination."
**Delivers:** IdleMonitor background task, last_activity tracking, graceful suspend on timeout, transparent resume on next message.
**Addresses:** Idle timeout with suspend/resume (differentiator from FEATURES.md).
**Implements:** IdleMonitor component from ARCHITECTURE.md.
**Avoids:** Idle timeout race condition (#5 critical pitfall) through timeout task cancellation and locking.
### Phase 5: Production Hardening
**Rationale:** Add observability, error recovery, and session cleanup after core features work. ARCHITECTURE.md Phase 5 focuses on error handling, session recovery, monitoring.
**Delivers:** Error handling with retry logic, session recovery on bot restart (scan sessions/, transition ACTIVE → SUSPENDED), /sessions and /session_stats commands, structured logging.
**Addresses:** Operational requirements not captured in feature research.
**Avoids:** Technical debt accumulation by codifying error handling patterns early.
### Phase 6: Cost Optimization (DEFER)
**Rationale:** Multi-model routing (Haiku/Opus) should be deferred until usage patterns are clear. PITFALLS.md identifies cost runaway as critical risk (#7), STACK.md recommends "start Haiku-only in Phase 2, defer Opus handoff until usage patterns understood."
**Delivers:** ModelSelector for command vs conversation classification, Haiku for monitoring commands (/status, /pbs), Opus for conversation, cost tracking and limits.
**Addresses:** Cost tracking (differentiator), multi-model routing (deferred feature).
**Avoids:** Cost runaway from failed Haiku handoff (#7 critical pitfall) by deferring until metrics validate routing logic.
### Phase Ordering Rationale
- **Sessions first, subprocess second:** FEATURES.md dependency graph shows session management is foundational. Path-based routing must work before spawning processes in those paths.
- **Process management isolated from Telegram:** ARCHITECTURE.md build order separates subprocess concerns (Phase 2) from Telegram integration (Phase 3). This allows testing Claude interaction without rate limiting complications.
- **Idle timeout only after core proven:** PITFALLS.md explicitly warns about idle timeout race conditions. Adding timeout logic to unproven interaction loop creates debugging nightmare.
- **Cost optimization last:** PITFALLS.md shows model routing complexity creates failure modes (wrong model, fallback bugs, heuristic failures). Defer until core value proven and usage data available for optimization.
### Research Flags
Phases likely needing deeper research during planning:
- **Phase 2 (Process Management):** Claude Code CLI --resume behavior with pipes vs PTY unknown, output format for tool calls not documented, needs empirical testing.
- **Phase 3 (Telegram Integration):** Message batching strategy needs validation against actual Claude output patterns, chunk split points require experimentation.
Phases with standard patterns (skip research-phase):
- **Phase 1 (Session Foundation):** Filesystem-based session management is well-documented, JSON schema is straightforward.
- **Phase 4 (Idle Management):** APScheduler patterns are standard, timeout logic is proven pattern.
- **Phase 5 (Production Hardening):** Error handling and logging are general Python best practices.
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | All dependencies verified with official PyPI/documentation sources, versions current as of Jan 2026 |
| Features | HIGH | Based on official Telegram Bot documentation and multiple current implementations (OpenClaw, claude-code-telegram, Claude-Code-Remote) |
| Architecture | HIGH | Asyncio subprocess patterns verified with Python official docs, state machine approach proven in OpenClaw session management |
| Pitfalls | HIGH | Deadlock pitfalls documented in Python CPython issues, rate limiting in Telegram official docs, zombie processes in asyncio issue tracker |
**Overall confidence:** HIGH
All four research areas grounded in official documentation and verified with multiple independent sources. Stack versions confirmed via API queries (not training data). Architecture patterns validated against Python stdlib documentation. Pitfalls sourced from actual bug reports and issue trackers, not speculation.
### Gaps to Address
While confidence is high, some areas require empirical validation during implementation:
- **Claude Code CLI output format:** Documentation mentions stream-json support but exact event schema not published. Will need to test `--output-format stream-json` flag and parse actual output to determine message boundaries, tool call markers, and error formats.
- **Claude Code --resume behavior:** Whether --resume preserves context across process restarts with stdin/stdout pipes (vs requiring TTY) is not documented. STACK.md notes "needs testing" for TTY detection. May need ptyprocess library if pipes insufficient.
- **Optimal idle timeout duration:** 10 minutes suggested based on general chatbot patterns, but actual usage may require tuning. Monitor session activity patterns in Phase 4 to optimize.
- **Message batching strategy:** 1-2 second accumulation recommended to avoid rate limits, but optimal batch size depends on Claude response patterns. Phase 3 should experiment with chunk sizes and timing.
- **Resource usage per session:** Claude Code memory footprint estimated at 100-300MB but not verified. Phase 2 should monitor with process.cpu_percent() and adjust concurrent session limits if needed.
## Sources
### Primary (HIGH confidence)
- [python-telegram-bot PyPI](https://pypi.org/project/python-telegram-bot/) — Version 22.6, dependencies, API compatibility
- [Python asyncio subprocess documentation](https://docs.python.org/3/library/asyncio-subprocess.html) — Process class, create_subprocess_exec, deadlock warnings
- [Claude Code CLI Reference](https://code.claude.com/docs/en/cli-reference) — CLI flags, --resume, --output-format, --no-interactive
- [Telegram Bot API Documentation](https://core.telegram.org/bots/api) — Rate limits, message format, file handling
- [APScheduler PyPI](https://pypi.org/project/APScheduler/) — Version 3.11.2, AsyncIOScheduler
- [aiofiles PyPI](https://pypi.org/project/aiofiles/) — Version 25.1.0
### Secondary (MEDIUM confidence)
- [OpenClaw Telegram Bot Setup](https://macaron.im/blog/openclaw-telegram-bot-setup) — Session management patterns
- [claude-code-telegram GitHub](https://github.com/RichardAtCT/claude-code-telegram) — Implementation reference
- [Python CPython Issue #115787](https://github.com/python/cpython/issues/115787) — Subprocess PIPE deadlock details
- [Telegram Bots FAQ: Rate Limits](https://core.telegram.org/bots/faq) — API limits
- [Python asyncio Issue #281](https://github.com/python/asyncio/issues/281) — Zombie process patterns
### Tertiary (LOW confidence, needs validation)
- Claude Code CLI stream-json protocol schema — Not documented officially, requires empirical testing
- Claude Code subprocess resource usage — No published benchmarks, monitor in practice
- Optimal message batch timing for Telegram — Requires experimentation with actual Claude output
---
*Research completed: 2026-02-04*
*Ready for roadmap: yes*

View file

@ -13,7 +13,7 @@ This is the management container (VMID 102) for Mikkel's homelab infrastructure.
- **SSH Keys:** Pre-installed for accessing other containers/VMs
- **User:** mikkel (UID 1000, group georgsen GID 1000)
- **Python venv:** ~/venv (activate with `source ~/venv/bin/activate`)
- **Helper scripts:** ~/bin (pve, npm-api, dns, pbs, beszel, kuma, telegram, updates)
- **Helper scripts:** ~/bin (pve, npm-api, dns, pbs, beszel, kuma, telegram)
- **Git repos:** ~/repos
- **Shared storage:** ~/stuff (ZFS bind mount, shared across containers, SMB accessible)
@ -118,36 +118,6 @@ The `~/bin/kuma` script manages Uptime Kuma monitors:
~/bin/kuma resume <id> # Resume monitor
```
## Stalwart Mail Server
The `~/bin/mail` script manages Stalwart Mail Server (VM 200, 65.108.14.164):
```bash
~/bin/mail list # List all mail accounts
~/bin/mail info <email> # Show account details
~/bin/mail create <email> <password> [name] # Create new mail account
~/bin/mail delete <email> # Delete mail account
~/bin/mail passwd <email> <password> # Change account password
~/bin/mail domains # List configured domains
~/bin/mail status # Show server status/version
```
**Active domain:** datalos.dk
**Admin UI:** https://mail.georgsen.dk
**Webmail:** https://webmail.georgsen.dk (Snappymail on Dockge)
**Credentials:** `~/homelab/stalwart/credentials`
## Service Updates
The `~/bin/updates` script checks for and applies updates across all homelab services:
```bash
~/bin/updates check # Check all services for available updates
~/bin/updates update <name|all> [-y] # Update one or more services
```
**Tracked services:** dragonfly, beszel, uptime-kuma, snappymail, stalwart, dockge, npm, forgejo, dns, pbs
Checks Docker image versions (Dockge + NPM), LXC service binaries (Forgejo, Technitium DNS), and apt packages (PBS) against GitHub/Codeberg releases.
## Telegram Bot
Two-way interactive bot for homelab management and communication with Claude.
@ -155,12 +125,6 @@ Two-way interactive bot for homelab management and communication with Claude.
**Bot:** @georgsen_homelab_bot
**Commands (in Telegram):**
- `/new <name> [persona]` - Create new Claude session
- `/session <name>` - Switch to a session
- `/sessions` - List all sessions with status
- `/model <name>` - Switch model (sonnet/opus/haiku or full ID). Persisted per session.
- `/timeout <minutes>` - Set idle timeout (1-120 min, default 10)
- `/archive <name>` - Archive and remove a session
- `/status` - Quick service overview (ping check)
- `/pbs` - PBS backup status
- `/backups` - Last backup per VM/CT
@ -248,55 +212,6 @@ ssh root@10.5.0.254 'pct exec <vmid> -- setcap cap_net_raw+ep /bin/ping'
Note: Must be re-applied after `iputils-ping` package upgrades.
**Tailscale on LXC containers:**
When setting up Tailscale with `--ssh` on an unprivileged LXC container:
1. Stop the container and add TUN device access to `/etc/pve/lxc/<vmid>.conf` on the PVE host:
```
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
```
2. Start the container, install and enable Tailscale:
```bash
curl -fsSL https://tailscale.com/install.sh | sh
systemctl start tailscaled
tailscale up --ssh
```
3. Move local SSH to port 2222 (Tailscale SSH takes port 22):
```bash
# Update sshd_config
sed -i 's/^#Port 22/Port 2222/' /etc/ssh/sshd_config
# Override ssh.socket (Ubuntu 24.04 uses socket activation)
# ListenStream= clears defaults, then bind explicitly to IPv4
mkdir -p /etc/systemd/system/ssh.socket.d
cat > /etc/systemd/system/ssh.socket.d/override.conf << EOF
[Socket]
ListenStream=
ListenStream=0.0.0.0:2222
EOF
systemctl daemon-reload
systemctl restart ssh.socket ssh.service
```
After setup: local SSH via `ssh -p 2222 user@<ip>`, Tailscale SSH via `ssh user@<hostname>`.
## CRITICAL: Software Versions
**NEVER use version numbers from training data.** Always fetch the latest version dynamically:
```bash
# GitHub releases - get latest tag
curl -s https://api.github.com/repos/OWNER/REPO/releases/latest | jq -r .tag_name
# Or check the project's download page/API
```
Training data is outdated the moment it's created. Hardcoding versions like `v1.27.1` when the latest is `v1.30.0` is unacceptable. Always query the source.
## User Preferences
- Python and Batch for scripting

View file

@ -22,12 +22,6 @@
- [ ] **Build Hoodik Android app** - Hoodik is web-only, create a native Android app for it. Rust backend + Vue frontend, E2E encrypted.
- [ ] **Deploy self-hosted RustDesk server** - Run hbbs+hbbr on core.georgsen.dk for reliable NAT traversal and private relay when connecting from outside LAN. Eliminates dependency on public RustDesk relay servers.
- [ ] **Create dns.services helper script** - API works (credentials in ~/homelab/dns-services/credentials), need to create ~/bin/dns-services helper. Endpoint: `POST /service/{service_id}/dns/{zone_id}/records`. service_id=1389, datalos.dk zone_id=15365.
- [ ] **Add mh.datalos.dk DNS record** - CNAME to core.georgsen.dk (for generic-beregner app on general:3002). NPM proxy already configured (ID 18).
- [ ] **Fix ping on all unprivileged containers** - Run `setcap cap_net_raw+ep /bin/ping` on each container (requires restart or at least root access inside container)
- Containers to fix: 100 (npm), 101 (dockge), 102 (mgmt), 103 (postgresql01), 104 (redis01), 105 (sentry), 107 (pve-scripts-local), 108 (jukebox), 110 (sense), 111 (dev), 112 (dataloes), 114 (forgejo), 115 (dns), 1000 (tailscale)
- Skip: 106 (pbs) - privileged container, 113 (general) - already done

View file

@ -1,2 +0,0 @@
DNS_SERVICES_USER='msgeorgsen@gmail.com'
DNS_SERVICES_PASS='Vy7aWzQeS&pg3Du#MXcKQCi!'

View file

@ -123,11 +123,10 @@ Saved with: `netfilter-persistent save`
| Type | VM (KVM) |
| IP | 65.108.14.164 (dedicated public IP) |
| Bridge | vmbr0 (direct) |
| Software | Stalwart Mail Server 0.15.4 |
| Disk | 32GB |
| Software | Stalwart Mail Server |
| Webmail | Snappymail (via dockge) |
**Active domain:** datalos.dk
**Current domains:** dataloes.dk (building reputation before adding more)
**Planned domains:** georgsen.dk, microsux.dk, dataloes.dk
@ -142,22 +141,19 @@ Saved with: `netfilter-persistent save`
| 100 | npm | 10.5.0.1 | Nginx Proxy Manager | Running |
| 101 | dockge | 10.5.0.10 | Docker Compose Manager | Running |
| 102 | mgmt | 10.5.0.108 | Management/Automation (Claude Code) | Running |
| 103 | postgresql01 | 10.5.0.109 | PostgreSQL (community) | Running |
| 104 | redis01 | 10.5.0.111 | Redis (community) | Running |
| 105 | sentry | 10.5.0.168 | Defense Intelligence System | Running |
| 103 | postgresql01 | DHCP | PostgreSQL (community) | Running |
| 104 | redis01 | DHCP | Redis (community) | Running |
| 105 | sentry | DHCP | Defense Intelligence System | Running |
| 106 | pbs | 10.5.0.6 | Proxmox Backup Server | Running |
| 107 | pve-scripts-local | 10.5.0.110 | Community Scripts Web UI | Running |
| 108 | jukebox | 10.5.0.184 | Music Player (custom project) | Running |
| 107 | pve-scripts-local | DHCP | Community Scripts Web UI | Running |
| 108 | jukebox | DHCP (→10.5.0.184) | Music Player (custom project) | Running |
| 110 | sense.microsux.dk | DHCP | CBD Vendor Locator | Stopped |
| 111 | dev | 10.5.0.153 | Development container | Running |
| 112 | dataloes | 10.5.0.112 | dataloes.dk website | Running |
| 113 | general | 10.5.0.113 | General purpose container | Running |
| 111 | dev | DHCP | Development container | Running |
| 112 | dataloes | 10.5.0.112 | dataloes.dk website | Stopped |
| 113 | general | 10.5.0.113 | Decomissioned | Stopped |
| 114 | forgejo | 10.5.0.14 | Git server (Forgejo) | Running |
| 115 | dns | 10.5.0.2 | DNS server (Technitium) | Running |
| 116 | lisotex | 10.5.0.116 | lisotex.dk website | Running |
| 117 | nexus | 10.5.0.17 | Nexus (Tailscale SSH) | Running |
| 120 | debate-builder | 10.5.0.171 | Debate builder app (KVM) | Running |
| 1000 | tailscale | 10.5.0.134 + 10.9.1.10 | Tailscale relay | Running |
| 1000 | tailscale | 10.5.0.x + 10.9.1.10 | Tailscale relay | Running |
### Container Details
@ -184,24 +180,16 @@ cd /opt/npm && docker compose pull && docker compose up -d
| dockge.georgsen.dk | http://10.5.0.10:5001 | Let's Encrypt |
| git.georgsen.dk | http://10.5.0.14:3000 | Let's Encrypt |
| jukebox.georgsen.dk | http://10.5.0.184:4000 | Let's Encrypt |
| lisotex.dk, *.lisotex.dk | http://10.5.0.116:3000 | Pending |
| lisoflex.lisotex.dk | http://10.5.0.116:4000 | Pending |
| lisotex.datalos.dk | http://10.5.0.116:3000 | Pending |
| pbs.georgsen.dk | https://10.5.0.6:8007 | Let's Encrypt |
| status.georgsen.dk | http://10.5.0.10:3001 | Let's Encrypt |
| webmail.georgsen.dk | http://10.5.0.10:8888 | Let's Encrypt |
| dashboard.georgsen.dk | http://10.5.0.10:8090 | Let's Encrypt |
| obsidian.georgsen.dk | http://10.5.0.10:8280 | Let's Encrypt |
| obs.georgsen.dk | http://10.5.0.10:8280 | Let's Encrypt |
| obsidian-sync.georgsen.dk | http://10.5.0.10:5984 | Let's Encrypt |
#### 101: Dockge
- **Purpose:** Docker Compose stack management
- **IP:** 10.5.0.10
- **Port:** 5001
- **LXC extras:** `lxc.prlimit.memlock: unlimited` (required for DragonflyDB ulimits in unprivileged container)
- **SSH:** root key installed for mgmt (102) access
**Running Stacks:**
```yaml
@ -237,43 +225,6 @@ services:
- 8090:8090
volumes:
- ./data:/beszel_data
# DragonflyDB (in-memory datastore, Redis-compatible)
services:
dragonfly:
image: docker.dragonflydb.io/dragonflydb/dragonfly:latest
container_name: dragonfly
restart: unless-stopped
ports:
- 6379:6379
volumes:
- ./data:/data
ulimits:
memlock: -1
command: ["--requirepass", "nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI"]
# Password: nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI
# Connect: redis-cli -h 10.5.0.10 -p 6379 -a 'nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI'
# Obsidian (web-based editor + LiveSync)
services:
obsidian:
image: lscr.io/linuxserver/obsidian:latest
container_name: obsidian
ports:
- 8280:3000
- 8281:3001
volumes:
- ./obsidian-config:/config
couchdb:
image: couchdb:latest
container_name: couchdb-livesync
ports:
- 5984:5984
volumes:
- ./couchdb-data:/opt/couchdb/data
- ./couchdb-etc:/opt/couchdb/etc/local.d
# CouchDB credentials: ~/homelab/obsidian/credentials
# LiveSync database: obsidian-livesync
```
#### 105: Sentry (Defense Intelligence)
@ -450,7 +401,6 @@ Requires=mnt-synology.mount
| xanderryzen | 100.71.118.78 | |
| nvr01 | 100.118.17.103 | Exit node |
| tailscalemg | 100.115.101.65 | Exit node |
| nexus | 100.126.46.74 | |
**Tailscale config:** SSH enabled on all devices where possible
@ -484,20 +434,7 @@ Requires=mnt-synology.mount
| dockge | 10.5.0.10 |
| forgejo | 10.5.0.14 |
| git | 10.5.0.14 |
| nexus | 10.5.0.17 |
| mgmt | 10.5.0.108 |
| postgresql01 | 10.5.0.109 |
| pve-scripts | 10.5.0.110 |
| redis01 | 10.5.0.111 |
| lisotex | 10.5.0.116 |
| tailscale | 10.5.0.134 |
| dev | 10.5.0.153 |
| sentry | 10.5.0.168 |
| debate-builder | 10.5.0.171 |
| jukebox | 10.5.0.184 |
| obsidian | 10.5.0.10 |
| obs | 10.5.0.10 |
| obsidian-sync | 10.5.0.10 |
---
@ -551,23 +488,6 @@ chown -R mikkel:georgsen /home/mikkel/.ssh
setcap cap_net_raw+ep /bin/ping
```
### Tailscale in LXC Containers
Unprivileged LXC containers need TUN device access for Tailscale. Add to the container config on the PVE host (`/etc/pve/lxc/<vmid>.conf`):
```
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
```
Container must be stopped before adding these lines. Then inside the container:
```bash
curl -fsSL https://tailscale.com/install.sh | sh
systemctl start tailscaled
tailscale up --ssh
```
---
## Projects
@ -608,7 +528,7 @@ Personal company website
| ID | Name | IPs | Applied To |
|----|------|-----|------------|
| 1 | home_only | 83.89.248.247 | dns.georgsen.dk, dockge.georgsen.dk, pbs.georgsen.dk, obsidian.georgsen.dk, obs.georgsen.dk, obsidian-sync.georgsen.dk |
| 1 | home_only | 83.89.248.247 | dns.georgsen.dk, dockge.georgsen.dk, pbs.georgsen.dk |
### Fail2ban
@ -640,7 +560,9 @@ Personal company website
```
2. **Containers to evaluate:**
- 110 (sense.microsux.dk) - Stopped, consider consolidating
- 110 (sense.microsux.dk) - Consider consolidating
- 112 (dataloes) - Stopped
- 113 (general) - Decomissioned, can remove
3. **DHCP vs Static IPs:**
- Containers .112 and .113 have static IPs inside DHCP range (100-200)
@ -664,8 +586,6 @@ Personal company website
| Webmail | https://webmail.georgsen.dk |
| JukeBox | https://jukebox.georgsen.dk |
| Dashboard | https://dashboard.georgsen.dk or http://10.5.0.10:8090 |
| Obsidian | https://obsidian.georgsen.dk or http://10.5.0.10:8280 |
| Obsidian Sync | https://obsidian-sync.georgsen.dk or http://10.5.0.10:5984 |
### Important IPs
@ -678,13 +598,6 @@ Personal company website
| PBS | 10.5.0.6 |
| Dockge | 10.5.0.10 |
| Forgejo | 10.5.0.14 |
| mgmt | 10.5.0.108 |
| PostgreSQL | 10.5.0.109 |
| redis01 | 10.5.0.111 |
| lisotex | 10.5.0.116 |
| dev | 10.5.0.153 |
| sentry | 10.5.0.168 |
| jukebox | 10.5.0.184 |
| Synology (Tailscale) | 100.105.26.130 |
| PBS (Tailscale) | 100.115.85.120 |
@ -723,13 +636,6 @@ Personal company website
- **Config:** ~/homelab/npm/npm-api.conf (symlinked)
- **Helper:** ~/bin/npm-api (--host-list, --host-create, --host-delete, --cert-list)
### DragonflyDB (from mgmt container)
- **Host:** 10.5.0.10:6379 (Docker in Dockge)
- **Protocol:** Redis-compatible (use redis-cli or any Redis client library)
- **Password:** `nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI`
- **Connect:** `redis-cli -h 10.5.0.10 -p 6379 -a 'nUq/IfoIQJf/kouckKHRQOk7vV0NwCuI'`
### DNS API (from mgmt container)
- **Config:** ~/homelab/dns/credentials (symlinked to ~/.config/dns)
@ -743,54 +649,4 @@ Personal company website
---
## Incident Log
### 2026-01-12: Hetzner MAC Address Warning (Incident)
**Ticket:** #2760303
**Received:** 2026-01-12
**Investigated:** 2026-01-22
**Issue:** Hetzner detected unallowed MAC addresses on the WAN interface (vmbr0).
**Unallowed MACs:**
- `bc:24:11:0f:6b:7c`
- `bc:24:11:74:1c:72`
**Allowed MACs:**
- `a8:a1:59:8e:72:c3` (physical NIC enp9s0)
- `00:50:56:00:04:21` (VM 200 mail server)
**Investigation:**
- All current LXC containers are on vmbr1 (internal), not vmbr0
- The flagged MACs follow Proxmox LXC naming convention (`bc:24:11`) but don't match any current container
- No `bc:24:11` MACs visible on enp9s0 in live packet capture
- Mail VM (200) has correct MAC, no Docker installed
- DNAT/MASQUERADE properly isolates internal traffic
**Root cause:** Unknown. Likely from deleted containers during infrastructure rebuild, or brief misconfiguration during setup.
**Resolution:** Current configuration verified correct. Response sent to Hetzner explaining setup and that flagged MACs are not recognized.
---
### 2026-01-13: BSI Portmapper Warning (Incident)
**Source:** German Federal Office for Information Security (BSI) via Hetzner
**Issue:** Port 111 (portmapper/rpcbind) was accessible from the internet, potentially usable for DDoS reflection attacks.
**Scan timestamp:** 2026-01-13 01:37:40 UTC
**Timeline:**
- 2026-01-11: Firewall rules file created
- 2026-01-13 01:37:40: BSI scan detected open port 111
- 2026-01-14 14:58:07: Firewall rules properly configured and saved
**Resolution:** Port 111 is now blocked on vmbr0 (home IP whitelisted). The scan occurred before the fix was applied. No further action needed - future scans should show port as closed.
**Current status:** Verified blocked via iptables rules (76 UDP, 462 TCP packets dropped as of 2026-01-22).
---
*Last updated: 2026-02-12*
*Last updated: 2026-01-14*

View file

@ -1,7 +0,0 @@
CouchDB LiveSync Credentials
=============================
URL: http://10.5.0.10:5984
Public URL: https://obsidian-sync.georgsen.dk
Database: obsidian-livesync
Username: obsidian
Password: nmJdWsRCPY49lPWl4NVKuKeF

@ -1 +0,0 @@
Subproject commit 96dc1eb4994ef12ac538782f0da1aa736d7dfb27

View file

@ -1,3 +0,0 @@
STALWART_URL=https://mail.georgsen.dk
STALWART_ADMIN_USER=admin
STALWART_ADMIN_PASS=NfDB1p7rxqVGH8nPTPmK

File diff suppressed because it is too large Load diff

View file

@ -1,561 +0,0 @@
"""
Claude Code subprocess management for Telegram bot.
This module provides the ClaudeSubprocess class that maintains a persistent
Claude Code CLI subprocess using stream-json I/O for efficient multi-turn
conversations without respawning per turn.
The persistent subprocess accepts NDJSON messages on stdin, emits stream-json
events on stdout, and maintains conversation context across turns. Stdout and
stderr are read concurrently via asyncio.gather to prevent pipe buffer deadlocks.
Key features:
- Persistent subprocess with stream-json stdin/stdout (eliminates ~1s spawn overhead)
- Concurrent stdout/stderr reading (no pipe deadlock)
- Clean process termination (no zombie processes)
- Message queuing during processing
- Automatic crash recovery with --continue flag
- Stream-json event routing to callbacks (including tool_use events)
Based on research in: .planning/phases/02-telegram-integration/02-RESEARCH.md
"""
import asyncio
import inspect
import json
import logging
import os
import time
from pathlib import Path
from typing import Callable, Optional
logger = logging.getLogger(__name__)
class ClaudeSubprocess:
"""
Manages a persistent Claude Code CLI subprocess with stream-json I/O.
Spawns Claude Code once and maintains the process across multiple turns,
accepting NDJSON messages on stdin. This eliminates ~1s spawn overhead per
message and preserves conversation context.
Example:
sub = ClaudeSubprocess(
session_dir=Path("/home/mikkel/telegram/sessions/my-session"),
persona={"system_prompt": "You are a helpful assistant"},
on_output=lambda text: print(f"Claude: {text}"),
on_error=lambda err: print(f"Error: {err}"),
on_complete=lambda: print("Turn complete"),
on_status=lambda status: print(f"Status: {status}"),
on_tool_use=lambda name, inp: print(f"Tool: {name} {inp}")
)
await sub.start()
await sub.send_message("Hello, Claude!")
await sub.send_message("What's the weather?")
# ... later ...
await sub.terminate()
"""
MAX_CRASH_RETRIES = 3
CRASH_BACKOFF_SECONDS = 1
def __init__(
self,
session_dir: Path,
persona: Optional[dict] = None,
on_output: Optional[Callable[[str], None]] = None,
on_error: Optional[Callable[[str], None]] = None,
on_complete: Optional[Callable[[], None]] = None,
on_status: Optional[Callable[[str], None]] = None,
on_tool_use: Optional[Callable[[str, dict], None]] = None,
):
"""
Initialize ClaudeSubprocess.
Args:
session_dir: Path to session directory (cwd for subprocess)
persona: Persona dict with system_prompt and settings (model, max_turns)
on_output: Callback(text: str) for assistant text output
on_error: Callback(error: str) for error messages
on_complete: Callback() when a turn completes
on_status: Callback(status: str) for status updates (e.g. "Claude restarted")
on_tool_use: Callback(tool_name: str, tool_input: dict) for tool call progress
"""
self._session_dir = Path(session_dir)
self._persona = persona or {}
self.on_output = on_output
self.on_error = on_error
self.on_complete = on_complete
self.on_status = on_status
self.on_tool_use = on_tool_use
# Process state
self._process: Optional[asyncio.subprocess.Process] = None
self._busy = False
self._message_queue: asyncio.Queue = asyncio.Queue()
self._stdout_reader_task: Optional[asyncio.Task] = None
self._stderr_reader_task: Optional[asyncio.Task] = None
self._crash_count = 0
logger.debug(
f"ClaudeSubprocess initialized: session_dir={session_dir}, "
f"persona={persona.get('system_prompt', 'none')[:50] if persona else 'none'}"
)
@property
def is_busy(self) -> bool:
"""Return whether subprocess is currently processing a message."""
return self._busy
@property
def is_alive(self) -> bool:
"""Return whether subprocess process is running."""
return self._process is not None and self._process.returncode is None
@property
def pid(self) -> Optional[int]:
"""
Return process ID of running subprocess.
Returns:
PID if process is running, None otherwise
"""
return self._process.pid if self._process and self._process.returncode is None else None
async def start(self) -> None:
"""
Start the persistent Claude Code subprocess.
Spawns process with stream-json I/O and launches background readers.
Must be called before send_message().
"""
if self.is_alive:
logger.warning("Subprocess already running")
return
self._spawn_time = time.monotonic()
self._first_output_time = None
# Build command for persistent process
cmd = [
"claude",
"-p",
"--input-format", "stream-json",
"--output-format", "stream-json",
"--verbose",
"--dangerously-skip-permissions",
]
# Add --continue if prior session exists
if (self._session_dir / ".claude").exists():
cmd.append("--continue")
logger.debug("Using --continue flag (found existing .claude/ directory)")
# Add persona settings (model FIRST, then system prompt)
if self._persona:
settings = self._persona.get("settings", {})
if "model" in settings:
cmd.extend(["--model", settings["model"]])
if "max_turns" in settings:
cmd.extend(["--max-turns", str(settings["max_turns"])])
if "system_prompt" in self._persona:
cmd.extend(["--append-system-prompt", self._persona["system_prompt"]])
# Prepare environment
env = os.environ.copy()
# Ensure PATH includes ~/.local/bin and ~/bin (for claude CLI)
path_parts = env.get("PATH", "").split(":")
home_bin = str(Path.home() / "bin")
local_bin = str(Path.home() / ".local" / "bin")
if home_bin not in path_parts:
path_parts.insert(0, home_bin)
if local_bin not in path_parts:
path_parts.insert(0, local_bin)
env["PATH"] = ":".join(path_parts)
# Ensure session directory exists
self._session_dir.mkdir(parents=True, exist_ok=True)
# Log full command
logger.info(
f"[TIMING] Starting persistent subprocess: cwd={self._session_dir.name}, "
f"cmd={' '.join(cmd)}"
)
try:
# Spawn subprocess (10MB stdout limit for large stream-json lines e.g. image tool results)
self._process = await asyncio.create_subprocess_exec(
*cmd,
stdin=asyncio.subprocess.PIPE,
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
cwd=str(self._session_dir),
env=env,
limit=10 * 1024 * 1024,
)
elapsed = time.monotonic() - self._spawn_time
logger.info(f"[TIMING] Persistent subprocess started: PID={self._process.pid} (+{elapsed:.3f}s)")
# Launch background readers (persistent, not per-turn)
self._stdout_reader_task = asyncio.create_task(self._read_stdout())
self._stderr_reader_task = asyncio.create_task(self._read_stderr())
except Exception as e:
logger.error(f"Failed to spawn Claude Code: {e}")
if self.on_error:
self.on_error(f"Failed to start Claude: {e}")
async def send_message(self, message: str) -> None:
"""
Send a message to the persistent Claude Code subprocess.
Writes NDJSON message to stdin. If process not started, starts it first.
If busy, queues the message.
Args:
message: Message to send to Claude
"""
# Auto-start if not running
if not self.is_alive:
logger.debug("Process not running, starting it first")
await self.start()
if not self._process or not self._process.stdin:
raise RuntimeError("Subprocess not running or stdin not available")
# Queue if busy
if self._busy:
logger.debug(f"Queueing message (process busy): {message[:50]}...")
await self._message_queue.put(message)
return
# Mark as busy
self._busy = True
self._send_time = time.monotonic()
# Write NDJSON to stdin
try:
msg_dict = {"type": "user", "message": {"role": "user", "content": message}}
ndjson_line = json.dumps(msg_dict) + '\n'
self._process.stdin.write(ndjson_line.encode())
await self._process.stdin.drain() # CRITICAL: flush buffer
logger.debug(f"Sent message to stdin: {message[:60]}...")
except Exception as e:
logger.error(f"Failed to send message to stdin: {e}")
self._busy = False
if self.on_error:
self.on_error(f"Failed to send message: {e}")
async def _read_stdout(self) -> None:
"""
Read stdout stream-json events persistently.
Runs for the lifetime of the process, not per-turn. Exits only when
process dies (readline returns empty bytes).
"""
if not self._process or not self._process.stdout:
return
first_line = True
try:
while True:
line = await self._process.stdout.readline()
if not line:
# Process died
logger.warning("Stdout stream ended (process died)")
break
line_str = line.decode().rstrip()
if line_str:
if first_line:
elapsed = time.monotonic() - self._spawn_time
logger.info(f"[TIMING] First stdout line: +{elapsed:.3f}s")
first_line = False
await self._handle_stdout_line(line_str)
except Exception as e:
logger.error(f"Error reading stdout: {e}")
finally:
# Process died unexpectedly - trigger crash recovery
if self.is_alive and self._process.returncode is None:
# Process still running but stdout closed - unusual
logger.error("Stdout closed but process still running")
elif self._process and self._process.returncode is not None and self._process.returncode > 0:
# Process crashed
await self._handle_crash()
async def _read_stderr(self) -> None:
"""
Read stderr persistently.
Runs for the lifetime of the process. Exits only when process dies.
"""
if not self._process or not self._process.stderr:
return
try:
while True:
line = await self._process.stderr.readline()
if not line:
# Process died
break
line_str = line.decode().rstrip()
if line_str:
self._handle_stderr_line(line_str)
except Exception as e:
logger.error(f"Error reading stderr: {e}")
async def _handle_stdout_line(self, line: str) -> None:
"""
Parse and route stream-json events from stdout.
Handles event types:
- "assistant": Extract text blocks and call on_output
- "content_block_start": Extract tool_use events and call on_tool_use
- "content_block_delta": Handle tool input streaming (future)
- "result": Turn complete, mark not busy, call on_complete
- "system": System events, log and check for errors
Args:
line: Single line of stream-json output
"""
try:
event = json.loads(line)
event_type = event.get("type")
if event_type == "assistant":
# Extract text from assistant message
message = event.get("message", {})
model = message.get("model", "unknown")
logger.info(f"Assistant response model: {model}")
content = message.get("content", [])
for block in content:
if block.get("type") == "text":
text = block.get("text", "")
if text and self.on_output:
if not self._first_output_time:
self._first_output_time = time.monotonic()
elapsed = self._first_output_time - self._spawn_time
logger.info(f"[TIMING] First assistant text: +{elapsed:.3f}s ({len(text)} chars)")
try:
if inspect.iscoroutinefunction(self.on_output):
asyncio.create_task(self.on_output(text))
else:
self.on_output(text)
except Exception as e:
logger.error(f"Error in on_output callback: {e}")
elif event_type == "content_block_start":
# Check for tool_use block
content_block = event.get("content_block", {})
if content_block.get("type") == "tool_use":
tool_name = content_block.get("name", "unknown")
tool_input = content_block.get("input", {})
logger.debug(f"Tool use started: {tool_name} with input {tool_input}")
if self.on_tool_use:
try:
if inspect.iscoroutinefunction(self.on_tool_use):
asyncio.create_task(self.on_tool_use(tool_name, tool_input))
else:
self.on_tool_use(tool_name, tool_input)
except Exception as e:
logger.error(f"Error in on_tool_use callback: {e}")
elif event_type == "content_block_delta":
# Tool input streaming (not needed for initial implementation)
pass
elif event_type == "result":
# Turn complete - mark not busy and call on_complete
if hasattr(self, '_send_time'):
elapsed = time.monotonic() - self._send_time
else:
elapsed = time.monotonic() - self._spawn_time
session_id = event.get("session_id")
logger.info(f"[TIMING] Result event: +{elapsed:.3f}s (session={session_id})")
# Check for error
if event.get("is_error") and self.on_error:
error_msg = event.get("error", "Unknown error")
try:
if inspect.iscoroutinefunction(self.on_error):
asyncio.create_task(self.on_error(f"Claude error: {error_msg}"))
else:
self.on_error(f"Claude error: {error_msg}")
except Exception as e:
logger.error(f"Error in on_error callback: {e}")
# Mark not busy
self._busy = False
# Call completion callback
if self.on_complete:
try:
if inspect.iscoroutinefunction(self.on_complete):
asyncio.create_task(self.on_complete())
else:
self.on_complete()
except Exception as e:
logger.error(f"Error in on_complete callback: {e}")
# Process queued messages
if not self._message_queue.empty():
next_message = await self._message_queue.get()
logger.debug(f"Processing queued message: {next_message[:50]}...")
await self.send_message(next_message)
elif event_type == "system":
# System event
subtype = event.get("subtype")
logger.debug(f"System event: subtype={subtype}")
# Check for error subtype
if subtype == "error" and self.on_error:
error_msg = event.get("message", "System error")
try:
if inspect.iscoroutinefunction(self.on_error):
asyncio.create_task(self.on_error(f"System error: {error_msg}"))
else:
self.on_error(f"System error: {error_msg}")
except Exception as e:
logger.error(f"Error in on_error callback: {e}")
else:
logger.debug(f"Unknown event type: {event_type}")
except json.JSONDecodeError:
# Non-JSON line (Claude Code may emit diagnostics)
logger.warning(f"Non-JSON stdout line: {line[:100]}")
def _handle_stderr_line(self, line: str) -> None:
"""
Handle stderr output from Claude Code.
Logs as warning. If line contains "error" (case-insensitive),
also calls on_error callback.
Args:
line: Single line of stderr output
"""
logger.warning(f"Claude Code stderr: {line}")
# If line contains error, notify via callback
if "error" in line.lower() and self.on_error:
try:
if inspect.iscoroutinefunction(self.on_error):
asyncio.create_task(self.on_error(f"Claude stderr: {line}"))
else:
self.on_error(f"Claude stderr: {line}")
except Exception as e:
logger.error(f"Error in on_error callback: {e}")
async def _handle_crash(self) -> None:
"""
Handle Claude Code crash (non-zero exit).
Attempts to restart persistent process with --continue flag up to
MAX_CRASH_RETRIES times. Notifies user via on_status callback.
"""
self._crash_count += 1
logger.error(
f"Claude Code crashed (attempt {self._crash_count}/{self.MAX_CRASH_RETRIES})"
)
if self._crash_count >= self.MAX_CRASH_RETRIES:
error_msg = f"Claude failed to restart after {self.MAX_CRASH_RETRIES} attempts"
logger.error(error_msg)
if self.on_error:
if inspect.iscoroutinefunction(self.on_error):
asyncio.create_task(self.on_error(error_msg))
else:
self.on_error(error_msg)
self._crash_count = 0 # Reset for next session
self._busy = False
return
# Notify user
if self.on_status:
try:
if inspect.iscoroutinefunction(self.on_status):
asyncio.create_task(self.on_status("Claude crashed, restarting with context preserved..."))
else:
self.on_status("Claude crashed, restarting with context preserved...")
except Exception as e:
logger.error(f"Error in on_status callback: {e}")
# Wait before retrying
await asyncio.sleep(self.CRASH_BACKOFF_SECONDS)
# Restart persistent process
try:
await self.start()
# Resend queued messages if any
if not self._message_queue.empty():
next_message = await self._message_queue.get()
logger.debug(f"Resending queued message after crash: {next_message[:50]}...")
await self.send_message(next_message)
except Exception as e:
logger.error(f"Failed to restart after crash: {e}")
if self.on_error:
self.on_error(f"Failed to restart Claude: {e}")
self._busy = False
async def terminate(self, timeout: int = 10) -> None:
"""
Terminate persistent subprocess gracefully.
Closes stdin, sends SIGTERM, waits for clean exit with timeout, then
SIGKILL if needed. Always reaps process to prevent zombies.
Args:
timeout: Seconds to wait for graceful termination before SIGKILL
"""
if not self._process or self._process.returncode is not None:
logger.debug("No process to terminate or already terminated")
return
logger.info(f"Terminating Claude Code process: PID={self._process.pid}")
try:
# Close stdin to signal end of input
if self._process.stdin:
self._process.stdin.close()
await self._process.stdin.wait_closed()
# Send SIGTERM
self._process.terminate()
# Wait for graceful exit with timeout
try:
await asyncio.wait_for(self._process.wait(), timeout=timeout)
logger.info("Claude Code terminated gracefully")
except asyncio.TimeoutError:
# Timeout - force kill
logger.warning(
f"Claude Code did not terminate within {timeout}s, sending SIGKILL"
)
self._process.kill()
await self._process.wait() # CRITICAL: Always reap to prevent zombie
logger.info("Claude Code killed")
except Exception as e:
logger.error(f"Error terminating process: {e}")
finally:
# Clear process reference
self._process = None
self._busy = False
# Cancel reader tasks if still running
for task in [self._stdout_reader_task, self._stderr_reader_task]:
if task and not task.done():
task.cancel()
try:
await task
except asyncio.CancelledError:
pass

View file

@ -1,147 +0,0 @@
"""
Session idle timer management for Claude Code Telegram bot.
This module provides the SessionIdleTimer class that manages per-session idle
timeouts using asyncio. Each session has its own timer that fires a callback
after a configurable timeout period. Timers reset on activity and can be
cancelled on session shutdown/archive.
Example:
async def handle_timeout(session_name: str):
print(f"Session {session_name} timed out")
timer = SessionIdleTimer("my-session", timeout_seconds=600, on_timeout=handle_timeout)
timer.reset() # Start the timer
# ... activity occurs ...
timer.reset() # Reset timer on activity
# ... no activity for 600 seconds ...
# handle_timeout("my-session") called automatically
"""
import asyncio
import logging
from datetime import datetime, timezone
from typing import Awaitable, Callable, Optional
logger = logging.getLogger(__name__)
class SessionIdleTimer:
"""
Manages idle timeout for a single session.
Provides per-session timeout detection with automatic callback firing
after inactivity. Timer can be reset on activity and cancelled on shutdown.
Attributes:
session_name: Name of the session this timer tracks
timeout_seconds: Idle timeout in seconds
on_timeout: Async callback to invoke when timeout fires
last_activity: UTC timestamp of last activity
seconds_since_activity: Float seconds since last activity
"""
def __init__(
self,
session_name: str,
timeout_seconds: int,
on_timeout: Callable[[str], Awaitable[None]]
):
"""
Initialize SessionIdleTimer.
Args:
session_name: Name of the session to track
timeout_seconds: Idle timeout in seconds
on_timeout: Async callback(session_name) to invoke when timeout fires
"""
self.session_name = session_name
self.timeout_seconds = timeout_seconds
self.on_timeout = on_timeout
self._timer_task: Optional[asyncio.Task] = None
self._last_activity = datetime.now(timezone.utc)
logger.debug(
f"SessionIdleTimer initialized: session={session_name}, "
f"timeout={timeout_seconds}s"
)
def reset(self) -> None:
"""
Reset the idle timer.
Updates last activity timestamp to now, cancels any existing timer task,
and creates a new background task that will fire the timeout callback
after timeout_seconds of inactivity.
Call this whenever activity occurs on the session.
"""
# Update last activity timestamp
self._last_activity = datetime.now(timezone.utc)
# Cancel existing timer if running
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
logger.debug(f"Cancelled existing timer for session '{self.session_name}'")
# Create new timer task
self._timer_task = asyncio.create_task(self._wait_for_timeout())
logger.debug(
f"Started idle timer for session '{self.session_name}': "
f"{self.timeout_seconds}s"
)
async def _wait_for_timeout(self) -> None:
"""
Background task that waits for timeout then fires callback.
Sleeps for timeout_seconds, then invokes on_timeout callback with
session name. Catches asyncio.CancelledError silently (timer was reset).
"""
try:
await asyncio.sleep(self.timeout_seconds)
# Timeout fired - call callback
logger.info(
f"Session '{self.session_name}' idle timeout fired "
f"({self.timeout_seconds}s)"
)
await self.on_timeout(self.session_name)
except asyncio.CancelledError:
# Timer was reset or cancelled - this is normal
logger.debug(f"Timer cancelled for session '{self.session_name}'")
def cancel(self) -> None:
"""
Cancel the idle timer.
Stops the background timer task if running. Used on session shutdown
or archive to prevent timeout callback from firing.
"""
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
logger.debug(f"Cancelled timer for session '{self.session_name}'")
@property
def seconds_since_activity(self) -> float:
"""
Get seconds since last activity.
Returns:
Float seconds elapsed since last reset() call
"""
now = datetime.now(timezone.utc)
delta = now - self._last_activity
return delta.total_seconds()
@property
def last_activity(self) -> datetime:
"""
Get timestamp of last activity.
Returns:
UTC datetime of last reset() call
"""
return self._last_activity

View file

@ -1,127 +0,0 @@
"""
Message batching with debounce for Telegram bot.
Collects rapid sequential messages and combines them into a single prompt
after a configurable debounce period of silence.
Based on research in: .planning/phases/02-telegram-integration/02-RESEARCH.md
"""
import asyncio
import logging
from typing import Callable
logger = logging.getLogger(__name__)
class MessageBatcher:
"""
Batches rapid sequential messages with debounce timer.
When messages arrive in quick succession, waits for a period of silence
(debounce_seconds) before flushing all queued messages as a single batch
via the callback function.
Example:
async def handle_batch(combined: str):
await process_message(combined)
batcher = MessageBatcher(callback=handle_batch, debounce_seconds=2.0)
await batcher.add_message("one")
await batcher.add_message("two") # Resets timer
await batcher.add_message("three") # Resets timer
# After 2s of silence, callback receives: "one\n\ntwo\n\nthree"
"""
def __init__(self, callback: Callable[[str], None], debounce_seconds: float = 2.0):
"""
Initialize MessageBatcher.
Args:
callback: Async function to call with combined message string
debounce_seconds: Seconds of silence before flushing batch
"""
self._callback = callback
self._debounce_seconds = debounce_seconds
self._queue = asyncio.Queue()
self._timer_task: asyncio.Task | None = None
self._lock = asyncio.Lock()
logger.debug(f"MessageBatcher initialized: debounce={debounce_seconds}s")
async def add_message(self, message: str) -> None:
"""
Add message to batch and reset debounce timer.
Args:
message: Message text to batch
"""
async with self._lock:
# Add message to queue
await self._queue.put(message)
logger.debug(f"Message queued (size={self._queue.qsize()}): {message[:50]}...")
# Cancel previous timer if running
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
try:
await self._timer_task
except asyncio.CancelledError:
pass
# Start new timer
self._timer_task = asyncio.create_task(self._debounce_timer())
async def _debounce_timer(self) -> None:
"""
Wait for debounce period, then flush batch.
Runs as a task that can be cancelled when new messages arrive.
"""
try:
await asyncio.sleep(self._debounce_seconds)
await self._flush_batch()
except asyncio.CancelledError:
# Timer was cancelled by new message, not an error
logger.debug("Debounce timer cancelled (new message arrived)")
async def _flush_batch(self) -> None:
"""
Combine all queued messages and call callback.
Joins messages with double newline separator.
"""
async with self._lock:
# Collect all messages from queue
messages = []
while not self._queue.empty():
messages.append(await self._queue.get())
if not messages:
return
# Combine with double newline
combined = "\n\n".join(messages)
logger.info(f"Flushing batch: {len(messages)} messages -> {len(combined)} chars")
# Call callback
try:
await self._callback(combined)
except Exception as e:
logger.error(f"Error in batch callback: {e}")
async def flush_immediately(self) -> None:
"""
Flush batch immediately without waiting for debounce timer.
Useful when switching sessions or shutting down.
"""
# Cancel timer
if self._timer_task and not self._timer_task.done():
self._timer_task.cancel()
try:
await self._timer_task
except asyncio.CancelledError:
pass
# Flush
await self._flush_batch()

View file

@ -1,9 +0,0 @@
{
"name": "Brainstorm",
"description": "Creative ideation leading to BRAINSTORM.md",
"system_prompt": "You are in brainstorming mode. Your goal is to help shape a vision and produce a BRAINSTORM.md file.\n\nGenerate ideas freely. Build on previous ideas. Explore unconventional approaches. Ask probing questions to understand the problem space.\n\nGuide the conversation toward defining:\n- **Vision**: What is this project and why does it matter?\n- **MVP features**: The absolute minimum to validate the idea\n- **Launch / v1.0 features**: What makes it complete and useful\n- **v2.0 features**: Future enhancements and nice-to-haves\n\nWhen the vision feels solid, produce BRAINSTORM.md with these sections. Use bullet lists for features. Don't worry about technical feasibility — that comes later when the planner persona takes over.\n\nThe planner persona will consume BRAINSTORM.md to produce PRD, BRD, and SoW — so make the vision clear and the feature sets well-defined.",
"settings": {
"model": "sonnet",
"max_turns": 50
}
}

View file

@ -1,9 +0,0 @@
{
"name": "Default",
"description": "All-around helpful assistant",
"system_prompt": "You are Claude, a helpful AI assistant. Be concise, practical, and direct. Help with whatever is asked — coding, writing, analysis, problem-solving, or just conversation. Adapt your tone and depth to the question.",
"settings": {
"model": "claude-sonnet-4-5-20250929",
"max_turns": 25
}
}

View file

@ -1,9 +0,0 @@
{
"name": "Homelab",
"description": "Homelab infrastructure management assistant",
"system_prompt": "You are Claude, an AI assistant helping Mikkel manage his homelab infrastructure. You have full access to the management container's tools and can SSH to other containers. Be helpful, thorough, and proactive about suggesting improvements. When making changes, explain what you're doing and why.",
"settings": {
"model": "sonnet",
"max_turns": 25
}
}

View file

@ -1,9 +0,0 @@
{
"name": "Planner",
"description": "Takes BRAINSTORM.md and produces PRD, BRD, and SoW",
"system_prompt": "You are in planning mode. Your input is a BRAINSTORM.md file containing a project vision with MVP, v1.0, and v2.0 feature sets.\n\nYour job is to drill down with detailed questions, then produce three documents in markdown:\n\n1. **PRD (Product Requirements Document)**: User stories, functional requirements, acceptance criteria, information architecture, UX flows. Organized by feature area.\n\n2. **BRD (Business Requirements Document)**: Problem statement, goals and objectives, success metrics, stakeholders, constraints, assumptions, risks, dependencies.\n\n3. **SoW (Statement of Work)**: Scope definition, deliverables, milestones, timeline estimates, out-of-scope items, technical approach overview.\n\nBefore writing, ask a ton of clarifying questions — priorities, constraints, target users, technical preferences, timeline expectations, budget considerations. Don't assume. Challenge vague requirements. Push for specifics.\n\nUse structured formats (numbered lists, tables, requirement IDs) throughout. Each document should be standalone and complete.",
"settings": {
"model": "sonnet",
"max_turns": 50
}
}

View file

@ -1,9 +0,0 @@
{
"name": "Research",
"description": "Deep investigation and analysis mode",
"system_prompt": "You are in research mode. Investigate topics thoroughly. Check documentation, source code, and configuration files. Cross-reference information. Cite your sources (file paths, URLs). Distinguish between facts and inferences. Summarize findings clearly with actionable recommendations.",
"settings": {
"model": "sonnet",
"max_turns": 30
}
}

View file

@ -1,401 +0,0 @@
"""
Session management for Claude Code Telegram bot.
This module provides the SessionManager class that handles lifecycle management
for isolated Claude Code conversation sessions. Each session is a directory
containing metadata, persona configuration, and Claude Code's .claude/ data.
Sessions enable:
- Multiple independent Claude Code conversations
- Persona-based behavior customization
- Context isolation per conversation thread
- Session switching without losing state
Directory structure:
~/telegram/sessions/<name>/
metadata.json # Session state (status, timestamps, PID)
persona.json # Persona configuration (copied from library)
.claude/ # Auto-created by Claude Code CLI
"""
import json
import logging
import re
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
logger = logging.getLogger(__name__)
class SessionManager:
"""Manages Claude Code session lifecycle and persona library."""
SESSION_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9_-]+$')
MAX_NAME_LENGTH = 50
def __init__(self, base_dir: Optional[Path] = None):
"""
Initialize SessionManager.
Args:
base_dir: Base directory for sessions. Defaults to ~/homelab/telegram/sessions/
"""
if base_dir is None:
# Use homelab directory (where bot.py lives)
homelab_dir = Path.home() / "homelab"
base_dir = homelab_dir / "telegram" / "sessions"
self.base_dir = Path(base_dir)
# Personas are always in homelab/telegram/personas
self.personas_dir = Path.home() / "homelab" / "telegram" / "personas"
self.active_session: Optional[str] = None
# Create directories if they don't exist
self.base_dir.mkdir(parents=True, exist_ok=True)
self.personas_dir.mkdir(parents=True, exist_ok=True)
# Load active session from disk if any
self._load_active_session()
logger.info(f"SessionManager initialized: base_dir={self.base_dir}")
def _load_active_session(self) -> None:
"""Load active session name from existing sessions."""
try:
sessions = self.list_sessions()
for session in sessions:
if session.get('status') == 'active':
self.active_session = session['name']
logger.debug(f"Loaded active session: {self.active_session}")
return
except Exception as e:
logger.warning(f"Failed to load active session: {e}")
def _validate_session_name(self, name: str) -> None:
"""
Validate session name.
Args:
name: Session name to validate
Raises:
ValueError: If name is invalid
"""
if not name:
raise ValueError("Session name cannot be empty")
if len(name) > self.MAX_NAME_LENGTH:
raise ValueError(
f"Session name too long (max {self.MAX_NAME_LENGTH} chars): {name}"
)
if not self.SESSION_NAME_PATTERN.match(name):
raise ValueError(
f"Invalid session name '{name}': only alphanumeric, hyphens, "
"and underscores allowed"
)
def _read_metadata(self, name: str) -> dict:
"""Read session metadata from disk."""
metadata_path = self.base_dir / name / "metadata.json"
if not metadata_path.exists():
raise ValueError(f"Session '{name}' does not exist")
with metadata_path.open('r') as f:
return json.load(f)
def _write_metadata(self, name: str, metadata: dict) -> None:
"""Write session metadata to disk."""
metadata_path = self.base_dir / name / "metadata.json"
with metadata_path.open('w') as f:
json.dump(metadata, f, indent=2)
logger.debug(f"Wrote metadata for session '{name}'")
def create_session(self, name: str, persona: Optional[str] = None) -> Path:
"""
Create a new session.
Args:
name: Session name (alphanumeric, hyphens, underscores only)
persona: Persona name from library (defaults to 'default')
Returns:
Path to created session directory
Raises:
ValueError: If session already exists or name is invalid
"""
self._validate_session_name(name)
session_dir = self.base_dir / name
if session_dir.exists():
raise ValueError(f"Session '{name}' already exists")
# Use default persona if not specified
if persona is None:
persona = 'default'
# Load persona from library
persona_source = self.personas_dir / f"{persona}.json"
if not persona_source.exists():
raise FileNotFoundError(
f"Persona '{persona}' not found in library: {persona_source}"
)
# Create session directory
session_dir.mkdir(parents=True, exist_ok=False)
logger.info(f"Created session directory: {session_dir}")
# Copy persona to session
persona_dest = session_dir / "persona.json"
persona_dest.write_text(persona_source.read_text())
logger.debug(f"Copied persona '{persona}' to session")
# Create metadata
now = datetime.now(timezone.utc).isoformat()
metadata = {
"name": name,
"created": now,
"last_active": now,
"persona": persona,
"pid": None,
"status": "idle",
"idle_timeout": 600
}
self._write_metadata(name, metadata)
logger.info(f"Created session '{name}' with persona '{persona}'")
return session_dir
def switch_session(self, name: str) -> Optional[str]:
"""
Switch to a different session.
Args:
name: Session name to switch to
Returns:
Previous active session name, or None if no previous session
Raises:
ValueError: If session does not exist
"""
self._validate_session_name(name)
if not self.session_exists(name):
raise ValueError(f"Session '{name}' does not exist")
# No-op if already active
if self.active_session == name:
logger.debug(f"Session '{name}' already active")
return None
previous_session = self.active_session
# Mark previous session as suspended
if previous_session:
try:
prev_meta = self._read_metadata(previous_session)
if prev_meta['status'] == 'active':
prev_meta['status'] = 'suspended'
self._write_metadata(previous_session, prev_meta)
logger.debug(f"Suspended previous session: {previous_session}")
except Exception as e:
logger.warning(f"Failed to suspend previous session: {e}")
# Mark new session as active
new_meta = self._read_metadata(name)
new_meta['status'] = 'active'
new_meta['last_active'] = datetime.now(timezone.utc).isoformat()
self._write_metadata(name, new_meta)
self.active_session = name
logger.info(f"Switched to session '{name}'")
return previous_session
def get_session(self, name: str) -> dict:
"""
Get session metadata.
Args:
name: Session name
Returns:
Session metadata dict
Raises:
ValueError: If session does not exist
"""
return self._read_metadata(name)
def list_sessions(self) -> list[dict]:
"""
List all sessions.
Returns:
List of session metadata dicts, sorted by last_active (most recent first)
"""
sessions = []
if not self.base_dir.exists():
return sessions
for session_dir in self.base_dir.iterdir():
if not session_dir.is_dir():
continue
metadata_path = session_dir / "metadata.json"
if not metadata_path.exists():
logger.warning(f"Session directory missing metadata: {session_dir}")
continue
try:
with metadata_path.open('r') as f:
metadata = json.load(f)
sessions.append(metadata)
except Exception as e:
logger.error(f"Failed to read metadata for {session_dir}: {e}")
continue
# Sort by last_active, most recent first
sessions.sort(
key=lambda s: s.get('last_active', ''),
reverse=True
)
return sessions
def get_active_session(self) -> Optional[str]:
"""
Get active session name.
Returns:
Active session name, or None if no active session
"""
return self.active_session
def update_session(self, name: str, **kwargs) -> None:
"""
Update session metadata fields.
Args:
name: Session name
**kwargs: Fields to update
Raises:
ValueError: If session does not exist
"""
metadata = self._read_metadata(name)
metadata.update(kwargs)
self._write_metadata(name, metadata)
logger.debug(f"Updated session '{name}': {kwargs}")
def get_session_timeout(self, name: str) -> int:
"""
Get session idle timeout in seconds.
Args:
name: Session name
Returns:
Idle timeout in seconds (defaults to 600s if not set)
Raises:
ValueError: If session does not exist
"""
metadata = self._read_metadata(name)
return metadata.get('idle_timeout', 600)
def session_exists(self, name: str) -> bool:
"""
Check if session exists.
Args:
name: Session name
Returns:
True if session exists, False otherwise
"""
session_dir = self.base_dir / name
return session_dir.exists() and (session_dir / "metadata.json").exists()
def get_session_dir(self, name: str) -> Path:
"""
Get session directory path.
Args:
name: Session name
Returns:
Path to session directory
"""
return self.base_dir / name
def archive_session(self, name: str) -> Path:
"""
Archive a session by compressing it with tar+pigz and removing the original.
Args:
name: Session name to archive
Returns:
Path to the archive file
Raises:
ValueError: If session does not exist
"""
if not self.session_exists(name):
raise ValueError(f"Session '{name}' does not exist")
# Clear active session if archiving the active one
if self.active_session == name:
self.active_session = None
# Create archive directory
archive_dir = self.base_dir.parent / "sessions_archive"
archive_dir.mkdir(parents=True, exist_ok=True)
# Build archive filename with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
archive_name = f"{name}_{timestamp}.tar.gz"
archive_path = archive_dir / archive_name
# Compress with tar + pigz
session_dir = self.base_dir / name
subprocess.run(
["tar", "--use-compress-program=pigz", "-cf", str(archive_path),
"-C", str(self.base_dir), name],
check=True,
)
# Remove original session directory
shutil.rmtree(session_dir)
logger.info(f"Archived session '{name}' to {archive_path}")
return archive_path
def load_persona(self, name: str) -> dict:
"""
Load persona from library.
Args:
name: Persona name
Returns:
Persona configuration dict
Raises:
FileNotFoundError: If persona does not exist
"""
persona_path = self.personas_dir / f"{name}.json"
if not persona_path.exists():
raise FileNotFoundError(
f"Persona '{name}' not found in library: {persona_path}"
)
with persona_path.open('r') as f:
return json.load(f)

View file

@ -1,166 +0,0 @@
"""
Telegram message formatting and UX utilities.
Provides smart message splitting, MarkdownV2 escaping, and typing indicator
management for the Telegram Claude Code bridge.
Based on research in: .planning/phases/02-telegram-integration/02-RESEARCH.md
"""
import asyncio
import logging
import re
from telegram.constants import ChatAction
logger = logging.getLogger(__name__)
TELEGRAM_MAX_LENGTH = 4096
SAFE_LENGTH = 4000 # Leave room for MarkdownV2 escape character expansion
def split_message_smart(text: str, max_length: int = SAFE_LENGTH) -> list[str]:
"""
Split long message at smart boundaries, respecting MarkdownV2 code blocks.
Never splits inside triple-backtick code blocks. Prefers paragraph breaks
(\\n\\n), then line breaks (\\n), then hard character split as last resort.
Uses 4000 as default max (not 4096) to leave room for MarkdownV2 escape
character expansion.
Args:
text: Message text to split
max_length: Maximum length per chunk (default: 4000)
Returns:
List of message chunks, each <= max_length
Example:
>>> split_message_smart("a" * 5000)
['aaa...', 'aaa...'] # Two chunks, each <= 4000 chars
"""
if len(text) <= max_length:
return [text]
chunks = []
current_chunk = ""
in_code_block = False
lines = text.split('\n')
for line in lines:
# Track code block state
if line.strip().startswith('```'):
in_code_block = not in_code_block
# Check if adding this line exceeds limit
potential_chunk = current_chunk + ('\n' if current_chunk else '') + line
if len(potential_chunk) > max_length:
# Would exceed limit
if in_code_block:
# Inside code block - must include whole block
# (Telegram will handle overflow gracefully or we truncate)
current_chunk = potential_chunk
else:
# Can split here
if current_chunk:
chunks.append(current_chunk)
current_chunk = line
else:
current_chunk = potential_chunk
if current_chunk:
chunks.append(current_chunk)
return chunks
def escape_markdown_v2(text: str) -> str:
"""
Escape MarkdownV2 special characters outside of code blocks.
Escapes 17 special characters: _ * [ ] ( ) ~ ` > # + - = | { } . !
BUT does NOT escape content inside code blocks (triple backticks or single backticks).
Strategy: Split text by code regions, escape only non-code regions, rejoin.
Args:
text: Text to escape
Returns:
Text with MarkdownV2 special characters escaped outside code blocks
Example:
>>> escape_markdown_v2("hello_world")
'hello\\_world'
>>> escape_markdown_v2("`hello_world`")
'`hello_world`' # Inside backticks, not escaped
"""
# Characters that need escaping in MarkdownV2
escape_chars = r'_*[]()~`>#+-=|{}.!'
# Pattern to match code blocks (triple backticks) and inline code (single backticks)
# Match triple backticks first (```...```), then single backticks (`...`)
code_pattern = re.compile(r'(```[\s\S]*?```|`[^`]*?`)', re.MULTILINE)
# Split text into code and non-code segments
parts = []
last_end = 0
for match in code_pattern.finditer(text):
# Add non-code segment (escaped)
non_code = text[last_end:match.start()]
if non_code:
# Escape special characters in non-code text
escaped = re.sub(f'([{re.escape(escape_chars)}])', r'\\\1', non_code)
parts.append(escaped)
# Add code segment (not escaped)
parts.append(match.group(0))
last_end = match.end()
# Add remaining non-code segment
if last_end < len(text):
non_code = text[last_end:]
escaped = re.sub(f'([{re.escape(escape_chars)}])', r'\\\1', non_code)
parts.append(escaped)
return ''.join(parts)
async def typing_indicator_loop(bot, chat_id: int, stop_event: asyncio.Event):
"""
Maintain typing indicator until stop_event is set.
Sends ChatAction.TYPING every 4 seconds to keep indicator alive for
operations longer than 5 seconds (Telegram expires typing after 5s).
Uses asyncio.wait_for pattern with timeout to re-send every 4 seconds
until stop_event is set.
Args:
bot: Telegram bot instance
chat_id: Chat ID to send typing indicator to
stop_event: asyncio.Event to signal when to stop
Example:
>>> stop_typing = asyncio.Event()
>>> task = asyncio.create_task(typing_indicator_loop(bot, chat_id, stop_typing))
>>> # ... long operation ...
>>> stop_typing.set()
>>> await task
"""
while not stop_event.is_set():
try:
result = await bot.send_chat_action(chat_id=chat_id, action=ChatAction.TYPING)
logger.info(f"Typing indicator sent to chat_id={chat_id}, result={result}")
except Exception as e:
logger.warning(f"Failed to send typing indicator: {e}")
# Wait 4s or until stop_event (whichever comes first)
try:
await asyncio.wait_for(stop_event.wait(), timeout=4.0)
break # stop_event was set
except asyncio.TimeoutError:
continue # Timeout, re-send typing indicator